Planning

Goals for 2022-23

G1: Change the Indian language tech landscape through open data contributions as the DMU of NLTM <- Bhashini/ULCA/HuggingFace
G1.1:
G2: Become the de facto provider of the technology stack for translation <- National Translation Mission
G2.1:

G3: Become the de facto provider of the technology stack for speech transcription in Indian languages <- BIRD, Prasar Bharati, OTT players, KCC
G4: Provide the first production-ready speech-to-speech translation engine <- PM Office(?)
G5: Digitize 1,000,000 books in Indian languages <- archive.org (?)

Tasks
Track | Task type | Task sub-type | Deadline | Particulars | Lead | Status | Details | Contributes to
ASR | Data | Collection | March 2022 | 100 hours of data per language collected with DesiCrew on Karya | | Done | |
ASR | Data | Mining | April 2022 | Analysis of mined data from NewsOnAir | | Ongoing | |
ASR | Data | ULCA | April 2022 | Insert 100 hours of DC data into ULCA | | Planned | |
ASR | Models | Benchmark | April 2022 | Models with DC data, with cleaned-up text and inverse text normalization | | Ongoing | |
ASR | Dissemination | Paper | April 2022 | Based on DC data and NewsOnAir mining | | Ongoing | |
ASR | Data | Collection | May 2022 | Collect data with DesiCrew in x languages | | Planned | |
ASR | Data | Partner | May 2022 | Partnership with Prasar Bharati | | Ongoing | |
ASR | Data | Release | May 2022 | SUPERB benchmark | | Planned | |
ASR | Tools | Shoonya | May 2022 | Interface to collect data on Shoonya | | Planned | |
ASR | Tools | Shoonya | June 2022 | Chitralekha UI in Shoonya | | Planned | |
ASR | Models | Edge models | June 2022 | Make IndicTrans real-time on device | | Ongoing | |
ASR | Tools | Partner | July 2022 | BIRD project kickoff | | Planned | |
ASR | Tools | ULCA | July 2022 | Dataset types for language ID, speaker ID, and other SUPERB tasks | | Planned | |
ASR | Models | Research | October 2022 | Specialize models for domains | | Planned | |
ASR | App | Release | December 2022 | Voice-based text input for Android | | Planned | |
NMT | Data | Mining | April 2022 | Samanantar v2 | | Ongoing | |
NMT | Data | Partner | April 2022 | CIIL / NTM | | Ongoing | |
NMT | Models | Release | May 2022 | IndicTrans based on Samanantar v2 | | Planned | |
NMT | Tools | Shoonya | June 2022 | Integrate NMT toolflow in Shoonya | | Planned | |
XLit | Data | Mining | March 2022 | Data mined from IndicCorp, Wikipedia, etc. | | Done | |
XLit | Data | Collection | March 2022 | Data collected on Karya with DesiCrew | | Done | |
XLit | Dissemination | Paper | April 2022 | Submit paper to TACL | | Ongoing | |
XLit | Models | Release | April 2022 | Release on HuggingFace | | Ongoing | |
XLit | Tools | Shoonya | May 2022 | Integrate XLit models in Shoonya | | Planned | |
XLit | Data | ULCA | May 2022 | Both training data and Akshantara benchmark | | Planned | |
XLit | Models | ULCA | May 2022 | Create model type and upload | | Planned | |
XLit | App | Release | December 2022 | Keyboard app for Android | | Planned | |
Corpus | Data | Curation | April 2022 | IndicCorp v2 in more languages | | Ongoing | |
Corpus | Data | Curation | May 2022 | 100,000 source-native sentences in all languages | | Ongoing | |
Dialog | Data | Collection | June 2022 | Dialog data collection | | Ongoing | |
NLU | Data | Curation | April 2022 | IndicBERT v2 | | Ongoing | |
NLU | Data | Mining | May 2022 | NER | | Ongoing | |
NLU | Data | Collection | June 2022 | Profane speech keywords across languages | | Planned | |
OCR | Tools | Shoonya | April 2022 | Interface for document-based extraction in Shoonya | | Ongoing | |
OCR | Data | Mining | May 2022 | Screen text data | | Ongoing | |
OCR | Tools | Shoonya | June 2022 | OCR bounding box interface in Shoonya | | Planned | |
OCR | Data | Mining | July 2022 | Document dataset generation | | Ongoing | |
TTS | Tools | Shoonya | July 2022 | Shoonya backend for uploading the data | | Planned | |
TTS | Data | Collection | December 2022 | Data in all languages | | Planned | |
TTS | Data | ULCA | January 2023 | Upload all data to ULCA | | Planned | |
TTS | Models | Release | March 2023 | Models for all languages | | Planned | |
TTS | Models | Research | July 2023 | Speech-to-speech translation | | Planned | |
NLG | Models | | | | | | |
NLG | Data | | | | | | |

Research Areas
Multilingual Modeling (Modality: Text, ASR, TTS)
- How do we go beyond current transfer architectures and improve transfer across languages?
- How do we reduce language divergences in the multiNMT architecture? (See the sketch after this list.)
- Better multilingual transfer for generation tasks like one-to-many translation.
- What are the language-specific biases in current multilingual modeling approaches, and how can they be reduced to enable better multilingual transfer?
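A minimal sketch of the tagged many-to-many NMT setup these questions refer to, assuming a single shared model steered by language tags; this is an illustration only, not the IndicTrans implementation, and the helper and tag format are hypothetical:

```python
# Minimal sketch: steering one shared multilingual NMT model across directions by
# prepending language tags. The tag format and helper are hypothetical, shown only
# to make the one-to-many transfer and language-divergence questions concrete.

def tag_example(src_text: str, src_lang: str, tgt_lang: str) -> str:
    """Prepend source/target language tags so a single encoder-decoder serves all directions."""
    return f"__{src_lang}__ __{tgt_lang}__ {src_text}"

pairs = [
    ("I am going home", "en", "hi"),
    ("I am going home", "en", "ta"),  # one-to-many: same source, different target tag
]
for text, src, tgt in pairs:
    print(tag_example(text, src, tgt))
```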
Self-Supervised Learning
Pre-trained models have driven progress in AI. Some important directions to explore:
- Additional pre-training loss functions for different NLG applications: alternative pre-training objectives, data augmentation, connections to information theory, and non-contrastive objective functions for faster training. Look at work in the vision domain. (A baseline objective is sketched below for reference.)
- Combining supervised and unsupervised data; in this context, understand when and how pre-training helps.
- Limited fine-tuning to optimize the computational budget.
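For reference, a minimal sketch of the standard masked language modeling objective against which the alternative and non-contrastive objectives above would be compared; the model, sizes, and mask id are toy placeholders, not IndicBERT:

```python
# Toy masked language modeling objective: mask ~15% of tokens and predict them.
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 64
embed = nn.Embedding(vocab_size, hidden)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True), num_layers=2
)
lm_head = nn.Linear(hidden, vocab_size)

tokens = torch.randint(0, vocab_size, (8, 32))   # a toy batch of token ids
mask = torch.rand(tokens.shape) < 0.15           # choose ~15% of positions to mask
inputs = tokens.masked_fill(mask, 0)             # 0 stands in for the [MASK] id here

logits = lm_head(encoder(embed(inputs)))
loss = nn.functional.cross_entropy(
    logits[mask], tokens[mask]                   # loss only on the masked positions
)
loss.backward()
```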
Training with Noisy data
Across projects, we are relying on mined data to train our AI models, and the noisy nature of these datasets is a fact of life. How do we train our models to perform better in the face of such noisy training data?
- Repairing noisy training data.
- Training objectives that take noise into account (one possible instantiation is sketched below).
- Knowledge distillation.
- Building Indic language-specific semantic models (like LaBSE) as well as divergent semantic models.
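One possible instantiation of a noise-aware objective, an assumption rather than a committed recipe: compute per-example losses and drop the highest-loss fraction of each batch, on the heuristic that mined pairs with extreme loss are likely misaligned. The function below is hypothetical.

```python
# Sketch of a truncated (noise-aware) loss: ignore the worst-loss examples per batch.
import torch
import torch.nn.functional as F

def truncated_loss(logits: torch.Tensor, targets: torch.Tensor, drop_frac: float = 0.2) -> torch.Tensor:
    """Cross-entropy averaged over the (1 - drop_frac) lowest-loss examples in the batch."""
    per_example = F.cross_entropy(logits, targets, reduction="none")
    keep = max(1, int(per_example.numel() * (1.0 - drop_frac)))
    kept, _ = torch.topk(per_example, keep, largest=False)  # keep the smallest losses
    return kept.mean()

# Toy usage: 16 examples, 10 classes
logits = torch.randn(16, 10, requires_grad=True)
targets = torch.randint(0, 10, (16,))
loss = truncated_loss(logits, targets)
loss.backward()
```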
AI for extremely low-resource languages
Modern AI relies on large amounts of data (raw or annotated). For low-resource languages beyond the top 12 Indian languages, novel methods have to be developed.
- Pre-training for low-resource languages.
- Utilizing dictionaries and other lexical resources (a toy example follows this list).
- Better use of language relatedness.
- Better zero-shot transfer.
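A toy sketch of one way dictionaries could be used for such languages: substituting words in high-resource sentences with dictionary translations to inject lexical signal for the new language. The lexicon entries, function, and substitution probability are hypothetical placeholders, not a committed design.

```python
# Dictionary-based augmentation sketch: randomly swap known words for their
# dictionary translations (placeholder "_L2" entries stand in for real translations).
import random

lexicon = {"house": "house_L2", "water": "water_L2"}  # hypothetical toy lexicon

def dictionary_augment(sentence: str, lexicon: dict, p: float = 0.5) -> str:
    """Randomly replace known words with their dictionary translations."""
    words = sentence.split()
    out = [lexicon[w] if w in lexicon and random.random() < p else w for w in words]
    return " ".join(out)

print(dictionary_augment("the house is near the water", lexicon))
```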
Translation between Indian languages
While the major focus has been on translation between English and Indian languages, direct translation between Indian languages is an important need in itself and also has the potential to improve English ←→ IL translation for low-resource languages. The area has been under-investigated; some directions to explore include:
- Utilizing similarity between Indian languages.
- Models combining all translation directions.
- Multi-source translation systems.
Efficient Models