AI4Bharat
Planning

Goals for 2022-23

G1: Change the Indian language tech landscape through open data contributions such as DMU, NLTM <- Bhashini/ULCA/HuggingFace
G1.1:
G2: Become the de-facto provider of the technology stack for translation <- National Translation Mission
G2.1:

G3: Become the de-facto provider of the technology stack for speech transcription in Indian languages <- BIRD, Prasar Bharati, OTT players, KCC
G4: Provide the first production-ready speech-to-speech translation engine <- PM Office (?)
G5: Digitize 1,000,000 books in Indian languages <- archive.org (?)

Tasks

Deadline | Particulars
March 2022 | 100 hours of data per language collected with DesiCrew on Karya
April 2022 | Analysis of mined data from NewsOnAir
April 2022 | Insert 100 hours of DC data into ULCA
April 2022 | Models with DC data, with cleaned-up text and inverse text normalization
April 2022 | Based on DC data and NewsOnAir mining
May 2022 | Collect data with DesiCrew in x languages
May 2022 | Partnership with Prasar Bharati
May 2022 | SUPERB benchmark
May 2022 | Interface to collect data on Shoonya
June 2022 | Chitralekha UI in Shoonya
June 2022 | Make IndicTrans real-time on device
July 2022 | BIRD project kickoff
July 2022 | Dataset types for language ID, speaker ID, and other SUPERB tasks
October 2022 | Specialize models for domains
December 2022 | Voice-based text input for Android
April 2022 | Samanantar v2
April 2022 | CIIL / NTM
May 2022 | IndicTrans based on Samanantar v2
June 2022 | Integrate NMT toolflow in Shoonya
March 2022 | Data mined from IndicCorp, Wikipedia, etc.
March 2022 | Data collected on Karya with DesiCrew
April 2022 | Submit paper to TACL
April 2022 | Release on HuggingFace
May 2022 | Integrate XLit models in Shoonya
May 2022 | Both training data and Aksharantar benchmark
May 2022 | Create model type and upload
December 2022 | Keyboard app for Android
April 2022 | IndicCorp v2 in more languages
May 2022 | 100,000 source native sentences in all languages
June 2022 | Dialog data collection
April 2022 | IndicBERT v2
May 2022 | NER
June 2022 | Profane speech keywords across languages
April 2022 | Interface for document-based extraction in Shoonya
May 2022 | Screen text data
June 2022 | OCR bounding box interface in Shoonya
July 2022 | Document dataset generation
July 2022 | Shoonya backend for uploading the data
December 2022 | Data in all languages
January 2023 | Upload all data to ULCA
March 2023 | Models for all languages
July 2023 | Speech-to-speech translation
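One of the April 2022 tasks above involves inverse text normalization (ITN) of ASR output. As a hedged illustration of what ITN does, here is a minimal rule-based sketch; the number-word table and function are hypothetical, not the project's actual pipeline:

```python
import re

# Hypothetical spoken-to-written lookup table, for illustration only.
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10",
}

def inverse_text_normalize(text: str) -> str:
    """Convert spoken-form number words in an ASR transcript back to
    written form, e.g. 'meet at five' -> 'meet at 5'."""
    pattern = r"\b(" + "|".join(NUMBER_WORDS) + r")\b"
    return re.sub(pattern, lambda m: NUMBER_WORDS[m.group(0).lower()],
                  text, flags=re.IGNORECASE)

print(inverse_text_normalize("the train leaves at five"))
# -> the train leaves at 5
```

Production ITN systems typically use weighted finite-state transducers rather than regex tables, but the input/output contract is the same.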

Research Areas
Multilingual Modeling
How do we go beyond current transfer architectures and improve transfer across languages?
How to reduce language divergences in the multiNMT architecture?
Better multilingual transfer for generation tasks like one-to-many translation
What are the language-specific biases in current multilingual modeling approaches, and how can they be reduced to enable better multilingual transfer?
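For context on the one-to-many translation setting mentioned above, the common multiNMT approach is the target-language-token trick: prepend a token naming the target language so a single model serves many translation directions. A minimal sketch (the function name and `<2xx>` tag format are illustrative assumptions, not AI4Bharat code):

```python
def tag_for_multilingual_nmt(src_sentence: str, tgt_lang: str) -> str:
    """Prepend a target-language token so one multilingual NMT model
    can be trained on, and decode into, many target languages."""
    return f"<2{tgt_lang}> {src_sentence}"

# The same source sentence becomes two distinct training inputs,
# one per desired target language.
for tgt in ("hi", "ta"):
    print(tag_for_multilingual_nmt("The weather is nice today.", tgt))
```

The tag is the only architectural change needed; the encoder-decoder itself stays language-agnostic, which is also where the language-divergence questions above arise.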
Self-Supervised Learning
Pre-trained models have driven progress in AI. Some important directions to explore:
Additional pre-training loss functions for different NLG applications: alternative pre-training objectives, data augmentation, connection to information theory, non-contrastive objective functions for faster training. Look at work in the vision domain.
Combining supervised and unsupervised data. In this context, understand when and how pre-training helps.
Limited fine-tuning to optimize computational budget
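As a concrete anchor for the pre-training objectives discussed above, here is a minimal BERT-style masked-token corruption sketch. The names and the 15% masking rate are illustrative assumptions, not a prescription:

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """BERT-style corruption: replace a random fraction of tokens with a
    mask token. The self-supervised loss is computed only at positions
    where labels is not None, i.e. where the model must reconstruct
    the original token."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            labels.append(tok)      # model must reconstruct this token
        else:
            corrupted.append(tok)
            labels.append(None)     # no pre-training loss here
    return corrupted, labels
```

Alternative objectives (denoising, contrastive, non-contrastive) change what `corrupted` and the loss look like, but keep this same self-supervised shape: labels are manufactured from raw text, with no annotation needed.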
Training with Noisy data
Across projects, we rely on mined data to train our AI models. The noisy nature of these datasets is a fact of life. How do we train our models to perform better in the face of such noisy training data?
Repairing noisy training data
Training objectives that take noise into account
Knowledge distillation
Building Indic-language-specific semantic models (like LaBSE) as well as divergent semantic models
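A common first line of defense against noisy mined bitext sits upstream of the training objective: filter pairs using a cross-lingual similarity score (e.g. from a LaBSE-like encoder) combined with a length-ratio heuristic. A hypothetical sketch, where the thresholds and function names are assumptions for illustration:

```python
def clean_mined_corpus(pairs, scores, sim_threshold=0.8, max_len_ratio=2.0):
    """Drop mined sentence pairs that a LaBSE-like similarity score flags
    as likely misaligned, or whose token-length ratio is implausible.
    `pairs` is a list of (src, tgt) strings; `scores` aligns with it."""
    kept = []
    for (src, tgt), score in zip(pairs, scores):
        n_src, n_tgt = len(src.split()), len(tgt.split())
        ratio = max(n_src, n_tgt) / max(1, min(n_src, n_tgt))
        if score >= sim_threshold and ratio <= max_len_ratio:
            kept.append((src, tgt))
    return kept
```

Filtering trades corpus size for precision; the noise-aware training objectives listed above are the complementary approach when dropping data is too costly.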
AI for extremely low-resource languages
Modern AI relies on large amounts of data (raw or annotated). For low-resource languages beyond the top-12 Indian languages, novel methods have to be developed.
Pre-training for low-resource languages
Utilizing dictionaries and other lexical resources
Better use of language relatedness
Better zero-shot transfer
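To make "utilizing dictionaries and other lexical resources" concrete: one cheap technique is dictionary-based code-switched augmentation, where words are swapped with translations from a bilingual lexicon to manufacture extra training text for a low-resource language. A sketch with a toy, invented dictionary:

```python
def dictionary_augment(tokens, bilingual_dict):
    """Replace any token found in a bilingual dictionary with its
    translation, producing code-switched augmentation data for
    extremely low-resource languages."""
    return [bilingual_dict.get(tok, tok) for tok in tokens]

# Toy English -> romanized Hindi lexicon, purely illustrative.
toy_dict = {"water": "paani", "good": "accha"}
print(dictionary_augment("water is good".split(), toy_dict))
# -> ['paani', 'is', 'accha']
```

The appeal is that a few thousand dictionary entries are far easier to obtain for an extremely low-resource language than parallel or even monolingual corpora.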
Translation between Indian languages
While the major focus has been on translation between English and Indian languages, translation between Indian languages is an important need in itself, and it has the potential to improve English ↔ IL translation for low-resource languages. The area has been under-investigated, and some directions to explore include:
Utilizing similarity between Indian languages
Models combining all translation directions
Multi-source translation systems
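A standard baseline here, which reuses the English-centric systems that already exist, is pivot translation: Indian language → English → Indian language. A minimal sketch; `translate` stands in for any direction-parameterized MT backend, and the stub below only records call order, it does not translate:

```python
def pivot_translate(sentence, src_lang, tgt_lang, translate, pivot="en"):
    """Translate src -> tgt through a pivot language using any
    translate(sentence, src, tgt) backend. A baseline for Indic-Indic
    directions that lack direct parallel data."""
    intermediate = translate(sentence, src_lang, pivot)
    return translate(intermediate, pivot, tgt_lang)

# Stub backend that just tags each hop, to show the call order.
def stub_translate(sentence, src, tgt):
    return f"[{src}->{tgt}] {sentence}"

print(pivot_translate("vanakkam", "ta", "hi", stub_translate))
# -> [en->hi] [ta->en] vanakkam
```

Pivoting compounds errors across the two hops, which is exactly why the directions above (direct Indic-Indic models, multi-source systems, exploiting language relatedness) are worth investigating.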
Efficient Models
