Planning

Goals for 2022-23

G1: Change the Indian language tech landscape through open data contributions as the DMU of NLTM <- Bhashini/ULCA/HuggingFace
G1.1:
G2: Become the de facto provider of the technology stack for translation <- National Translation Mission
G2.1:

G3: Become the de facto provider of the technology stack for speech transcription in Indian languages <- BIRD, Prasar Bharati, OTT players, KCC
G4: Provide the first production-ready speech-to-speech translation engine <- PM Office(?)
G5: Digitize 1,000,000 books in Indian languages <- archive.org (?)

Tasks
Track | Task type | Task sub-type | Deadline | Particulars | Lead | Status | Details | Contributes to
ASR | Data | Collection | March 2022 | 100 hours of data per language collected with DesiCrew on Karya | | Done | |
ASR | Data | Mining | April 2022 | Analysis of mined data from NewsOnAir | | Ongoing | |
ASR | Data | ULCA | April 2022 | Insert 100 hours of DC data into ULCA | | Planned | |
ASR | Models | Benchmark | April 2022 | Models with DC data, with cleaned-up text and inverse text normalization | | Ongoing | |
ASR | Dissemination | Paper | April 2022 | Based on DC data and NewsOnAir mining | | Ongoing | |
ASR | Data | Collection | May 2022 | Collect data with DesiCrew in x languages | | Planned | |
ASR | Data | Partner | May 2022 | Partnership with Prasar Bharati | | Ongoing | |
ASR | Data | Release | May 2022 | SUPERB benchmark | | Planned | |
ASR | Tools | Shoonya | May 2022 | Interface to collect data on Shoonya | | Planned | |
ASR | Tools | Shoonya | June 2022 | Chitralekha UI in Shoonya | | Planned | |
ASR | Models | Edge models | June 2022 | Make IndicTrans real-time on device | | Ongoing | |
ASR | Tools | Partner | July 2022 | BIRD project kickoff | | Planned | |
ASR | Tools | ULCA | July 2022 | Dataset types for language ID, speaker ID, and other SUPERB tasks | | Planned | |
ASR | Models | Research | October 2022 | Specialize models for domains | | Planned | |
ASR | App | Release | December 2022 | Voice-based text input for Android | | Planned | |
NMT | Data | Mining | April 2022 | Samanantar v2 | | Ongoing | |
NMT | Data | Partner | April 2022 | CIIL / NTM | | Ongoing | |
NMT | Models | Release | May 2022 | IndicTrans based on Samanantar v2 | | Planned | |
NMT | Tools | Shoonya | June 2022 | Integrate NMT toolflow in Shoonya | | Planned | |
XLit | Data | Mining | March 2022 | Data mined from IndicCorp, Wikipedia, etc. | | Done | |
XLit | Data | Collection | March 2022 | Data collected on Karya with DesiCrew | | Done | |
XLit | Dissemination | Paper | April 2022 | Submit paper to TACL | | Ongoing | |
XLit | Models | Release | April 2022 | Release on HuggingFace | | Ongoing | |
XLit | Tools | Shoonya | May 2022 | Integrate XLit models in Shoonya | | Planned | |
XLit | Data | ULCA | May 2022 | Both training data and Akshantara benchmark | | Planned | |
XLit | Models | ULCA | May 2022 | Create model type and upload | | Planned | |
XLit | App | Release | December 2022 | Keyboard app for Android | | Planned | |
Corpus | Data | Curation | April 2022 | IndicCorp v2 in more languages | | Ongoing | |
Corpus | Data | Curation | May 2022 | 100,000 source-native sentences in all languages | | Ongoing | |
Dialog | Data | Collection | June 2022 | Dialog data collection | | Ongoing | |
NLU | Data | Curation | April 2022 | IndicBERT v2 | | Ongoing | |
NLU | Data | Mining | May 2022 | NER | | Ongoing | |
NLU | Data | Collection | June 2022 | Profane speech keywords across languages | | Planned | |
OCR | Tools | Shoonya | April 2022 | Interface for document-based extraction in Shoonya | | Ongoing | |
OCR | Data | Mining | May 2022 | Screen text data | | Ongoing | |
OCR | Tools | Shoonya | June 2022 | OCR bounding box interface in Shoonya | | Planned | |
OCR | Data | Mining | July 2022 | Document dataset generation | | Ongoing | |
TTS | Tools | Shoonya | July 2022 | Shoonya backend for uploading the data | | Planned | |
TTS | Data | Collection | December 2022 | Data in all languages | | Planned | |
TTS | Data | ULCA | January 2023 | Upload all data to ULCA | | Planned | |
TTS | Models | Release | March 2023 | Models for all languages | | Planned | |
TTS | Models | Research | July 2023 | Speech-to-speech translation | | Planned | |
NLG | Models | | | | | | |
NLG | Data | | | | | | |

Research Areas
Multilingual Modeling (Modality: Text, ASR, TTS)
- How do we go beyond current transfer architectures and improve transfer across languages?
- How do we reduce language divergences in the multiNMT architecture? (See the sketch after this list.)
- Better multilingual transfer for generation tasks like one-to-many translation.
- What are the language-specific biases in current multilingual modeling approaches, and how can they be reduced to enable better multilingual transfer?
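A minimal sketch of the tagged many-to-many NMT setup these questions refer to, assuming a single shared model steered by language tags; this is an illustration only, not the IndicTrans implementation, and the helper and tag format are hypothetical:

```python
# Minimal sketch: steering one shared multilingual NMT model across directions by
# prepending language tags. The tag format and helper are hypothetical, shown only
# to make the one-to-many transfer and language-divergence questions concrete.

def tag_example(src_text: str, src_lang: str, tgt_lang: str) -> str:
    """Prepend source/target language tags so a single encoder-decoder serves all directions."""
    return f"__{src_lang}__ __{tgt_lang}__ {src_text}"

pairs = [
    ("I am going home", "en", "hi"),
    ("I am going home", "en", "ta"),  # one-to-many: same source, different target tag
]
for text, src, tgt in pairs:
    print(tag_example(text, src, tgt))
```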
Self-Supervised Learning
Pre-trained models have driven progress in AI. Some important directions to explore:
- Additional pre-training loss functions for different NLG applications: alternative pre-training objectives, data augmentation, connections to information theory, and non-contrastive objective functions for faster training. Look at work in the vision domain. (A baseline objective is sketched below for reference.)
- Combining supervised and unsupervised data; in this context, understand when and how pre-training helps.
- Limited fine-tuning to optimize the computational budget.
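For reference, a minimal sketch of the standard masked language modeling objective against which the alternative and non-contrastive objectives above would be compared; the model, sizes, and mask id are toy placeholders, not IndicBERT:

```python
# Toy masked language modeling objective: mask ~15% of tokens and predict them.
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 64
embed = nn.Embedding(vocab_size, hidden)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True), num_layers=2
)
lm_head = nn.Linear(hidden, vocab_size)

tokens = torch.randint(0, vocab_size, (8, 32))   # a toy batch of token ids
mask = torch.rand(tokens.shape) < 0.15           # choose ~15% of positions to mask
inputs = tokens.masked_fill(mask, 0)             # 0 stands in for the [MASK] id here

logits = lm_head(encoder(embed(inputs)))
loss = nn.functional.cross_entropy(
    logits[mask], tokens[mask]                   # loss only on the masked positions
)
loss.backward()
```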
Training with Noisy data
Across projects, we are relying on mined data to train our AI models, and the noisy nature of these datasets is a fact of life. How do we train our models to perform better in the face of such noisy training data?
- Repairing noisy training data.
- Training objectives that take noise into account (one possible instantiation is sketched below).
- Knowledge distillation.
- Building Indic language-specific semantic models (like LaBSE) as well as divergent semantic models.
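One possible instantiation of a noise-aware objective, an assumption rather than a committed recipe: compute per-example losses and drop the highest-loss fraction of each batch, on the heuristic that mined pairs with extreme loss are likely misaligned. The function below is hypothetical.

```python
# Sketch of a truncated (noise-aware) loss: ignore the worst-loss examples per batch.
import torch
import torch.nn.functional as F

def truncated_loss(logits: torch.Tensor, targets: torch.Tensor, drop_frac: float = 0.2) -> torch.Tensor:
    """Cross-entropy averaged over the (1 - drop_frac) lowest-loss examples in the batch."""
    per_example = F.cross_entropy(logits, targets, reduction="none")
    keep = max(1, int(per_example.numel() * (1.0 - drop_frac)))
    kept, _ = torch.topk(per_example, keep, largest=False)  # keep the smallest losses
    return kept.mean()

# Toy usage: 16 examples, 10 classes
logits = torch.randn(16, 10, requires_grad=True)
targets = torch.randint(0, 10, (16,))
loss = truncated_loss(logits, targets)
loss.backward()
```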
AI for extremely low-resource languages
Modern AI relies on large amounts of data (raw or annotated). For low-resource languages beyond the top 12 Indian languages, novel methods have to be developed.
- Pre-training for low-resource languages.
- Utilizing dictionaries and other lexical resources (a toy example follows this list).
- Better use of language relatedness.
- Better zero-shot transfer.
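A toy sketch of one way dictionaries could be used for such languages: substituting words in high-resource sentences with dictionary translations to inject lexical signal for the new language. The lexicon entries, function, and substitution probability are hypothetical placeholders, not a committed design.

```python
# Dictionary-based augmentation sketch: randomly swap known words for their
# dictionary translations (placeholder "_L2" entries stand in for real translations).
import random

lexicon = {"house": "house_L2", "water": "water_L2"}  # hypothetical toy lexicon

def dictionary_augment(sentence: str, lexicon: dict, p: float = 0.5) -> str:
    """Randomly replace known words with their dictionary translations."""
    words = sentence.split()
    out = [lexicon[w] if w in lexicon and random.random() < p else w for w in words]
    return " ".join(out)

print(dictionary_augment("the house is near the water", lexicon))
```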
Translation between Indian languages
While the major focus has been on translation between English and Indian languages, direct translation between Indian languages is an important need in itself and also has the potential to improve English ←→ IL translation for low-resource languages. The area has been under-investigated; some directions to explore include:
- Utilizing similarity between Indian languages.
- Models combining all translation directions.
- Multi-source translation systems.
Efficient Models