2022 Q1 Report

5. Research

Natural Language Processing is a fast-evolving discipline, with rapid innovations and inter-disciplinary advances enabling better and novel technological solutions to a wide variety of use cases. At AI4Bharat, we are working on multiple problems across the language landscape, involving diverse languages and different modalities, viz. text, speech and images. Many of these languages differ from those for which NLP solutions are typically available, both in their linguistic properties and in the availability of linguistic resources. In the Indian context, novel use cases will arise that require innovative solutions. We are also mindful of making the most efficient use of computational resources in order to train and deploy AI models at scale - an area where practical solutions are called for. We are in the process of creating a strong research program to address these challenges and make impactful scientific contributions. The following are the goals of our research program:
Focus on research areas of interest aligned to the goals of AI4Bharat.
Collaborate with researchers and research groups to build a community that tackles relevant problems.
Train and motivate budding AI researchers to contribute to AI solutions in the Indian context.
Publish impactful research that will also showcase our capabilities in cutting-edge research.
Open-source code, models and datasets resulting from the research to attract more researchers to work on these research areas.
Research Areas
Multilingual Modeling
Limited training data is one of the central challenges in building AI models for low-resource languages. There is also an asymmetry in the availability of data resources across languages. For instance, while Hindi-English parallel corpora might be abundant, Gujarati-English or Gujarati-Hindi data is limited. A recent advancement in NLP has been the training of multilingual models that utilize data from multiple languages simultaneously. Such models allow low-resource languages to benefit from the larger corpora available for high-resource languages. In some scenarios, the models can perform well on a language even when no training data for that language is available - referred to as zero-shot cross-lingual transfer. We have also seen benefits from multilingual models in the training of the IndicTrans translation model and other models we have trained. Multilingual learning is an important area of research and there are many open questions to explore in order to achieve high levels of cross-lingual transfer. Broadly, some research directions to explore are:
How do we go beyond current transfer architectures and improve transfer across languages?
Better multilingual transfer for generation tasks like one-to-many translation
How do we reduce language divergences in the multilingual NMT architecture?
What are the language-specific biases in current multilingual modeling approaches, and how can they be reduced to enable better multilingual transfer?
A lot of the success in multilingual learning has been in the text domain. We would like to push the limits of multilingual learning in the speech modality as well as the combined speech+text modality.
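To make zero-shot cross-lingual transfer concrete, here is a minimal sketch that fine-tunes a multilingual encoder on a couple of English examples and then applies it directly to Hindi input. The model name, the toy examples and the hyper-parameters are illustrative assumptions, not our actual training recipe.

```python
# Minimal sketch of zero-shot cross-lingual transfer with a multilingual encoder.
# Model name, data and hyper-parameters are illustrative placeholders only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "xlm-roberta-base"  # any multilingual encoder could be substituted
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Fine-tune on English sentiment examples only (hypothetical data).
english_train = [("the movie was wonderful", 1), ("a disappointing plot", 0)]
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for text, label in english_train:
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    loss = model(**batch, labels=torch.tensor([label])).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Evaluate directly on a Hindi sentence for which no labelled data was seen.
model.eval()
with torch.no_grad():
    batch = tokenizer("फ़िल्म बहुत अच्छी थी", return_tensors="pt")
    pred = model(**batch).logits.argmax(dim=-1).item()
print("predicted label:", pred)  # transfer quality depends on the pre-trained model
```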
Self-Supervised Learning
Pre-trained models like BERT, BART and Wav2Vec have driven progress in AI. They impart models with prior knowledge and can be learnt from raw corpora (i.e. no supervised annotations are required) using a process called self-supervised learning. This reduces the amount of supervised data required for a particular task. Some important directions to explore are:
Language-group specific pre-trained models. Our previous publications have established the utility of language-group specific pre-trained models, and we will continue to explore this direction further to utilize linguistic similarities in the pre-training process.
Pre-training loss functions.
Combining supervised and unsupervised data, and in this context, understanding when and how pre-training helps.
Combining multilingual learning and self-supervised learning.
Faster training and fine-tuning to optimize the computational budget.
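To illustrate what the self-supervision signal looks like, the following is a minimal sketch of one masked language modelling step, the objective behind BERT-style pre-training. The checkpoint name and the 15% masking rate are illustrative assumptions.

```python
# Toy masked language modelling (MLM) step: mask random tokens in raw text and
# train the model to reconstruct them - no human annotations are needed.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

text = "raw text in any language can supervise itself"
enc = tokenizer(text, return_tensors="pt")
input_ids = enc["input_ids"].clone()
labels = input_ids.clone()

# Randomly mask ~15% of non-special tokens; only masked positions contribute to the loss.
prob = torch.full(input_ids.shape, 0.15)
special = torch.tensor(
    tokenizer.get_special_tokens_mask(input_ids[0].tolist(), already_has_special_tokens=True)
).bool()
prob[0][special] = 0.0
masked = torch.bernoulli(prob).bool()
input_ids[masked] = tokenizer.mask_token_id
labels[~masked] = -100  # ignore unmasked positions in the loss

loss = model(input_ids=input_ids, attention_mask=enc["attention_mask"], labels=labels).loss
loss.backward()  # one self-supervised gradient step
print("MLM loss:", loss.item())
```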
Training with Noisy Data
Across projects, we rely on mined data for training our AI models. The noisy nature of these datasets is a fact of life. How do we train our models to perform better in the face of such noisy training data? Some of the directions we plan to explore are:
Repairing noisy training data
Training objectives that take noise into account
Knowledge Distillation
Building Indic language-specific semantic models (like LaBSE) as well as divergent semantic models to detect noise in data at a fine-grained level
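As one concrete way of using semantic models like LaBSE for noise detection, the sketch below scores mined sentence pairs with multilingual sentence embeddings and keeps only pairs above a similarity threshold. The library calls follow the sentence-transformers API; the threshold and example pairs are assumptions for illustration.

```python
# Sketch: filter a mined parallel corpus with multilingual sentence embeddings.
# The threshold (0.7) and the example pairs are illustrative, not tuned values.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

mined_pairs = [
    ("The weather is nice today", "आज मौसम अच्छा है"),   # likely a good pair
    ("The weather is nice today", "बैंक कल बंद रहेगा"),   # likely a misaligned pair
]

src = model.encode([s for s, _ in mined_pairs], convert_to_tensor=True)
tgt = model.encode([t for _, t in mined_pairs], convert_to_tensor=True)
scores = util.cos_sim(src, tgt).diagonal()  # similarity of each aligned pair

clean_pairs = [pair for pair, score in zip(mined_pairs, scores) if score >= 0.7]
print(f"kept {len(clean_pairs)} of {len(mined_pairs)} pairs")
```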
AI for extremely low-resource languages
Modern AI relies on large amounts of data (raw or annotated). For low-resource languages beyond the top 12 Indian languages, even basic resources like raw text are hard to come by. Hence, novel methods have to be developed to address these severely resource-constrained scenarios. Some directions to explore are:
Pre-training for low-resource languages
Utilizing dictionaries and other lexical resources
Better use of language relatedness
Better zero-shot transfer
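As a simple illustration of how lexical resources can help when even raw text is scarce, the sketch below creates synthetic text via word substitution through a small bilingual lexicon from a related language. The lexicon, sentences and function name are invented placeholders.

```python
# Sketch: dictionary-based data augmentation for an extremely low-resource language.
# Words from a related, better-resourced language are swapped via a tiny bilingual
# lexicon to create noisy synthetic target-language text.
from typing import Dict, List

def augment_with_dictionary(sentences: List[str], lexicon: Dict[str, str]) -> List[str]:
    """Replace every word found in the lexicon, keeping unknown words unchanged."""
    augmented = []
    for sentence in sentences:
        words = sentence.split()
        augmented.append(" ".join(lexicon.get(w, w) for w in words))
    return augmented

# Hypothetical source sentences and a made-up lexicon.
source_sentences = ["water is scarce here", "the school is far"]
toy_lexicon = {"water": "paani", "school": "shaala", "far": "duur"}

synthetic = augment_with_dictionary(source_sentences, toy_lexicon)
print(synthetic)  # synthetic data that can seed pre-training or fine-tuning
```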
Translation between Indian languages
While the major focus has been on translation between English and Indian languages, translation between Indian languages is an important need in itself, and it also has the potential to improve English ←→ Indian language translation for low-resource languages. This area has been under-investigated, and some directions to explore include:
Utilizing the similarity between Indian languages
Models combining all translation directions
Multi-source translation systems
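One common way to build a single model covering all translation directions is to prepend a target-language tag to each source sentence. The sketch below shows only this data preparation step; the tag format and examples are assumptions for illustration.

```python
# Sketch: preparing training data for one many-to-many translation model by
# prepending a target-language tag to each source sentence, so a single model
# can serve all directions. Language codes and sentences are illustrative only.
from typing import List, Tuple

def tag_for_many_to_many(examples: List[Tuple[str, str, str]]) -> List[Tuple[str, str]]:
    """Each example is (target_lang_code, source_text, target_text)."""
    return [(f"<2{tgt_lang}> {src}", tgt) for tgt_lang, src, tgt in examples]

raw_examples = [
    ("hi", "How are you?", "आप कैसे हैं?"),                        # en -> hi
    ("ta", "आप कैसे हैं?", "நீங்கள் எப்படி இருக்கிறீர்கள்?"),      # hi -> ta (Indic-to-Indic)
]

for src, tgt in tag_for_many_to_many(raw_examples):
    print(src, "=>", tgt)
```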
Efficient Deep Learning Models
High-performing deep learning models are compute- and memory-intensive. Such models increase training time and lengthen experimental cycles, and they also increase the cost of deploying models at scale. We would like to explore solutions in the following directions:
Compressed models
Knowledge Distillation
Efficient training objectives
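For concreteness, the sketch below shows a standard knowledge distillation objective in PyTorch: a small student is trained to match a frozen teacher's softened output distribution alongside the gold labels. The tiny linear models, temperature and mixing weight are illustrative choices, not our deployed setup.

```python
# Sketch of a standard knowledge-distillation objective: the student matches the
# teacher's softened predictions (KL term) in addition to the gold labels (CE term).
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Linear(16, 4)   # stand-in for a large, frozen teacher
student = nn.Linear(16, 4)   # stand-in for a small student to be deployed
T, alpha = 2.0, 0.5          # temperature and mixing weight (illustrative)

x = torch.randn(8, 16)              # a batch of features
y = torch.randint(0, 4, (8,))       # gold labels

with torch.no_grad():
    teacher_logits = teacher(x)
student_logits = student(x)

# Soft targets: KL divergence between temperature-scaled distributions.
kd_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T * T)
ce_loss = F.cross_entropy(student_logits, y)

loss = alpha * kd_loss + (1 - alpha) * ce_loss
loss.backward()  # gradients flow only into the student
```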

Publications

Our work over the last couple of years has resulted in papers at top-tier conferences as well as widely used datasets and models.

List of Publications
2022 (4 publications)
Towards Building ASR Systems for the Next Billion Users
Tahir Javed, Sumanth Doddapaneni, Abhigyan Raman, Kaushal Santosh Bhogale, Gowtham Ramesh, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M Khapra
AAAI
Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages
Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, Mitesh Shantadevi Khapra
Transactions of the ACL
IndicBART: A Pre-trained Model for Natural Language Generation of Indic Languages
Raj Dabre, Himani Shrotriya, Anoop Kunchukuttan, Ratish Puduppully, Mitesh M. Khapra, Pratyush Kumar
Findings of ACL
IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic Languages
Aman Kumar, Himani Shrotriya, Prachi Sahu, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, Amogh Mishra, Mitesh M. Khapra, Pratyush Kumar
arXiv preprint arXiv:2203.05437
2021 (1 publication)
A Primer on Pretrained Multilingual Language Models
Sumanth Doddapaneni, Gowtham Ramesh, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M Khapra
arXiv preprint arXiv:2107.00676
2020 (1 publication)
IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages
Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar
Findings of EMNLP
