Multilingual Modeling
Limited training data is one of the central challenges in building AI models for low-resource languages. There is also an asymmetry in the availability of data resources across languages. For instance, while Hindi-English parallel corpora might be abundant, Gujarati-English or Gujarati-Hindi data is limited. A recent advancement in NLP has been the training of multilingual models, which utilize data from multiple languages simultaneously. Such models allow low-resource languages to benefit from the larger corpora of high-resource languages. In some scenarios, these models can perform well on a language even when no training data for that language is available, a capability referred to as zero-shot cross-lingual transfer.
We have also seen the benefits of multilingual models in the training of the IndicTrans translation model and other models we have trained. Multilingual learning is an important area of research, and there are many open questions to explore in order to achieve high levels of cross-lingual transfer. Broadly, some research directions to explore are (a small sketch of a common data-mixing recipe follows this list):
How do we go beyond current transfer architectures and improve transfer across languages?
Better multilingual transfer for generation tasks like one-to-many translation
How do we reduce language divergences in the multilingual NMT architecture?
What are the language-specific biases in current multilingual modeling approaches, and how can they be reduced to enable better multilingual transfer?
A lot of the success in multilingual learning has been in the text domain. We would like to push the limits of multilingual learning in the speech modality as well as speech+text modality.
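How corpora of very different sizes are mixed during multilingual training largely determines how much low-resource languages benefit. The sketch below shows temperature-based sampling, one commonly used mixing recipe; the language codes and corpus sizes are illustrative placeholders, not our actual data statistics.

```python
# A minimal sketch of temperature-based sampling for mixing corpora of very
# different sizes when training a single multilingual model.
# The corpus sizes below are illustrative placeholders, not real statistics.

def sampling_probs(sizes, T=5.0):
    """Return per-language sampling probabilities p_i proportional to (n_i / sum_j n_j) ** (1/T)."""
    total = sum(sizes.values())
    weights = {lang: (n / total) ** (1.0 / T) for lang, n in sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

corpus_sizes = {"hi": 10_000_000, "gu": 500_000, "as": 50_000}
print(sampling_probs(corpus_sizes, T=1.0))  # proportional to corpus size
print(sampling_probs(corpus_sizes, T=5.0))  # flatter mix; up-samples gu and as
```

With T=1 the mix simply mirrors corpus sizes; larger values of T flatten the distribution and give low-resource languages a larger share of training batches.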
Self-Supervised Learning
Pre-trained models like BERT, BART, and Wav2Vec have driven progress in AI. They impart prior knowledge to models and can be learnt from raw corpora (i.e., no supervised annotations are required) using a process called self-supervised learning. This reduces the amount of supervised data required for a particular task. Some important directions to explore are (a small sketch of the masked-language-modelling objective follows this list):
Language-group specific pre-trained models. Our previous publications have established the utility of language-group specific pre-trained models, and we will continue to explore further in this direction to utilize linguistic similarities in the pre-training process.
Pre-training loss functions.
Combining supervised and unsupervised data, and understanding when and how pre-training helps in this setting.
Combining multilingual learning and self-supervised learning.
Faster training and finetuning to optimize computational budget.
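As a concrete illustration of the self-supervised setup, the sketch below computes a masked-language-modelling loss from raw, unannotated text. It assumes the HuggingFace transformers library and uses the public xlm-roberta-base checkpoint and toy sentences purely as examples, not our own models or data.

```python
# Hedged sketch of the masked-language-modelling (MLM) self-supervised objective.
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Raw text is the only input the objective needs; no annotations are required.
sentences = ["यह एक उदाहरण वाक्य है।", "Raw text is all this objective needs."]
batch = tokenizer(sentences, return_tensors="pt", padding=True)

# The collator randomly masks ~15% of the tokens and builds matching labels.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
masked = collator([{"input_ids": ids} for ids in batch["input_ids"]])

# The model is trained to reconstruct the masked tokens.
loss = model(input_ids=masked["input_ids"],
             attention_mask=batch["attention_mask"],
             labels=masked["labels"]).loss
loss.backward()
```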
Training with Noisy Data
Across projects, we rely on mined data for training our AI models, and the noisy nature of these datasets is a fact of life. How do we train our models to perform better in the face of such noisy training data? Some of the directions we plan to explore are:
Repairing noisy training data
Training objectives that take noise into account
Knowledge Distillation
Building Indic language-specific semantic models (like LaBSE), as well as semantic divergence models, to detect noise in data at a fine-grained level (a small filtering sketch follows this list).
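As one illustration of the last point, the sketch below scores mined parallel pairs with the publicly released LaBSE encoder and drops low-similarity pairs. It assumes the sentence-transformers library; the example pairs and the threshold are illustrative assumptions, not a recommended setting.

```python
# Hedged sketch: filter noisy mined parallel data with a multilingual sentence encoder.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/LaBSE")

pairs = [
    ("The weather is nice today.", "आज मौसम अच्छा है।"),          # likely a clean pair
    ("The weather is nice today.", "मुझे क्रिकेट खेलना पसंद है।"),   # likely a noisy pair
]

src_emb = encoder.encode([s for s, _ in pairs], convert_to_tensor=True, normalize_embeddings=True)
tgt_emb = encoder.encode([t for _, t in pairs], convert_to_tensor=True, normalize_embeddings=True)
scores = util.cos_sim(src_emb, tgt_emb).diagonal()

THRESHOLD = 0.7  # illustrative cut-off; in practice tuned on held-out data
kept = [pair for pair, score in zip(pairs, scores) if score >= THRESHOLD]
print(kept)
```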
AI for Extremely Low-Resource Languages
Modern AI relies on large amounts of data (raw or annotated). For low-resource languages beyond the top-12 Indian languages, even basic resources like raw text are hard to come by. Hence, novel methods have to be developed to address these severely resource-constrained scenarios. Some directions to explore (a small dictionary-based augmentation sketch follows this list):
Pre-training for low-resource languages
Utilizing dictionaries and other lexical resources
Better use of language relatedness
Better zero-shot transfer
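As a small illustration of how lexical resources could be exploited, the sketch below substitutes words in a related high-resource-language sentence using a bilingual dictionary to synthesise additional (code-mixed) training text. The lexicon, substitution rate, and sentence are hypothetical placeholders, not actual resources.

```python
# Hedged sketch of dictionary-based augmentation for a low-resource language.
import random

lexicon = {"water": "paani", "house": "ghar", "big": "bada"}  # hypothetical bilingual dictionary

def augment(sentence, p=0.3):
    out = []
    for word in sentence.split():
        key = word.lower()
        # With probability p, replace a word with its dictionary translation.
        if key in lexicon and random.random() < p:
            out.append(lexicon[key])
        else:
            out.append(word)
    return " ".join(out)

print(augment("the big house has no water"))
```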
Translation between Indian Languages
While the major focus has been on translation between English and Indian languages, translation between Indian languages is an important need in itself, and it also has the potential to improve English ←→ Indian language translation for low-resource languages. This area has been under-investigated, and some directions to explore include (a small illustration of direct Indian-to-Indian translation follows the list):
Utilizing similarity between Indian languages
Models combining all translation directions
Multi-source translation systems
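As a point of reference for direct (non-pivot) translation between Indian languages, the sketch below runs Hindi-to-Gujarati translation with the publicly available M2M-100 many-to-many model through HuggingFace transformers. It is purely illustrative and is not the IndicTrans model.

```python
# Hedged sketch of direct Hindi→Gujarati translation with one many-to-many model.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "hi"                           # source language: Hindi
encoded = tokenizer("आज मौसम अच्छा है।", return_tensors="pt")

# Force the decoder to generate Gujarati, translating directly between the two
# Indian languages without pivoting through English.
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("gu"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```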
Efficient Deep Learning Models
High-performing deep learning models are compute- and memory-intensive. Such models increase training time, lengthen experimental cycles, and raise the cost of deploying models at scale. We would like to explore solutions in the following directions (a small sketch of a knowledge-distillation objective follows the list):
Compressed models
Knowledge Distillation
Efficient training objectives
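As an example of the kind of technique we have in mind, the sketch below implements a standard knowledge-distillation objective in PyTorch: a temperature-softened KL term against a teacher's outputs plus the usual cross-entropy against gold labels. The temperature and mixing weight are illustrative settings, not tuned values.

```python
# Hedged sketch of a standard knowledge-distillation loss (soft + hard targets).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: match the teacher's temperature-scaled distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage with random logits for a batch of 4 examples and 10 classes.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```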