2022 Q1 Report

1. Our Mission and Goals

Artificial Intelligence (AI) is one of the defining technologies of our age. Its progress, especially in the last decade, has been remarkable. Aided by abundant data, cheaper compute, and advances in algorithms and hardware, the sub-field of Deep Learning (DL) has enabled machines to reach human-level performance on specific tasks such as image classification, speech recognition, and gameplay in games such as Go and Dota 2. Tech corporations have achieved enormous commercial success by deploying AI in ubiquitously used products and services.
With this backdrop, the Centre for AI4Bharat was founded with a tactical focus: to build AI solutions for the problems of Bharat, today. We aim to bridge the wide gap between demonstrated AI technologies and our real-world challenges across the domains of agriculture, healthcare, smart cities, digital India, and sustainability. The primary focus of AI4Bharat is to build language technology that provides equitable digital access to the large multilingual population of Bharat.

1.1 Language rich, resource poor

Indian languages, with their rich morphological structures, exemplify the diversity of India, that is Bharat. India is home to the fourth-highest number of languages in the world (447). Hindi-Urdu, Bengali, and Punjabi rank among the top 20 most-spoken languages worldwide. As per the 2011 census, there are 1,369 rationalised mother tongues (10K+ speakers each), grouped into 121 languages, including the 22 constitutionally recognised languages. 191 of these rationalised mother tongues are classified as vulnerable or endangered. 27 non-scheduled languages have more than 1 million speakers, and most of these are dialects/variants grouped under the Hindi language. Sanskrit, Kannada, Telugu, Malayalam, Tamil, and Odia have been given classical-language status for their rich heritage and independent character. Several major Indian languages have enhanced their vocabulary by creating native-language word banks covering many domains of modern life. Indian languages are characterised by rich morphological structures and support multiple dialects and accents; this heterogeneity sustains the diverse cultural ecosystems that are part of Bharat, that is India.
This rich, diverse, heterogeneous, and multilingual character of our country has its advantages. On the flip side, however, it presents a major challenge to the goals of Digital India, whose objective is that all citizens of our country can access digital content and services in their preferred native language(s). This is mainly because the resources needed to enable NLP in Indian languages are limited and must be scaled up significantly. A comparison of NLP readiness between high-resource languages like English and Spanish and Indian languages indicates that, to achieve parity, there is a need for a simple, holistic, and elegant foundation that enables effective natural language processing (AI) for Indian languages. This foundation would have to address many gaps.
These gaps include:
- Building a sufficient volume and quality of Indian-language text, speech, and audio-visual data for natural language processing, with easy access to this data.
- Assigning equal priority to Indian languages with smaller user bases and fewer language resources; this is important for preserving the Indian knowledge and wisdom embedded within these language ecosystems.
- Providing input tools (keyboards, spell checkers) and output tools (display, fonts) for Indian scripts that are easy to use, intuitive, and support rich language structures, so that users can create, validate, and consume language content. A similar requirement exists for audio tools that effectively capture dialects and accents for speech-data collection and processing.
- Bringing disparate methods, techniques, and models across speech, text, and vision under a common unified language-technology stack; one of the unstated needs is an architecture flexible enough to accommodate and adapt to existing as well as emerging technologies (cloud and edge).
- Defining common standards, across the multiple initiatives in India (academia, public sector, private sector, and startups), for organising all Indic-language assets (data, models, user interfaces) in a consolidated, open-source form that commercial, social, and public services can use to develop viable applications supporting the objectives of Digital India.
To summarise, there is a need to continuously enhance the resources for Indian languages, and a key objective is to achieve reasonable parity with high-resource languages like English. This is critical to ensuring that existing users of English NLP applications can consume the same services in Indian languages. More importantly, it mitigates the existing divide for the majority of Indian citizens who use Indian (non-English) languages for their interactions, and for consuming services and applications, in the digital world that is the new normal.
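As a small aside (not part of the report's plans), the script diversity discussed above can be made concrete with a naive script detector that counts characters per Unicode block. The block ranges below come from the Unicode standard; the function name and structure are purely illustrative, and a real input/output tool would need far more (joined scripts, punctuation, mixed-script text).

```python
# Major Indic script blocks from the Unicode standard (code-point ranges).
UNICODE_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali":    (0x0980, 0x09FF),
    "Gurmukhi":   (0x0A00, 0x0A7F),
    "Gujarati":   (0x0A80, 0x0AFF),
    "Oriya":      (0x0B00, 0x0B7F),
    "Tamil":      (0x0B80, 0x0BFF),
    "Telugu":     (0x0C00, 0x0C7F),
    "Kannada":    (0x0C80, 0x0CFF),
    "Malayalam":  (0x0D00, 0x0D7F),
}

def detect_script(text: str) -> str:
    """Return the Indic script with the most characters in `text`,
    or "Unknown" if no character falls in a listed block."""
    counts = {name: 0 for name in UNICODE_BLOCKS}
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in UNICODE_BLOCKS.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "Unknown"

print(detect_script("நமஸ்காரம்"))  # a Tamil-script string
```

Even this toy version shows why script-aware tooling matters: the same language can be written in multiple scripts, and many tools must first establish which script they are handling.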

1.2 AI4Bharat’s Mission

Given the above context, the mission of AI4Bharat is to bring parity with English in AI technology for Indian languages, through open-source contributions of tools, data, models, and solutions.
English and a few other high-resource languages lie in a sweet spot: they have large amounts of labelled and unlabelled data and are well studied in natural language processing. Building on this extensive work, their AI ecosystems comprise significant data, techniques, tools, models, and APIs. This has enabled a large number of language-specific AI-NLP tools and applications to be developed for delivering commercial, social, and public services to consumers in these languages.
In India’s context, multilinguality presents a major challenge to achieving the goals of Digital India: billions of people would prefer to access digital content and services in their native language(s). For this to happen, there needs to be a focus on a unified, Bharat-centric platform that emulates/reuses the best practices of NLP work done for English and brings Indian languages to parity with it in NLP-ecosystem assets: data, tools, techniques, models, and APIs.
This would ensure AI4Bharat’s vision is aligned with the National Language Technology Mission’s (NLTM) aim of harnessing modern AI techniques to develop automatic speech recognition (ASR), text-to-speech (TTS), and optical character recognition (OCR) technologies for Indian languages. The platform will offer open data, open AI/ML models, open-source code, and open APIs to encourage open ecosystems for both contribution and adoption. This is expected to enable public- and private-sector organisations and startups to innovate and bring value-added services to citizens.
In addition, given the rich morphological structures of Indian languages and their diversity of dialects, accents, and scripts, a key objective of AI4Bharat is to ensure this diversity/heterogeneity is retained, to the extent feasible, in the data collected and the models and tools developed as part of the platform.
To summarise, AI4Bharat’s focus is to strike the right balance: reusing the best practices adopted for building language technology for high-resource languages such as English, while ensuring that the richness, diversity, and heterogeneity of Indian languages are retained in the digital ecosystems that use them to deliver services to the citizens of India in their native languages.

1.3 AI4Bharat’s Goals

G1: Open Data and Shoonya
- G1.1 NLTM: Play a central role in Bhashini in making the mission a success
- G1.2 Annotator Team: Build a team of 10-15 internal & external language experts per language
- G1.3 Shoonya: Develop Shoonya to support annotation for all tasks & languages
- G1.4 Visibility in academia: Establish Indian-language datasets as international benchmarks for evaluating AI models in academia
- G1.5 (Stretch) 🤗 for India: Get HuggingFace to build a separate (sub-)site for Indian-language datasets and model demos

G2: Tech Stack for Translation
- G2.1 NTM partnership: Establish a partnership with CIIL / NTM. Be open to supporting the many startups in this space
- G2.2 Anuvaad <-> Shoonya: Integrate the Anuvaad UI with Shoonya data integration
- G2.3 HCI for translators: Establish an HCI practice to take inputs from professional translators to improve Anuvaad (e.g. glossary, personalization, …)
- G2.4 IndicTrans improvements: Improve IndicTrans with domain-specific fine-tuning, training on mined/collected corpora, edit data from Anuvaad, NER support, …
- G2.5 (Stretch) N-way translation: Support n-way document editing - simultaneously translating across multiple languages
- G2.6 Samanantar: Mine from growing crawls, CC-100, and mC4

G3: Tech Stack for Transcription
- G3.1 Prasar Bharati partnership: Strike a partnership with Prasar Bharati to release diverse benchmarks and datasets as DMU (possibly also with OTT players)
- G3.2 Chitralekha: Build the Chitralekha UI as part of the Shoonya application, with IndicASR and IndicXlit support
- G3.3 BIRD - Education through SLS: Contribute to the BIRD project to transcribe 1,000 movies across languages for enhancing literacy in Indian languages
- G3.4 IndicASR improvements: Improve IndicASR with domain-specific fine-tuning, training on larger datasets, personalized models, on-device models, fine-tuning on edit data from Chitralekha, …
- G3.5 (Stretch) Multi-modal search: Build models for multi-modal search across audio and transcripts, and deploy them for Prasar Bharati

G4: Speech-to-Speech Translation
- G4.1 TTS: Release TTS datasets and models with freely licensed voices across all languages
- G4.2 Pipeline to evaluate S2S output: Build a diverse benchmark and an efficient AI + manual approach to evaluate the quality of S2S translation in Shoonya
- G4.3 Personalized TTS: Build models for personalized style transfer - voice cloning and capturing delivery/emphasis
- G4.4 NMT for spoken text: Improve NMT to support the long informal sentences used in speech
- G4.5 (Stretch) Production S2S system: Achieve production readiness - the entire pipeline runs in near real-time on a portable Nvidia DGX box

G5: Digitize a Million Books
- G5.1 OCR: Build datasets/models/benchmarks for OCR for 12 Indian scripts
- G5.2 Document understanding: Build datasets/models/benchmarks for document understanding
- G5.3 OCR <-> Shoonya: Develop interfaces in Shoonya for correcting scanned content
- G5.4 A million-books repo: Digitize 1 million books and host them online with a searchable index
- G5.5 (Stretch) On-demand books: Make books available page-wise and on demand, in text/speech, in the language of choice

G6: Natural Language Understanding
- G6.1 IndicCorp: A large and diverse collection of raw text data for 22 Indian languages
- G6.2 IndicGLUE: Benchmarks for NLU in 12 Indic languages
- G6.3 IndicXtreme: Benchmarks for cross-lingual NLU between English and Indian languages
- G6.4 IndicBERT: Pre-trained bidirectional language model for 22 Indian languages
- G6.5 IndicNER: NER for 11 Indian languages
- G6.6 Task-specific models: Models for question answering and sentiment analysis

G7: Natural Language Generation
- G7.1 IndicNLG Suite: Benchmarks for Indian-language NLG tasks
- G7.2 IndicBART: Pre-trained sequence-to-sequence model for 22 Indian languages
- G7.3 IndicGPT: Pre-trained causal language model for 22 Indian languages
- G7.4 Task-specific models: Models for tasks like summarization and paraphrase generation
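As a toy illustration of the "searchable index" in the Digitize a Million Books goal, here is a minimal inverted index over digitized pages. The class and method names are hypothetical, and whitespace tokenization is a placeholder: a production system would need Indic-aware tokenization, ranking, and persistent storage.

```python
from collections import defaultdict

class PageIndex:
    """Toy inverted index: maps each token to the set of pages containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of page ids

    def add_page(self, page_id: str, text: str) -> None:
        # Naive whitespace tokenization; real Indic text needs proper tokenizers.
        for token in text.lower().split():
            self.postings[token].add(page_id)

    def search(self, query: str) -> set:
        """Return ids of pages containing every token in the query (AND search)."""
        tokens = query.lower().split()
        if not tokens:
            return set()
        result = self.postings[tokens[0]].copy()
        for tok in tokens[1:]:
            result &= self.postings[tok]
        return result

# Hypothetical usage: page ids encode book and page number.
idx = PageIndex()
idx.add_page("book1:p1", "ancient history of india")
idx.add_page("book2:p4", "history of indian languages")
print(idx.search("history india"))  # only the page containing both tokens
```

The AND-intersection semantics shown here is the simplest useful design; extending it to ranked retrieval (e.g. TF-IDF scoring) is the natural next step for a million-book corpus.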
