2022 Q1 Report

1. Our Mission and Goals

Artificial Intelligence (AI) is one of the defining technologies of our age. Its progress, especially in the last decade, has been remarkable. Aided by abundant data, cheaper compute, and advances in algorithms and hardware, the sub-field of Deep Learning (DL) has enabled machines to reach human-level performance on specific tasks such as image classification, speech recognition, and gameplay in games such as Go and Dota 2. Tech corporations have achieved enormous commercial success by deploying AI in ubiquitously used products and services.
With this backdrop, the Centre for AI4Bharat was founded with a tactical focus: to build AI solutions for the problems of Bharat, today. We aim to bridge the wide gap between demonstrated AI technologies and our real-world challenges across the domains of agriculture, healthcare, smart cities, digital India, and sustainability. The primary focus of AI4Bharat is to build language technology that provides equitable digital access to the large multilingual population of Bharat.

1.1 Language rich, resource poor

Indian languages, with their rich morphological structures, exemplify the diversity of India, that is Bharat. India is home to the fourth-highest number of languages in the world (447). Hindi-Urdu, Bengali, and Punjabi rank among the top 20 most-spoken languages worldwide. As per the 2011 census, there are 1,369 rationalised mother tongues (10K+ speakers each), grouped into 121 languages, including the 22 constitutionally recognised languages. 191 of these rationalised mother tongues are classified as vulnerable or endangered. 27 non-scheduled languages have more than 1 million speakers, and most of these are dialects/variants grouped under the Hindi language. Sanskrit, Kannada, Telugu, Malayalam, Tamil, and Odia have been given classical-language status for their rich heritage and independent character. Several major Indian languages have enhanced their vocabulary by creating native-language word banks covering many domains of modern life. Indian languages are characterised by rich morphological structures and support multiple dialects and accents; this heterogeneity sustains the diverse cultural ecosystems that are part of Bharat, that is India.
This rich, diverse, heterogeneous, and multilingual character of our country has its advantages. On the flip side, however, it presents a major challenge to the goals of Digital India, whose objective is that all citizens of our country can access digital content and services in their preferred native language(s). This is mainly because the resources needed to enable NLP in Indian languages are limited and must be scaled up significantly. A comparison of NLP readiness between high-resource languages like English and Spanish and Indian languages indicates that, to achieve parity, there is a need for a simple, holistic, and elegant foundation that enables effective natural language processing (AI) for Indian languages. This foundation would have to address many gaps.
These gaps include:
- Building a sufficient volume and quality of Indian-language text, speech, and audio-visual data for natural language processing, with easy access to this data.
- Assigning equal priority to Indian languages with smaller user bases and fewer language resources; this is important for preserving the Indian knowledge and wisdom embedded within these language ecosystems.
- Providing input tools (keyboards, spell checkers) and output tools (display, fonts) for Indian scripts that are easy to use, intuitive, and support rich language structures, so that users can create, validate, and consume language content. A similar requirement exists for audio tools that effectively capture dialects and accents for speech-data collection and processing.
- Bringing disparate methods, techniques, and models across speech, text, and vision under a common unified language-technology stack; one of the unstated needs is an architecture flexible enough to accommodate and adapt to existing as well as emerging technologies (cloud and edge).
- Defining common standards, across the multiple initiatives in India (academia, public sector, private sector, and startups), for organising all Indic-language assets (data, models, user interfaces) in a consolidated, open-source form that commercial, social, and public services can use to develop viable applications supporting the objectives of Digital India.
To summarise, there is a need to continuously enhance the resources for Indian languages, and a key objective is to achieve reasonable parity with high-resource languages like English. This is critical to ensuring that existing users of English NLP applications can consume the same services in Indian languages. More importantly, it mitigates the existing divide for the majority of Indian citizens who use Indian (non-English) languages for their interactions, and for consuming services and applications, in the digital world that is the new normal.
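As a small aside (not part of the report's plans), the script diversity discussed above can be made concrete with a naive script detector that counts characters per Unicode block. The block ranges below come from the Unicode standard; the function name and structure are purely illustrative, and a real input/output tool would need far more (joined scripts, punctuation, mixed-script text).

```python
# Major Indic script blocks from the Unicode standard (code-point ranges).
UNICODE_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali":    (0x0980, 0x09FF),
    "Gurmukhi":   (0x0A00, 0x0A7F),
    "Gujarati":   (0x0A80, 0x0AFF),
    "Oriya":      (0x0B00, 0x0B7F),
    "Tamil":      (0x0B80, 0x0BFF),
    "Telugu":     (0x0C00, 0x0C7F),
    "Kannada":    (0x0C80, 0x0CFF),
    "Malayalam":  (0x0D00, 0x0D7F),
}

def detect_script(text: str) -> str:
    """Return the Indic script with the most characters in `text`,
    or "Unknown" if no character falls in a listed block."""
    counts = {name: 0 for name in UNICODE_BLOCKS}
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in UNICODE_BLOCKS.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "Unknown"

print(detect_script("நமஸ்காரம்"))  # a Tamil-script string
```

Even this toy version shows why script-aware tooling matters: the same language can be written in multiple scripts, and many tools must first establish which script they are handling.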

1.2 AI4Bharat’s Mission

Given the above context, the mission of AI4Bharat is to bring parity with English in AI technology for Indian languages, through open-source contributions of tools, data, models, and solutions.
English and a few other high-resource languages lie in a sweet spot: they have large amounts of labelled and unlabelled data and are well studied in natural language processing. Building on this extensive work, their AI ecosystems comprise significant data, techniques, tools, models, and APIs. This has enabled a large number of language-specific AI-NLP tools and applications to be developed for delivering commercial, social, and public services to consumers in these languages.
In India’s context, multilinguality presents a major challenge to achieving the goals of Digital India: billions of people would prefer to access digital content and services in their native language(s). For this to happen, there needs to be a focus on a unified, Bharat-centric platform that emulates/reuses the best practices of NLP work done for English and brings Indian languages to parity with it in NLP-ecosystem assets: data, tools, techniques, models, and APIs.
This would ensure AI4Bharat’s vision is aligned with the National Language Technology Mission’s (NLTM) aim of harnessing modern AI techniques to develop automatic speech recognition (ASR), text-to-speech (TTS), and optical character recognition (OCR) technologies for Indian languages. The platform will offer open data, open AI/ML models, open-source code, and open APIs to encourage open ecosystems for both contribution and adoption. This is expected to enable public- and private-sector organisations and startups to innovate and bring value-added services to citizens.
In addition, given the rich morphological structures of Indian languages and their diversity of dialects, accents, and scripts, a key objective of AI4Bharat is to ensure this diversity/heterogeneity is retained, to the extent feasible, in the data collected and the models and tools developed as part of the platform.
To summarise, AI4Bharat’s focus is to strike the right balance: reusing the best practices adopted for building language technology for high-resource languages such as English, while ensuring that the richness, diversity, and heterogeneity of Indian languages are retained in the digital ecosystems that use them to deliver services to the citizens of India in their native languages.

1.3 AI4Bharat’s Goals

G1: Open Data and Shoonya
- G1.1 NLTM: Play a central role in Bhashini in making the mission a success
- G1.2 Annotator Team: Build a team of 10-15 internal & external language experts per language
- G1.3 Shoonya: Develop Shoonya to support annotation for all tasks & languages
- G1.4 Visibility in academia: Establish Indian-language datasets as international benchmarks for evaluating AI models in academia
- G1.5 (Stretch) 🤗 for India: Get HuggingFace to build a separate (sub-)site for Indian-language datasets and model demos

G2: Tech Stack for Translation
- G2.1 NTM partnership: Establish a partnership with CIIL / NTM. Be open to supporting the many startups in this space
- G2.2 Anuvaad <-> Shoonya: Integrate the Anuvaad UI with Shoonya data integration
- G2.3 HCI for translators: Establish an HCI practice to take inputs from professional translators to improve Anuvaad (e.g. glossary, personalization, …)
- G2.4 IndicTrans improvements: Improve IndicTrans with domain-specific fine-tuning, training on mined/collected corpora, edit data from Anuvaad, NER support, …
- G2.5 (Stretch) N-way translation: Support n-way document editing - simultaneously translating across multiple languages
- G2.6 Samanantar: Mine from growing crawls, CC-100, and mC4

G3: Tech Stack for Transcription
- G3.1 Prasar Bharati partnership: Strike a partnership with Prasar Bharati to release diverse benchmarks and datasets as DMU (possibly also with OTT players)
- G3.2 Chitralekha: Build the Chitralekha UI as part of the Shoonya application, with IndicASR and IndicXlit support
- G3.3 BIRD - Education through SLS: Contribute to the BIRD project to transcribe 1,000 movies across languages for enhancing literacy in Indian languages
- G3.4 IndicASR improvements: Improve IndicASR with domain-specific fine-tuning, training on larger datasets, personalized models, on-device models, fine-tuning on edit data from Chitralekha, …
- G3.5 (Stretch) Multi-modal search: Build models for multi-modal search across audio and transcripts, and deploy them for Prasar Bharati

G4: Speech-to-Speech Translation
- G4.1 TTS: Release TTS datasets and models with freely licensed voices across all languages
- G4.2 Pipeline to evaluate S2S output: Build a diverse benchmark and an efficient AI + manual approach to evaluate the quality of S2S translation in Shoonya
- G4.3 Personalized TTS: Build models for personalized style transfer - voice cloning and capturing delivery/emphasis
- G4.4 NMT for spoken text: Improve NMT to support the long informal sentences used in speech
- G4.5 (Stretch) Production S2S system: Achieve production readiness - the entire pipeline runs in near real-time on a portable Nvidia DGX box

G5: Digitize a Million Books
- G5.1 OCR: Build datasets/models/benchmarks for OCR for 12 Indian scripts
- G5.2 Document understanding: Build datasets/models/benchmarks for document understanding
- G5.3 OCR <-> Shoonya: Develop interfaces in Shoonya for correcting scanned content
- G5.4 A million-books repo: Digitize 1 million books and host them online with a searchable index
- G5.5 (Stretch) On-demand books: Make books available page-wise and on demand, in text/speech, in the language of choice

G6: Natural Language Understanding
- G6.1 IndicCorp: A large and diverse collection of raw text data for 22 Indian languages
- G6.2 IndicGLUE: Benchmarks for NLU in 12 Indic languages
- G6.3 IndicXtreme: Benchmarks for cross-lingual NLU between English and Indian languages
- G6.4 IndicBERT: Pre-trained bidirectional language model for 22 Indian languages
- G6.5 IndicNER: NER for 11 Indian languages
- G6.6 Task-specific models: Models for question answering and sentiment analysis

G7: Natural Language Generation
- G7.1 IndicNLG Suite: Benchmarks for Indian-language NLG tasks
- G7.2 IndicBART: Pre-trained sequence-to-sequence model for 22 Indian languages
- G7.3 IndicGPT: Pre-trained causal language model for 22 Indian languages
- G7.4 Task-specific models: Models for tasks like summarization and paraphrase generation
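As a toy illustration of the "searchable index" in the Digitize a Million Books goal, here is a minimal inverted index over digitized pages. The class and method names are hypothetical, and whitespace tokenization is a placeholder: a production system would need Indic-aware tokenization, ranking, and persistent storage.

```python
from collections import defaultdict

class PageIndex:
    """Toy inverted index: maps each token to the set of pages containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of page ids

    def add_page(self, page_id: str, text: str) -> None:
        # Naive whitespace tokenization; real Indic text needs proper tokenizers.
        for token in text.lower().split():
            self.postings[token].add(page_id)

    def search(self, query: str) -> set:
        """Return ids of pages containing every token in the query (AND search)."""
        tokens = query.lower().split()
        if not tokens:
            return set()
        result = self.postings[tokens[0]].copy()
        for tok in tokens[1:]:
            result &= self.postings[tok]
        return result

# Hypothetical usage: page ids encode book and page number.
idx = PageIndex()
idx.add_page("book1:p1", "ancient history of india")
idx.add_page("book2:p4", "history of indian languages")
print(idx.search("history india"))  # only the page containing both tokens
```

The AND-intersection semantics shown here is the simplest useful design; extending it to ranked retrieval (e.g. TF-IDF scoring) is the natural next step for a million-book corpus.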
