Models will be able to better understand Indic languages and perform better on use-case-specific tasks. Answers generated will be rooted in an Indian context. In the process, datasets, data collection tools, and processes will be built.
Models will get better at language understanding. ASR and TTS models will meet enterprise standards across use cases. Training and fine-tuning for any use case, including small language models, will be well resourced.
The dataset will be built with:
- A detailed schema to represent the complex dataset (sketched below)
- Multi-label, multi-category annotation of the dataset
- Safety annotations of the dataset
- Raw and tokenized content
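To make the schema bullet concrete, here is a minimal sketch of what a single record could look like. Every name here (`DatasetRecord`, `SafetyAnnotation`, the field names, the example values) is a hypothetical placeholder for illustration, not a finalized design:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SafetyAnnotation:
    """Hypothetical safety label attached to a record."""
    category: str   # e.g. "hate_speech", "pii", "none"
    severity: int   # e.g. 0 (benign) .. 3 (severe)
    annotator_id: str


@dataclass
class DatasetRecord:
    """One record of the corpus (illustrative only)."""
    record_id: str
    language: str                        # BCP-47 tag, e.g. "hi-IN", "te-IN"
    source: str                          # e.g. "essay_competition", "courtroom_transcript"
    raw_text: str                        # original, untokenized content
    tokens: Optional[list[str]] = None   # tokenized form, filled in by a later stage
    labels: list[str] = field(default_factory=list)       # multi-label annotations
    categories: list[str] = field(default_factory=list)   # multi-category tags
    safety: list[SafetyAnnotation] = field(default_factory=list)
```

Keeping raw and tokenized content as separate fields lets the same record be re-tokenized as tokenizers improve, without re-collecting the data.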
Tooling contributions:
- Automatic data collection pipelines
- Tooling to support human raters/writers
- Automatic pre-training data quality checks (see the sketch after this list)
- Autoraters for final response generation

To get started on building our datasets, we will run multiple experiments: clever ideas that test how we can build this dataset, and the quality and contribution of different methods.
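As one illustration of the automatic pre-training data quality check mentioned above, here is a minimal sketch that filters raw text on simple heuristics: a length floor, a target-script ratio, and exact-duplicate detection. The thresholds, the Devanagari-only script check, and the function name `passes_quality_check` are all assumptions for illustration, not decided policy:

```python
import hashlib
import re

# Assumed thresholds for illustration; real values would come from experiments.
MIN_CHARS = 50
MIN_INDIC_RATIO = 0.5

# Devanagari block as a stand-in; a real check would cover all target scripts.
INDIC_CHARS = re.compile(r"[\u0900-\u097F]")

_seen_hashes: set[str] = set()


def passes_quality_check(text: str) -> bool:
    """Return True if a raw text snippet passes basic pre-training filters."""
    text = text.strip()
    if len(text) < MIN_CHARS:
        return False  # too short to be useful
    indic_ratio = len(INDIC_CHARS.findall(text)) / len(text)
    if indic_ratio < MIN_INDIC_RATIO:
        return False  # mostly non-target-script content
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in _seen_hashes:
        return False  # exact duplicate already seen
    _seen_hashes.add(digest)
    return True
```

Exact-hash deduplication is the simplest possible choice here; near-duplicate detection (e.g. MinHash) would catch more redundancy but is beyond this sketch.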
No one process or source will give us all this data, so we need to start trying ambitious projects that can deliver large volumes of data at low cost using India's advantages. This will also include understanding and scaling tools for data collection, tools/methods for QA/verification, and so on.
These are our ideas:
- For Haqdarshak and MoSJE, build an AI intervention at different layers of the welfare delivery system
- Place an application on an iPad in schools to capture student and teacher interactions, bootstrap an existing AI model to create small exercises that engage the students, and feed the recorded data back into the model to keep improving it
- Same-language subtitling in movies
- News and podcast licensing from Doordarshan
- Telangana government TB follow-ups
- Sound captcha: ask people to verbally identify a displayed image
- Donating to the effort to capture domain-specific tokens
- Volunteer efforts at colleges and universities to write essays on Indian topics in different languages
- Large-scale essay, creative writing, and elocution competitions whose submissions can be fed into the corpus
- Digitisation and publisher licences for Indian books
- Recording and transcribing courtroom sessions or legislative proceedings (following suit with what the EU enforced)
- Encouraging more Indian content and Indian-language blogs and online publishing by identifying and bringing down existing barriers

Are there any use cases that stand out to you? Or any that you think we have missed?
Contribute: Volunteer and Open roles
We are open to contributions and feedback. Feel free to leave a comment and let us know you are interested, and we will reach out to you.