Models will understand Indic languages better and perform better on use-case-specific tasks. Generated answers will be rooted in an Indian context. In the process, datasets, data collection tools, and processes will be built.
Models will get better at language understanding. ASR and TTS models will reach enterprise standard across use cases. Training and fine-tuning for any use case or small language model will be well resourced.
The dataset will be built with:
Detailed schema to represent the complex dataset
Multi-label, multi-category annotation of the dataset
Safety annotations of the dataset
Raw and tokenized content
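As a rough illustration of how the four items above could fit into one record, here is a minimal sketch. Every name in it (`DatasetRecord`, `SafetyAnnotation`, the field names, the label values, and the severity scale) is an assumption made for illustration and not an agreed schema. The whitespace split stands in for a real tokenizer.

```python
from dataclasses import dataclass, field


@dataclass
class SafetyAnnotation:
    # Hypothetical safety label, e.g. "hate_speech" or "self_harm".
    category: str
    # Hypothetical scale: 0 (no concern) to 3 (severe).
    severity: int


@dataclass
class DatasetRecord:
    record_id: str
    language: str  # assumed to be a language tag such as "hi" or "ta"
    raw_text: str
    # Tokenized content is stored next to the raw content.
    tokens: list[str] = field(default_factory=list)
    # Multi-category, multi-label annotations: category -> list of labels.
    labels: dict[str, list[str]] = field(default_factory=dict)
    safety: list[SafetyAnnotation] = field(default_factory=list)

    def add_label(self, category: str, label: str) -> None:
        """Attach a label under a category, ignoring exact duplicates."""
        values = self.labels.setdefault(category, [])
        if label not in values:
            values.append(label)

    def is_flagged(self, min_severity: int = 1) -> bool:
        """True if any safety annotation reaches the given severity."""
        return any(a.severity >= min_severity for a in self.safety)


record = DatasetRecord(record_id="demo-1", language="hi", raw_text="नमस्ते दुनिया")
record.tokens = record.raw_text.split()  # placeholder for a real tokenizer
record.add_label("domain", "greeting")
record.add_label("register", "informal")
record.safety.append(SafetyAnnotation(category="none", severity=0))
```

Keeping labels as a mapping from category to a list of labels lets one record carry several labels in several categories without changing the record shape as new categories are added.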
Tooling contributions:
Automatic data collection pipelines
Family of tokenizers
Tooling to support human raters/writers
Automatic pre-training data quality checks
Autoraters for final response generation
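To give a sense of what an automatic pre-training data quality check might look like, the sketch below filters text by length and by the share of characters in the expected script. It is only one possible heuristic, shown for Devanagari. The function names and the thresholds (`min_chars`, `min_ratio`) are assumptions, and a real pipeline would likely combine several such checks.

```python
# Unicode block for Devanagari (used by Hindi, Marathi, and others).
DEVANAGARI_START = 0x0900
DEVANAGARI_END = 0x097F


def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Devanagari block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    in_script = sum(1 for c in chars if DEVANAGARI_START <= ord(c) <= DEVANAGARI_END)
    return in_script / len(chars)


def passes_quality_check(text: str, min_chars: int = 20, min_ratio: float = 0.7) -> bool:
    """Keep text that is long enough and mostly written in the expected script.

    Both thresholds are illustrative defaults, not tuned values.
    """
    non_space = sum(1 for c in text if not c.isspace())
    if non_space < min_chars:
        return False
    return devanagari_ratio(text) >= min_ratio


samples = ["नमस्ते दुनिया", "hello world"]
kept = [s for s in samples if passes_quality_check(s, min_chars=5)]
```

The ratio is computed over all non-whitespace characters rather than letters only, because Devanagari vowel signs are combining marks and would be missed by a letters-only count.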
To get started on building our datasets, we will run multiple experiments: clever ideas that test how we can build this dataset and that measure the quality and contribution of different methods.
No single process or source will give us all this data, so we need to start trying ambitious projects that use India's advantages to yield large volumes of data at low cost. This will also include understanding and scaling tools for data collection, tools and methods for QA and verification, etc.
These are our ideas:
For Haqdarshak and MoSJE: build an AI intervention at different layers of the welfare delivery system
Place an application on an iPad in schools to capture student and teacher interactions: bootstrap an existing AI model to create small exercises that engage the students, feed the recorded data back to the model, and keep improving it
Same language subtitling in movies
News and podcasts licensing from Doordarshan
Telangana government TB follow ups
Sound CAPTCHA: ask people to verbally identify a displayed image
Donations to the effort to capture domain-specific tokens
Volunteer efforts at colleges and universities to write essays on Indian topics in different languages
Large-scale essay, creative writing, and elocution competitions whose submissions can be fed into the corpus
Digitisation of, and publisher licences for, Indian books
Recording and transcribing courtroom sessions or legislative conferences (following suit with what the EU enforced)
Encouraging more Indian content, Indian-language blogs, and online publishing by identifying and removing existing barriers
Are there any use-cases that stand out to you? Or any that you think we have missed?
Contribute: Volunteer and Open roles
We are open to contributions and feedback. Feel free to leave a comment to let us know you are interested, and we will reach out to you.