Models will be able to better understand Indic languages and perform better on use-case-specific tasks. Answers generated will be rooted in an Indian context. In the process, datasets, data collection tools, and processes will be built.
Models will get better at language understanding. ASR and TTS models will meet enterprise standards across use cases. Training and fine-tuning for any use case, including small language models, will be well resourced.
The dataset will be built with:
- A detailed schema to represent the complex dataset (sketched below)
- Multi-label, multi-category annotation of the dataset
- Safety annotations of the dataset
- Raw and tokenized content
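To make the schema bullet concrete, here is a minimal sketch of what a single record could look like. Every name here (`DatasetRecord`, `SafetyAnnotation`, the field names, the example values) is a hypothetical placeholder for illustration, not a finalized design:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SafetyAnnotation:
    """Hypothetical safety label attached to a record."""
    category: str   # e.g. "hate_speech", "pii", "none"
    severity: int   # e.g. 0 (benign) .. 3 (severe)
    annotator_id: str


@dataclass
class DatasetRecord:
    """One record of the corpus (illustrative only)."""
    record_id: str
    language: str                        # BCP-47 tag, e.g. "hi-IN", "te-IN"
    source: str                          # e.g. "essay_competition", "courtroom_transcript"
    raw_text: str                        # original, untokenized content
    tokens: Optional[list[str]] = None   # tokenized form, filled in by a later stage
    labels: list[str] = field(default_factory=list)       # multi-label annotations
    categories: list[str] = field(default_factory=list)   # multi-category tags
    safety: list[SafetyAnnotation] = field(default_factory=list)
```

Keeping raw and tokenized content as separate fields lets the same record be re-tokenized as tokenizers improve, without re-collecting the data.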
Tooling contributions:
- Automatic data collection pipelines
- Tooling to support human raters/writers
- Automatic pre-training data quality checks (see the sketch after this list)
- Autoraters for final response generation

To get started on building our datasets, we will run multiple experiments: clever ideas that test how we can build this dataset, and the quality and contribution of different methods.
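As one illustration of the automatic pre-training data quality check mentioned above, here is a minimal sketch that filters raw text on simple heuristics: a length floor, a target-script ratio, and exact-duplicate detection. The thresholds, the Devanagari-only script check, and the function name `passes_quality_check` are all assumptions for illustration, not decided policy:

```python
import hashlib
import re

# Assumed thresholds for illustration; real values would come from experiments.
MIN_CHARS = 50
MIN_INDIC_RATIO = 0.5

# Devanagari block as a stand-in; a real check would cover all target scripts.
INDIC_CHARS = re.compile(r"[\u0900-\u097F]")

_seen_hashes: set[str] = set()


def passes_quality_check(text: str) -> bool:
    """Return True if a raw text snippet passes basic pre-training filters."""
    text = text.strip()
    if len(text) < MIN_CHARS:
        return False  # too short to be useful
    indic_ratio = len(INDIC_CHARS.findall(text)) / len(text)
    if indic_ratio < MIN_INDIC_RATIO:
        return False  # mostly non-target-script content
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in _seen_hashes:
        return False  # exact duplicate already seen
    _seen_hashes.add(digest)
    return True
```

Exact-hash deduplication is the simplest possible choice here; near-duplicate detection (e.g. MinHash) would catch more redundancy but is beyond this sketch.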
No one process or source will give us all this data, so we need to start trying ambitious projects that can deliver large volumes of data at low cost using India's advantages. This will also include understanding and scaling tools for data collection, tools/methods for QA/verification, and so on.
These are our ideas:
- For Haqdarshak and MoSJE, build an AI intervention at different layers of the welfare delivery system
- Place an application on an iPad in schools to capture student and teacher interactions, bootstrap an existing AI model to create small exercises that engage the students, and feed the recorded data back into the model to keep improving it
- Same-language subtitling in movies
- News and podcast licensing from Doordarshan
- Telangana government TB follow-ups
- Sound captcha: ask people to verbally identify a displayed image
- Donating to the effort to capture domain-specific tokens
- Volunteer efforts at colleges and universities to write essays on Indian topics in different languages
- Large-scale essay, creative writing, and elocution competitions whose submissions can be fed into the corpus
- Digitisation and publisher licences for Indian books
- Recording and transcribing courtroom sessions or legislative proceedings (following suit with what the EU enforced)
- Encouraging more Indian content and Indian-language blogs and online publishing by identifying and bringing down existing barriers

Are there any use cases that stand out to you? Or any that you think we have missed?
Contribute: Volunteer and Open roles
We are open to contributions and feedback. Feel free to leave a comment and let us know you are interested, and we will reach out to you.