People+ai Theses

Data

Ten Trillion Tokens
India's AI potential hinges on developing a vast corpus of data in Indian languages and contexts, estimated at ten trillion tokens across various topics, languages, and modalities.
Given the current data size of approximately 100 billion tokens, achieving this goal requires a collaborative, open effort. By uniting stakeholders such as universities, model builders, and government agencies, we can create an open, digital public good that addresses India's unique challenges. These include improving healthcare access, managing tuberculosis treatment, simplifying welfare access for low-literacy populations, and delivering agricultural expertise in local languages. Through partnerships with the government and the implementation of population-scale use cases, we aim to establish a data feedback loop that continuously enriches the corpus.
Project: Ten Trillion Tokens will foster an open data and open model ecosystem, unlocking AI's value for India's diverse population.
Where do you see a gap in the language capabilities of Indian models?
Will data rooted in Indian languages and contexts be the key to building better AI for India?

IndicLLMSuite’s Sangraha is the largest high-quality, cleaned Indic-language pre-training dataset, yet it contains only 251B tokens across 22 languages!

Sangraha is built on 3 types of data:
Verified data: scraped text from websites, OCR-extracted text from high-quality Indic-language PDFs, and transcribed text from Indic-language videos, podcasts, movies, courses, etc.
Unverified data: high-quality Indic-language data extracted from existing multilingual corpora after filtering.
Synthetic data: Wikimedia English content translated into 14 Indic languages and further romanised by transliterating those 14 languages into Latin script (a minimal transliteration sketch follows this list).
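The romanisation step can be approximated with an off-the-shelf transliteration library. Below is a minimal sketch assuming the indic_transliteration Python package; Sangraha's own pipeline may use a different tool, and the sample sentence is only illustrative.

```python
# Romanise Devanagari text into a Latin-script scheme (ITRANS here).
# Assumes: pip install indic_transliteration
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

devanagari_text = "भारत एक विविध देश है"  # illustrative sentence
romanised = transliterate(devanagari_text, sanscript.DEVANAGARI, sanscript.ITRANS)
print(romanised)  # Latin-script rendering, usable as additional synthetic training text
```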
There remains vast scope for improving the performance of Indic LLMs, spread across LLMs, corpora, evaluation, techniques, and tools, which is an indication of the usefulness of building an Indic language toolkit. Despite these efforts, the tokens collected so far are nowhere near the magnitude we require.
Despite its impressive scale, several gaps and limitations have been identified:
Quality Variability: While efforts have been made to ensure high-quality data, the intrinsic variability in quality from different sources (websites, PDFs, videos) poses a challenge. This variability may affect the consistency and reliability of the models trained on this dataset.
Limited Representation of Low-Resource Languages: Some languages, particularly those with a lower digital presence, are underrepresented. This limitation affects the dataset's ability to capture the full linguistic diversity of India, including dialects and regional variations.
Synthetic Data Concerns: A significant portion of the dataset is derived from translations of existing English datasets. While this method increases the dataset's size, it may not accurately reflect real-world language use, potentially impacting the model's performance in generating natural responses.
Crowdsourcing Limitations: The crowdsourced data tends to have uneven coverage across different Indian states and demographics, particularly lacking representation from certain age groups and regions.
Evaluation Needs: There is a need for further research to evaluate the effectiveness of the models trained on this dataset across various applications and domains, which is not fully addressed in the current framework.
These gaps highlight ongoing challenges in developing comprehensive and representative datasets for Indic languages, suggesting areas for future research and improvement.
Are you a startup or organisation building, or looking for, better data curation pipelines that can handle the nuances of Indian datasets? Share your insights with us.
For individuals:
For organisations:

Quantity and quality are both vital to building a capable model

Limited accessibility concentrates the available data mostly in Hindi and leaves larger gaps in other languages. Dialects and spoken language have even fewer instances in training data.
Media produced in Indian languages is not diverse and is largely restricted to mass-media content, with high-quality podcasts and educational resources being much smaller in volume.
The major sources of data for pre-training in Indic languages are Wikipedia and web crawls. However, quality is compromised: Wikipedia is high-quality text but sparsely populated, while corpora from CommonCrawl and mC4 are unfiltered and noisy.
Quality data can be created and collected but requires a large human effort to curate.
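To make the curation effort concrete, here is a minimal sketch of the heuristic noise filter that web-crawled corpora typically need; the thresholds are illustrative assumptions rather than the settings of any particular pipeline.

```python
# Heuristic noise filter for web-crawled text (CommonCrawl/mC4-style pages).
# Thresholds below are illustrative assumptions, not tuned values.
def keep_document(text: str, min_words: int = 50, max_symbol_ratio: float = 0.3) -> bool:
    words = text.split()
    if len(words) < min_words:  # drop very short pages (menus, boilerplate)
        return False
    symbols = sum(1 for ch in text if not (ch.isalpha() or ch.isspace()))
    if symbols / max(len(text), 1) > max_symbol_ratio:  # drop markup/punctuation-heavy pages
        return False
    if len(set(words)) / len(words) < 0.3:  # drop highly repetitive pages
        return False
    return True
```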

A highly capable model needs data for pre-training, fine-tuning and modality tuning, each of which is focused on a different balance of data QUALITY and QUANTITY.

| # | Method | What is the method and the process? | Human effort | Compute effort (what metric?) | Cost (approx., in money) |
|---|--------|-------------------------------------|--------------|-------------------------------|--------------------------|
| 1 | Translating Wikipedia | IndicTrans (AI4B) | Low | High | Medium |
| 2 | Crawling the web for Indic data | Setu (AI4B) | Low | Medium | Low |
| 3 | Vendor-based digitisation of published data | How do we choose the right kind of vendors? What are the costs and incentives? | High | Low | Very High |
| 4 | Volunteer-based data creation camps and activities (e.g. colleges) | How do we create engagement to bring an audience? How do we set up the process to ensure quality? | High | Low | High |
| 5 | Textualising YouTube, news and other audio-visual data sources | How do we choose the correct existing data, and by what metrics? Can we use the Same Language Subtitling effort to increase data availability? | Low | High | Medium |
| 6 | Government and public-good use cases to generate data | What kind of data can the use case generate? How do we create an architecture that allows the data to be shared back into the public domain while preserving user privacy? | High | High | High |
| 7 | Public-private partnerships with licensing | Can we get the government to share royalty-free access to all data from Doordarshan, AIR and government publishing houses? | Medium | Low | High |

Our plan is to use multiple generation pipelines to get to Ten Trillion Tokens

Translating Wikipedia to generate synthetic data

Synthetic data can be generated by pure translation of English content on Wikipedia.
This will require filtering the content to ensure it is knowledge-rich. Machine translations must also be verified for correctness, since meaning can be lost through misinterpreted context or weak coverage of Indic languages in some translation systems.
The process will be compute-intensive but low on human effort. An initial model can be bootstrapped to translate the remaining data.
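The table above names IndicTrans (AI4B) as the translation method; the sketch below uses Meta's openly available NLLB model as a stand-in simply because its Hugging Face usage is compact to show. The model choice, language codes and sample paragraph are assumptions, not the project's actual setup.

```python
# Translate filtered English Wikipedia paragraphs into an Indic language (Hindi here).
# Stand-in model: facebook/nllb-200-distilled-600M; the plan described above would
# use IndicTrans (AI4Bharat) instead.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="hin_Deva",  # one target Indic language per pass
)

paragraphs = [
    "The water cycle describes how water moves between land, oceans and the atmosphere.",
]

for para in paragraphs:
    hindi = translator(para, max_length=400)[0]["translation_text"]
    print(hindi)  # candidate synthetic text; still needs verification before use
```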

Crawling existing Indic web sources

A simple web crawler has been set up to pick up Indian language text being posted on the web. This process is run every 3–4 months with minimal human intervention and some compute.
This is expected to generate tokens on the order of a few billion.
Both of the above methods already exist and can be set up with low human effort.
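Both pipelines hinge on reliably identifying which Indic language a crawled page is in. Below is a minimal sketch of that routing step, assuming the langdetect package; Setu and production crawlers use stronger identifiers plus deduplication, and the language list shown is only a subset.

```python
# Route crawled pages into per-language buckets using a lightweight language identifier.
# Assumes: pip install langdetect
from langdetect import detect

INDIC_CODES = {"hi", "bn", "ta", "te", "mr", "gu", "kn", "ml", "pa", "or"}  # subset, for illustration

def bucket_page(text: str) -> str | None:
    """Return the language-code bucket for a crawled page, or None if not Indic."""
    try:
        code = detect(text)
    except Exception:  # langdetect raises on empty or very short input
        return None
    return code if code in INDIC_CODES else None
```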

Knowledge dense data from printed and publisher content

To generate rich, knowledge-dense data, publisher and print content can be looped into the dataset. It is better to partner with larger publishers than to go after smaller players, as the same process setup yields more tokens.
This is more complicated, as it needs larger human effort, including:
Setting up systems for acquiring the data and scanning it into electronic format (a minimal OCR sketch follows this list).
Establishing payment models to incentivise publishers, such as a share of the profits from the model's commercial uses or an upfront payment.
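For the scanning step, here is a minimal OCR sketch assuming pytesseract with the relevant Indic traineddata installed ("hin" for Hindi below); real digitisation pipelines add layout analysis and human proofreading, and the file name is hypothetical.

```python
# OCR one scanned page of publisher content into text.
# Assumes: Tesseract installed with Indic language packs, plus pip install pytesseract pillow
from PIL import Image
import pytesseract

def ocr_page(image_path: str, lang: str = "hin") -> str:
    """Extract text from a single scanned page image."""
    page = Image.open(image_path)
    return pytesseract.image_to_string(page, lang=lang)

print(ocr_page("scanned_page_001.png"))  # hypothetical file name
```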

Using data creation tasks from public camps and activities

Data generation tasks in which a knowledge-dense text is translated into Indian languages can be used. One such cohort was run at IIT Madras, where 500 students attended an LLM workshop and each translated a 500-word piece into a local language. This is expected to add quality and diversity to the data.
However, the problem of measuring the quality of this data would need to be tackled (a sketch of simple submission checks follows). The process can be refined iteratively to get better results from each workshop. Engagement with such programmes in colleges and universities can be incentivised and promoted by faculty.
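One way to start measuring submission quality is a set of cheap automatic checks before human review. A sketch, with illustrative thresholds and the langdetect package assumed:

```python
# Cheap automatic checks on a workshop translation before human review.
# Thresholds are illustrative assumptions.
import hashlib
from langdetect import detect

seen_hashes: set[str] = set()

def accept_submission(source_en: str, translation: str, target_lang: str) -> bool:
    digest = hashlib.sha256(translation.encode("utf-8")).hexdigest()
    if digest in seen_hashes:  # exact duplicate of an earlier submission
        return False
    ratio = len(translation.split()) / max(len(source_en.split()), 1)
    if not 0.5 <= ratio <= 2.0:  # suspiciously short or long relative to the source
        return False
    try:
        if detect(translation) != target_lang:  # output not in the requested language
            return False
    except Exception:
        return False
    seen_hashes.add(digest)
    return True
```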
Creating such multi-modal data can take two approaches: (a) generating data through use cases and (b) sourcing existing data.

Generation of data by crafting the right use cases

Interactions of people with existing apps and websites can be captured. Many Bhashini use cases have a data loop in which the captured data is used to enhance the model. The contributions of many such use cases would add significantly to the TTT goal. Guardrails will have to be set up to ensure safety and ethical guidelines are followed, and quality checks and preprocessing pipelines will have to be applied to ensure the data is useful and meets the standard.
Public-private partnerships would be needed to reach our goal. Governments can help by implementing a licence to share data back, under which every user of such a service would have their data anonymised and shared to improve the model (a minimal anonymisation sketch follows).
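What "anonymise and share data back" could look like in practice is sketched below; the record fields and scrubbing rules are hypothetical, and a real deployment would need a fuller PII policy.

```python
# Anonymise one interaction record before sharing it back into an open corpus.
# Field names and scrubbing rules are hypothetical.
import hashlib
import re

PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,}\d")  # crude phone-number pattern

def anonymise_record(record: dict, salt: str) -> dict:
    return {
        "user": hashlib.sha256((salt + record["user_id"]).encode()).hexdigest()[:16],  # salted one-way hash
        "text": PHONE_RE.sub("<PHONE>", record["message"]),  # strip phone numbers from free text
        "lang": record["lang"],
    }
```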
What use cases can generate text or voice data in Indian languages?
Launch a Meta WhatsApp bot
Call centres using automated voice bots

Textualise audio and video content from YouTube, news channels, etc.

YouTube has a plethora of Indian educational content and entertainment in regional languages. News channels run 24x7 live streams, and this audio can be passed through an ASR model to produce a large amount of data (a minimal ASR sketch follows the example sources below). This data, however, is low in quality in both audio and content: knowledge-heavy data is much scarcer than knowledge-light data and needs manual selection. These sources are also useful for obtaining speech tokens to build the model's natural speech capability.
Courtroom transcripts
Voice artists' content
Political debates
Indian podcasts of quality
Government call centre recordings
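The ASR step mentioned above can be prototyped with an open model. A minimal sketch assuming the openai-whisper package as a stand-in; an Indic-tuned ASR system would likely perform better on these sources, and the file name is hypothetical.

```python
# Transcribe an Indian-language audio clip into text for the corpus.
# Assumes: pip install openai-whisper (and ffmpeg available on the system)
import whisper

model = whisper.load_model("small")
result = model.transcribe("news_clip.mp3", language="hi")  # hypothetical clip
print(result["text"])  # transcript, to be quality-filtered before joining the corpus
```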
Help us fill these dataset gaps: what streams of Indian audio have we left out?
Are we missing something? Are there other ways to collect Indian data? Let us know here.

Synthetic data

Sanskrit as a base language

Projects

| # | Project / Seed | Description | Status |
|---|----------------|-------------|--------|
| 1 | Licensing private use case data to form open-source datasets | Building a use case to generate a closed loop of data generation and model improvement | Ongoing |
| 2 | Mapping the existing number of Indic tokens in every Indian language | | Open |
| 3 | Creating student guilds for data creation through workshops and datathons | | Open |
| 4 | Digitising existing paper media | | Open |
| 5 | Curating speech tokens in Indian dialects and accents | | Open |

Call to action!

Reaching the TTT goal will be a collective effort of many people and organisations. We need the cooperation of the government in providing data, startups to generate data, and big conglomerates to accelerate these efforts. That's not all: researchers need to drive the direction of data collection to create more effective and cheaper tools and pipelines.
Data is everything. In a world where everything runs on AI and AI is built on data, India can only catch up to the quality of frontier foundational models if it invests in generating these tokens. Applications require an understanding of Indian languages and contexts to provide accessible, effective and useful solutions to an Indian audience.
Partner with us to reach our TTT goal. What specific effort can we collaborate on?

This document was reviewed by:

Mitesh Khapra - AI4Bharat
