Ten Trillion Tokens


tl;dr

Low-resourced languages are underrepresented in datasets, leading to a plateau in models’ performance and capabilities.
Existing evaluations and leaderboards for Indic models are not robust and cannot diagnose poor performance in real-world applications.

Objectives of the project

Identify the datasets that not only achieve the goal of language understanding but are also critical to developing use cases.
To see how good current models are on these, we need benchmarks and leaderboards. Our first aim is to build these to understand where we are and where we lack, and to start collecting data accordingly.
We will also need to experiment with various licensing regimes and business models to create incentive structures for data contribution and usage that sustain this activity.
In the process, we will build guardrails for safety and context understanding.

TTT projects

Part 1: The problem and gaps where we will direct our efforts

A: Low-resourced languages are underrepresented in datasets, leading to a plateau in models’ performance and capabilities.

IndicLLMSuite is the largest collection of high-quality, cleaned Indic-language pre-training data, yet it contains only 251B tokens across 22 languages!
It is difficult to identify the right kind of data for any task. Typically, a highly capable model needs data for pre-training, fine-tuning and modality tuning.
1. Pre-training data, to improve a model’s understanding of a language and the world. Data for pre-training is rich in knowledge density, language usage, and diversity of opinion and domain. We need on the order of 1-10 trillion tokens in every language.
2. Fine-tuning data, required to meet every use case’s requirements. Fine-tuning data includes domain-specific knowledge, prompt-response pairs and output-style interactions. An estimated 1 million prompt-response pairs per language will be needed.
3. Multimodality data. Diverse, high-quality conversational and image data is needed to improve multimodal capabilities. 10-100K hours of data will be needed.
Current work on improving the performance of Indic LLMs is divided across models, corpora, evaluation, techniques and tools.
Despite several efforts to collect tokens across languages and domains, the magnitude of Indic data does not meet our needs.
Listed below are the factors behind the lack of data in Indic languages:
For data to be usable, quality is vital; maintaining quality across collection sources, and even identifying the right quality metric, is often difficult. Moreover, quality assurance for datasets can involve vast human effort, and finding the right skills, incentivising contributors and financing these efforts is challenging.
The volume of content generated in Indian languages is much lower than in Western languages. Although these languages are native to a very large population, much of the language usage, be it conversation or writing, is not captured.
This is further limited by a lower digital presence in many communities, creating holes in the language corpus. Data must ultimately be digitised to train models, and the extra effort needed to digitise Indic-language resources adds a steep barrier.
Limited accessibility concentrates the available data largely in Hindi and leaves larger gaps in other languages. Dialects and spoken language have even fewer instances in training data.
Translation makes up the majority of the data used to train Indic models. The quality of these datasets is reduced by the inaccuracy of translation models and the loss of inherent linguistic features.
The major sources of data for pre-training in Indic languages are Wikipedia and web crawling. However, quality is compromised: Wikipedia is high-quality text that is sparsely populated, while corpora from CommonCrawl and mC4 are unfiltered and noisy.
Crowdsourcing has been another popular method of building parts of these datasets, but the demographics of data collected this way tend to be biased and the quality lacking.
Major media produced in Indian languages is not diverse, being restricted to mass-media content, with high-quality podcasts and educational resources much smaller in volume.
Perhaps the most difficult task of all is identifying the right ‘data mix’: the combination of domains and tasks that will improve a model’s performance.

B: Existing evaluations and leaderboards for Indic models are not robust and cannot diagnose poor performance in real-world applications.

There is a need for research to evaluate the effectiveness of models trained on these datasets across various applications and domains, which is not fully addressed by current frameworks.
Here are some shortcomings of Indic model evaluations:
Larger benchmarks in Indic languages fail to capture end-user needs and remain largely academic. Many of these benchmarks are built purely with academic goals of model improvement in mind, so the resulting evaluations are of little use in scenarios involving real-world applications.
When models are used in applications today, they are applied as a chain involving ASR, generation, TTS and so on. Evaluations built to find the best model at a single task are not useful for evaluating the chain as a whole (see the sketch after this list).
Evaluations of models today do not cover the unique nuances that Indian languages and use cases present; they do not test for Indian context.
Models need to be evaluated for effectiveness in reach, at population scale. Reaching population scale is made harder by the nuances and complexities of Indian languages and by the number of dialects, accents and colloquialisms within regions.
Alongside multilingual or overall model performance, domain- and task-specific performance is vital to finding the best fit for an enterprise AI application.
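Here is a minimal sketch of what chain-level evaluation could look like, in contrast to scoring each component alone. It is an illustration under our own assumptions: the component functions are supplied by the caller, and task_success is a hypothetical end-to-end metric, not an existing library call.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    audio: bytes   # a spoken user query
    expected: str  # the outcome the user should reach

def evaluate_chain(
    test_set: list[Example],
    transcribe: Callable[[bytes], str],          # ASR: speech -> text
    generate: Callable[[str], str],              # LLM: text -> text
    synthesise: Callable[[str], bytes],          # TTS: text -> speech
    task_success: Callable[[bytes, str], bool],  # did the user get what they needed?
) -> float:
    """Score the user-visible outcome of the whole ASR -> LLM -> TTS chain,
    rather than each component's isolated metric (e.g. word error rate)."""
    hits = 0
    for ex in test_set:
        reply_audio = synthesise(generate(transcribe(ex.audio)))
        hits += task_success(reply_audio, ex.expected)
    return hits / len(test_set)
```

A model that tops an isolated ASR leaderboard can still drag this chain score down, and the chain score is the number an application builder actually cares about.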
How does an AI application builder choose a language model?
To choose a model off the market today, companies run evaluations on their private datasets. The greatest difference in evaluation quality comes from the evaluation dataset, which is closely tied to the task the models will be used for.

Part 2: Defining the scope and objectives we find value in

The data and models problem is huge, and trying to solve for all edge cases would be a mammoth effort. We are setting ambitious goals and defining a scope that allows the most impact to be made.
1. Of the 270-odd languages in India, where do we begin?
2. What are the important domains to be looking at? And what use cases are most relevant to us?
To make progress fastest, we will assume all efforts are primarily within the 10 most spoken languages in India:
Hindi: 52.83 crore
Bengali: 9.72 crore
Marathi: 8.30 crore
Telugu: 8.11 crore
Tamil: 6.90 crore
Gujarati: 5.54 crore
Kannada: 4.37 crore
Odia: 3.75 crore
Malayalam and Urdu
Here are our objectives to reach Ten Trillion Tokens:
Identify the datasets that not only achieve the goal of language understanding but are also critical to developing use cases.
To see how good current models are on these, we need benchmarks and leaderboards. Our first aim is to build these to understand where we are and where we lack, and to start collecting data accordingly.
We will also need to experiment with various licensing regimes and business models to create incentive structures for data contribution and usage that sustain this activity.
In the process, we will build guardrails for safety and context understanding.

Part 3: Reaching the objectives, our outcomes and success measures

A: Models will be able to better understand Indic languages and perform better on use-case-specific tasks. Answers generated will be rooted in an Indian context. In the process, datasets, data collection tools and processes will be built.

Models will get better at language understanding. ASR and TTS models will be enterprise-standard across use cases. Training and fine-tuning for any use case or small language model will be well resourced.
Datasets will be built with the following (an illustrative record sketch follows this list):
Detailed schema to represent the complex dataset
Multi-label, multi-category annotation of the dataset
Safety annotations of the dataset
Raw and tokenized content
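As an illustration only, a single record in such a dataset might carry fields like the sketch below; the field names and categories are our assumptions, not a finalised schema:

```python
from dataclasses import dataclass, field

@dataclass
class CorpusRecord:
    """One illustrative record: raw and tokenized content plus multi-label
    annotations and safety labels, mirroring the list above."""
    doc_id: str
    language: str                                      # e.g. "hi", "ta", "bn"
    raw_text: str                                      # content as collected
    tokens: list[int] = field(default_factory=list)    # output of a tokenizer
    # Multi-label, multi-category annotation: a record may belong to
    # several domains and tasks at once.
    domains: list[str] = field(default_factory=list)   # e.g. ["health", "welfare"]
    tasks: list[str] = field(default_factory=list)     # e.g. ["qa", "summarisation"]
    safety: dict[str, float] = field(default_factory=dict)  # e.g. {"toxicity": 0.02}
    source: str = ""                                   # provenance of the record
    license: str = ""                                  # licensing regime it was collected under
```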
Tooling contributions:
Automatic data collection pipelines
Family of tokenizers
Tooling to support human raters/writers
Automatic pre-training data quality checks (sketched below)
Autoraters for final response generation
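For instance, an automatic pre-training quality check could start with simple heuristics of the kind commonly used in web-corpus cleaning. The thresholds below are illustrative guesses, not tuned values:

```python
import re

# Devanagari block, used here as the Hindi example; each language/script
# would get its own character range and its own tuned thresholds.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")

def passes_quality_check(text: str) -> bool:
    """Illustrative heuristic filters for pre-training text."""
    words = text.split()
    if len(words) < 50:                      # too short to carry much knowledge
        return False
    if len(set(words)) / len(words) < 0.3:   # heavily repetitive (spam, boilerplate)
        return False
    in_script = len(DEVANAGARI.findall(text))
    if in_script / max(len(text), 1) < 0.5:  # mostly markup or wrong-language text
        return False
    return True
```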

To get started on building our datasets, we will run multiple experiments: clever ideas that test how we can build this dataset and measure the quality and contribution of different methods.

No one process or source will give us all this data, so we need to start trying ambitious projects that will give us large volumes of data at low cost, using India’s advantages. This will also include understanding and scaling tools for data collection, tools and methods for QA/verification, and so on.
These are our ideas:
For Haqdarshaq and MoSJE: build an AI intervention at different layers of the welfare delivery system
Place an application on an iPad in schools to capture student and teacher interactions, bootstrap an existing AI model to create small exercises that engage the students, and feed the recorded data back to the model to keep improving it
Same language subtitling in movies
News and podcasts licensing from Doordarshan
Telangana government TB follow-ups
Sound captcha: ask people to verbally identify a displayed image
Donating to the effort to capture domain-specific tokens
Volunteer efforts at colleges and universities to write an essay on Indian topics in different languages
Large-scale essay, creative writing and elocution competitions whose submission data can be fed into the corpus
Digitisation and publisher licences of Indian books
Recording and transcribing courtroom sessions or legislative conferences (following suit with what the EU enforced)
Encouraging more Indian content and Indian language blogs and publishing online by identifying and bringing down existing barriers

Are there any use-cases that stand out to you? Or any that you think we have missed?


B: We will have an evaluation framework and leaderboards for Indian languages that are effective and useful to real-world applications and research benchmarks alike. The evaluation will be fair and unbiased.

A research team will identify the tasks and metrics to evaluate Indic models. They will assess what the datasets should look like.
An engineering team will build and maintain the evaluations, running them periodically and ensuring the datasets remain uncorrupted and up to date.
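Part of keeping evaluation datasets uncorrupted can be as simple as fingerprinting each frozen set and verifying it before every periodic run. A minimal sketch, assuming the sets are stored as files; the function names are ours:

```python
import hashlib
from pathlib import Path

def dataset_fingerprint(path: Path) -> str:
    """SHA-256 of an evaluation dataset file, recorded once when the set is frozen."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_before_run(path: Path, expected: str) -> None:
    """Abort a periodic evaluation if the dataset has been edited or corrupted."""
    actual = dataset_fingerprint(path)
    if actual != expected:
        raise RuntimeError(f"{path} changed since freeze: {actual} != {expected}")
```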
1. What are the use cases and language model tasks that we should evaluate for?
2. What factors are most important to you when choosing a model - for example, open-source availability, multilingual capabilities, or specific performance metrics? How do you prioritise these factors?
3. How do you define and measure "success" for the tasks where you're using language models? What metrics or outcomes are most critical?
4. How do you balance trade-offs between different metrics for various tasks?
5. How can datasets be maintained and enriched by private evaluations?

Are you a model builder, user or evaluator? What are the 'tasks' that language models are not working for?



How do you choose what is the best model?


C: Build a use-case pipeline that can be applied to create fine-tuning datasets

Reaching the ten trillion tokens goal is not only about data, but about ensuring the right data is collected at the right opportunity. This is a multi-party effort to build tools and a stack to capture and open-source data for any use case.

Creating a DPI stack

Creating a data flywheel: a self-reinforcing cycle where the more data a system collects, the more value it can provide. This will lead to a compounding effect on insights and innovation. As the volume, variety and velocity of curated data increase, the process continually improves and has more potential to generate valuable data. This will include (a sketch of one turn of the cycle follows this list):
Data collection efforts from above methods
Quality checks
Evaluation of performance using the data
Human feedback to improve the data
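A minimal sketch of one turn of that cycle, with each stage as a pluggable step; the stage names and signatures are our own framing, not a fixed design:

```python
from typing import Callable, Iterable

def flywheel_round(
    collect: Callable[[], Iterable[dict]],               # data collection efforts
    quality_ok: Callable[[dict], bool],                  # quality checks
    evaluate: Callable[[list[dict]], float],             # performance using the data
    human_feedback: Callable[[list[dict]], list[dict]],  # raters refine and correct
) -> tuple[list[dict], float]:
    """One turn of the flywheel: collect -> filter -> evaluate -> improve.
    The improved records feed the corpus, so the next round starts richer."""
    raw = list(collect())
    clean = [r for r in raw if quality_ok(r)]
    score = evaluate(clean)
    improved = human_feedback(clean)
    return improved, score
```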
Target use cases
Domain- and language-specific conversational or audio chatbots
Factual information gathering
Educational assistance
Language tutor
Explaining complex topics in multiple languages
Customer care support
Writing assistant
Code writing and debugging
Automatic machine translation
Form filling through voice

How do we pick use cases? What is the measure of success of a use case?


Part 4: Why people+ai cannot do it alone, collaborations and volunteer efforts

A multi-party effort driven by India’s population advantage

The goal to reach Ten Trillion Tokens is an ambitious one: it will require multi-party and multi-year private, government and public partnerships.
However, with a young and driven population and the expertise of academia and industry to guide us, we are better placed than any other nation to reach this goal. Where corporations have spent millions to collect data outside India, we can leverage our population and expertise to bring down the cost of data collection.

Call to action!

Reaching the TTT goal will be a collective effort of many people and organisations. Co-operation from the government in providing data, from startups who generate data, and from big conglomerates in accelerating these efforts is all needed. And that's not all: researchers need to drive the direction of data collection to create more effective and lower-cost tools and pipelines.
Data is everything. In a world where everything runs on AI and AI is built on data, India can only catch up to the quality of foundational models if it invests in the generation of these tokens and in data-generation stacks. Applications require an understanding of Indian languages and context to provide accessible, effective and useful solutions to an Indian audience.
How you can contribute
1. Join our group that believes in the goal of generating Ten Trillion Indic tokens, and approach partners.
2. Pick a project that resonates with you
3. Fund the project and/or suggest incentive schemes

Ongoing projects

1. Literature survey on Indic LLMs and their use cases
2. Realistic ASR evaluations
3. Populating and prototyping use cases to generate data flywheels
