Ten Trillion Tokens


tl;dr

Low-resourced languages are underrepresented in datasets, leading to a plateau in models’ performance and capabilities.
Existing evaluations and leaderboards for Indic models are not robust and cannot diagnose poor performance in real-world applications.

Objectives of the project

Identify the datasets that not only achieve the goal of language understanding but are also critical to developing use cases.
To see how good current models are on these, we need benchmarks and leaderboards. Our first aim is to build these to understand where we are and where we lack, and to start collecting data accordingly.
We will also need to experiment with various licensing regimes and business models to create incentive structures for data contribution and usage that sustain this activity.
In the process, we will build guardrails for safety and context understanding.

TTT projects

Part 1: The problem and gaps where we will direct our efforts

A: Low-resourced languages are underrepresented in datasets, leading to a plateau in models’ performance and capabilities.

IndicLLMSuite is the largest collection of high-quality, cleaned Indic-language pre-training data, yet it contains only 251B tokens across 22 languages!
It is difficult to identify the right kind of data for any task. Typically, a highly capable model needs data for pre-training, fine-tuning and modality tuning.
1. Pre-training data, to improve a model’s understanding of a language and the world. Data for pre-training is rich in knowledge density, language usage, and diversity of opinion and domain. We need on the order of 1-10 trillion tokens in every language.
2. Fine-tuning data, required to meet every use case’s requirements. Fine-tuning data includes domain-specific knowledge, prompt-response pairs and output-style interactions. An estimated 1 million prompt-response pairs per language will be needed.
3. Multimodality data. Diverse, high-quality conversational and image data is needed to improve multimodal capabilities. 10-100K hours of data will be needed.
Current work on improving the performance of Indic LLMs is divided across models, corpora, evaluation, techniques and tools.
Despite several efforts to collect tokens across languages and domains, the magnitude of Indic data does not meet our needs.
Listed below are the factors behind the lack of data in Indic languages:
For data to be usable, quality is vital; maintaining quality across collection sources, and even identifying the right quality metric, is often difficult. Moreover, quality assurance for datasets can involve vast human effort, and finding the right skills, incentivising contributors and financing these efforts is challenging.
The volume of content generated in Indian languages is much lower than in Western languages. Although these languages are native to a very large population, much of the language usage, be it conversation or writing, is not captured.
This is further limited by a lower digital presence in many communities, creating holes in the language corpus. Data must ultimately be digitised to train models, and the extra effort needed to digitise Indic-language resources adds a steep barrier.
Limited accessibility concentrates the available data largely in Hindi and leaves larger gaps in other languages. Dialects and spoken language have even fewer instances in training data.
Translation makes up the majority of the data used to train Indic models. The quality of these datasets is reduced by the inaccuracy of translation models and the loss of inherent linguistic features.
The major sources of data for pre-training in Indic languages are Wikipedia and web crawling. However, quality is compromised: Wikipedia is high-quality text that is sparsely populated, while corpora from CommonCrawl and mC4 are unfiltered and noisy.
Crowdsourcing has been another popular method of building parts of these datasets, but the demographics of data collected this way tend to be biased and the quality lacking.
Major media produced in Indian languages is not diverse, being restricted to mass-media content, with high-quality podcasts and educational resources much smaller in volume.
Perhaps the most difficult task of all is identifying the right ‘data mix’: the combination of domains and tasks that will improve a model’s performance.

B: Existing evaluations and leaderboards for Indic models are not robust and cannot diagnose poor performance in real-world applications.

There is a need for research to evaluate the effectiveness of models trained on these datasets across various applications and domains, which is not fully addressed by current frameworks.
Here are some shortcomings of Indic model evaluations:
Larger benchmarks in Indic languages fail to capture end-user needs and remain largely academic. Many of these benchmarks are built purely with academic goals of model improvement in mind, so the resulting evaluations are of little use in scenarios involving real-world applications.
When models are used in applications today, they are applied as a chain involving ASR, generation, TTS and so on. Evaluations built to find the best model at a single task are not useful for evaluating the chain as a whole (see the sketch after this list).
Evaluations of models today do not cover the unique nuances that Indian languages and use cases present; they do not test for Indian context.
Models need to be evaluated for effectiveness in reach, at population scale. Reaching population scale is made harder by the nuances and complexities of Indian languages and by the number of dialects, accents and colloquialisms within regions.
Alongside multilingual or overall model performance, domain- and task-specific performance is vital to finding the best fit for an enterprise AI application.
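Here is a minimal sketch of what chain-level evaluation could look like, in contrast to scoring each component alone. It is an illustration under our own assumptions: the component functions are supplied by the caller, and task_success is a hypothetical end-to-end metric, not an existing library call.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    audio: bytes   # a spoken user query
    expected: str  # the outcome the user should reach

def evaluate_chain(
    test_set: list[Example],
    transcribe: Callable[[bytes], str],          # ASR: speech -> text
    generate: Callable[[str], str],              # LLM: text -> text
    synthesise: Callable[[str], bytes],          # TTS: text -> speech
    task_success: Callable[[bytes, str], bool],  # did the user get what they needed?
) -> float:
    """Score the user-visible outcome of the whole ASR -> LLM -> TTS chain,
    rather than each component's isolated metric (e.g. word error rate)."""
    hits = 0
    for ex in test_set:
        reply_audio = synthesise(generate(transcribe(ex.audio)))
        hits += task_success(reply_audio, ex.expected)
    return hits / len(test_set)
```

A model that tops an isolated ASR leaderboard can still drag this chain score down, and the chain score is the number an application builder actually cares about.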
How does an AI application builder choose a language model?
To choose a model off the market today, companies run evaluations on their private datasets. The greatest difference in evaluation quality comes from the evaluation dataset, which is closely tied to the task the models will be used for.

Part 2: Defining the scope and objectives we find value in

The data and models problem is huge, and trying to solve for all edge cases would be a mammoth effort. We are setting ambitious goals and defining a scope that allows the most impact to be made.
1. Of the 270-odd languages in India, where do we begin?
2. What are the important domains to be looking at? And what use cases are most relevant to us?
To make progress fastest, we will assume all efforts are primarily within the 10 most spoken languages in India:
Hindi: 52.83 crore
Bengali: 9.72 crore
Marathi: 8.30 crore
Telugu: 8.11 crore
Tamil: 6.90 crore
Gujarati: 5.54 crore
Kannada: 4.37 crore
Odia: 3.75 crore
Malayalam and Urdu
Here are our objectives to reach Ten Trillion Tokens:
Identify the datasets that not only achieve the goal of language understanding but are also critical to developing use cases.
To see how good current models are on these, we need benchmarks and leaderboards. Our first aim is to build these to understand where we are and where we lack, and to start collecting data accordingly.
We will also need to experiment with various licensing regimes and business models to create incentive structures for data contribution and usage that sustain this activity.
In the process, we will build guardrails for safety and context understanding.

Part 3: Reaching the objectives, our outcomes and success measures

A: Models will be able to better understand Indic languages and perform better on use-case-specific tasks. Answers generated will be rooted in an Indian context. In the process, datasets, data collection tools and processes will be built.

Models will get better at language understanding. ASR and TTS models will be enterprise-standard across use cases. Training and fine-tuning for any use case or small language model will be well resourced.
Datasets will be built with the following (an illustrative record sketch follows this list):
Detailed schema to represent the complex dataset
Multi-label, multi-category annotation of the dataset
Safety annotations of the dataset
Raw and tokenized content
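As an illustration only, a single record in such a dataset might carry fields like the sketch below; the field names and categories are our assumptions, not a finalised schema:

```python
from dataclasses import dataclass, field

@dataclass
class CorpusRecord:
    """One illustrative record: raw and tokenized content plus multi-label
    annotations and safety labels, mirroring the list above."""
    doc_id: str
    language: str                                      # e.g. "hi", "ta", "bn"
    raw_text: str                                      # content as collected
    tokens: list[int] = field(default_factory=list)    # output of a tokenizer
    # Multi-label, multi-category annotation: a record may belong to
    # several domains and tasks at once.
    domains: list[str] = field(default_factory=list)   # e.g. ["health", "welfare"]
    tasks: list[str] = field(default_factory=list)     # e.g. ["qa", "summarisation"]
    safety: dict[str, float] = field(default_factory=dict)  # e.g. {"toxicity": 0.02}
    source: str = ""                                   # provenance of the record
    license: str = ""                                  # licensing regime it was collected under
```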
Tooling contributions:
Automatic data collection pipelines
Family of tokenizers
Tooling to support human raters/writers
Automatic pre-training data quality checks (sketched below)
Autoraters for final response generation
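For instance, an automatic pre-training quality check could start with simple heuristics of the kind commonly used in web-corpus cleaning. The thresholds below are illustrative guesses, not tuned values:

```python
import re

# Devanagari block, used here as the Hindi example; each language/script
# would get its own character range and its own tuned thresholds.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")

def passes_quality_check(text: str) -> bool:
    """Illustrative heuristic filters for pre-training text."""
    words = text.split()
    if len(words) < 50:                      # too short to carry much knowledge
        return False
    if len(set(words)) / len(words) < 0.3:   # heavily repetitive (spam, boilerplate)
        return False
    in_script = len(DEVANAGARI.findall(text))
    if in_script / max(len(text), 1) < 0.5:  # mostly markup or wrong-language text
        return False
    return True
```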

To get started on building our datasets, we will run multiple experiments: clever ideas that test how we can build this dataset and measure the quality and contribution of different methods.

No one process or source will give us all this data, so we need to start trying ambitious projects that will give us large volumes of data at low cost, using India’s advantages. This will also include understanding and scaling tools for data collection, tools and methods for QA/verification, and so on.
These are our ideas:
For Haqdarshaq and MoSJE: build an AI intervention at different layers of the welfare delivery system
Place an application on an iPad in schools to capture student and teacher interactions, bootstrap an existing AI model to create small exercises that engage the students, and feed the recorded data back to the model to keep improving it
Same language subtitling in movies
News and podcasts licensing from Doordarshan
Telangana government TB follow-ups
Sound captcha: ask people to verbally identify a displayed image
Donating to the effort to capture domain-specific tokens
Volunteer efforts at colleges and universities to write an essay on Indian topics in different languages
Large-scale essay, creative writing and elocution competitions whose submission data can be fed into the corpus
Digitisation and publisher licences of Indian books
Recording and transcribing courtroom sessions or legislative conferences (following suit with what the EU enforced)
Encouraging more Indian content and Indian language blogs and publishing online by identifying and bringing down existing barriers

Are there any use-cases that stand out to you? Or any that you think we have missed?


B: We will have an evaluation framework and leaderboards for Indian languages that are effective and useful to real-world applications and research benchmarks alike. The evaluation will be fair and unbiased.

A research team will identify the tasks and metrics to evaluate Indic models. They will assess what the datasets should look like.
An engineering team will build and maintain the evaluations, running them periodically and ensuring the datasets remain uncorrupted and up to date.
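Part of keeping evaluation datasets uncorrupted can be as simple as fingerprinting each frozen set and verifying it before every periodic run. A minimal sketch, assuming the sets are stored as files; the function names are ours:

```python
import hashlib
from pathlib import Path

def dataset_fingerprint(path: Path) -> str:
    """SHA-256 of an evaluation dataset file, recorded once when the set is frozen."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_before_run(path: Path, expected: str) -> None:
    """Abort a periodic evaluation if the dataset has been edited or corrupted."""
    actual = dataset_fingerprint(path)
    if actual != expected:
        raise RuntimeError(f"{path} changed since freeze: {actual} != {expected}")
```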
1. What are the use cases and language model tasks that we should evaluate for?
2. What factors are most important to you when choosing a model - for example, open-source availability, multilingual capabilities, or specific performance metrics? How do you prioritise these factors?
3. How do you define and measure "success" for the tasks where you're using language models? What metrics or outcomes are most critical?
4. How do you balance trade-offs between different metrics for various tasks?
5. How can datasets be maintained and enriched by private evaluations?

Are you a model builder, user or evaluator? What are the 'tasks' that language models are not working for?



How do you choose what is the best model?


C: Build a use-case pipeline that can be applied to create fine-tuning datasets

Reaching the ten trillion tokens goal is not only about data, but about ensuring the right data is collected at the right opportunity. This is a multi-party effort to build tools and a stack to capture and open-source data for any use case.

Creating a DPI stack

Creating a data flywheel: a self-reinforcing cycle where the more data a system collects, the more value it can provide. This will lead to a compounding effect on insights and innovation. As the volume, variety and velocity of curated data increase, the process continually improves and has more potential to generate valuable data. This will include (a sketch of one turn of the cycle follows this list):
Data collection efforts from above methods
Quality checks
Evaluation of performance using the data
Human feedback to improve the data
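A minimal sketch of one turn of that cycle, with each stage as a pluggable step; the stage names and signatures are our own framing, not a fixed design:

```python
from typing import Callable, Iterable

def flywheel_round(
    collect: Callable[[], Iterable[dict]],               # data collection efforts
    quality_ok: Callable[[dict], bool],                  # quality checks
    evaluate: Callable[[list[dict]], float],             # performance using the data
    human_feedback: Callable[[list[dict]], list[dict]],  # raters refine and correct
) -> tuple[list[dict], float]:
    """One turn of the flywheel: collect -> filter -> evaluate -> improve.
    The improved records feed the corpus, so the next round starts richer."""
    raw = list(collect())
    clean = [r for r in raw if quality_ok(r)]
    score = evaluate(clean)
    improved = human_feedback(clean)
    return improved, score
```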
Target use cases
Domain- and language-specific conversational or audio chatbots
Factual information gathering
Educational assistance
Language tutor
Explaining complex topics in multiple languages
Customer care support
Writing assistant
Code writing and debugging
Automatic machine translation
Form filling through voice

How do we pick use cases? What is the measure of success of a use case?


Part 4: Why people+ai cannot do it alone, collaborations and volunteer efforts

A multi-party effort driven by India’s population advantage

The goal to reach Ten Trillion Tokens is an ambitious one: it will require multi-party and multi-year private, government and public partnerships.
However, with a young and driven population and the expertise of academia and industry to guide us, we are better placed than any other nation to reach this goal. Where corporations have spent millions to collect data outside India, we can leverage our population and expertise to bring down the cost of data collection.

Call to action!

Reaching the TTT goal will be a collective effort of many people and organisations. Co-operation from the government in providing data, from startups who generate data, and from big conglomerates in accelerating these efforts is all needed. And that's not all: researchers need to drive the direction of data collection to create more effective and lower-cost tools and pipelines.
Data is everything. In a world where everything runs on AI and AI is built on data, India can only catch up to the quality of foundational models if it invests in the generation of these tokens and in data-generation stacks. Applications require an understanding of Indian languages and context to provide accessible, effective and useful solutions to an Indian audience.
How you can contribute
1. Join our group that believes in the goal of generating Ten Trillion Indic tokens, and approach partners.
2. Pick a project that resonates with you
3. Fund the project and/or suggest incentive schemes

Ongoing projects

1. Literature survey on Indic LLMs and their use cases
2. Realistic ASR evaluations
3. Populating and prototyping use cases to generate data flywheels
