The Present State of Indic LLM evaluations

Authors

Sagar Sarkale [People+ai]
Last updated: 20th November, 2024

Introduction

This document primarily discusses the current state of LLM and Indic LLM evaluations, their shortcomings in both Indic and non-Indic spaces, and how they can be improved to create a robust benchmark for Indic models.
Additionally, this document serves as an interactive resource where you can contribute your insights and suggestions through the provided form, helping refine the thesis for evaluating Indic LLMs based on real-world expertise and requirements.

Why do we need Indic LLM evaluations and Benchmarks?

There is no central platform for developers and the wider community to identify and use the current best Indic models, which in turn hampers the overall progress of Indic models.
The majority of Indic models are benchmarked on translated versions of existing English benchmarks rather than on native content, which misses linguistic patterns and cultural aspects unique to Indian languages.
The absence of standardized metrics for judging whether an Indic model suits a given use case calls for assessing the capabilities of these models under realistic, India-specific scenarios, to ensure these models meet the needs of Indian end users.

Which Non Indic LLM benchmarks exist?

LLM benchmarks try to assess a wide variety of model capabilities. Along with natural language generation and interpretation tasks, these benchmarks also try to evaluate knowledge in multiple domains. While concepts in math, science, and common sense remain the same in the Indian context, the way questions are framed in India would be different.

Example 1 - Math
[English] “Alice bought 15 cookies for $150. What is the price she would have paid for 10 cookies?”
[Hindi] "Riya ne 5 kilo tamatar 200 rupay ke bhav se kharida. Yadi woh 2 kilo tamatar khareedti toh kitne rupay kharch hote?"
[Hindi] “रिया ने 5 किलो टमाटर 200₹ के भाव से खरीदा। यदि वह 2 किलो टमाटर खरीदती तो कितने रुपए खर्च होते?”

Example 2 - Common Sense
[English] “The trophy doesn't fit in the brown suitcase because it's too _. (small/big)”
[Hindi] “Pinki ko Neetu ki Diwali ki mithai pasand nahi aayi kyonki woh bahot ___ thi। (khatti/mithi)”
[Hindi] “पिंकी को नीतू की दिवाली की मिठाई पसंद नहीं आई क्योंकि वह बहुत ___ थी। (खट्टी/मीठी)”
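One lightweight way to make such adaptations reusable is to store each benchmark item with both the English original and the culturally adapted version. The Python record below is only an illustrative sketch; the field names and structure are assumptions, not an existing dataset schema.

```python
# Hypothetical record format for a culturally adapted benchmark item.
# Field names are illustrative assumptions, not an existing standard.
adapted_item = {
    "source_benchmark": "GSM8K-style math word problem",
    "original_en": "Alice bought 15 cookies for $150. What would she have paid for 10 cookies?",
    "adapted_hi": "रिया ने 5 किलो टमाटर 200₹ के भाव से खरीदा। यदि वह 2 किलो टमाटर खरीदती तो कितने रुपए खर्च होते?",
    "adapted_hi_latin": "Riya ne 5 kilo tamatar 200 rupay ke bhav se kharida. Yadi woh 2 kilo tamatar khareedti toh kitne rupay kharch hote?",
    "answer": "80 rupees",  # 200 / 5 kg = 40 per kg; 40 x 2 kg = 80
    "adaptation_notes": "Local names, currency, and units; an adaptation rather than a literal translation.",
}
```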

The table below lists various benchmarks and their adaptation to an Indian context.
Benchmark - Task
AGIEval - Professional Exam
ARC - Science
BBH - Logic
BIG-Bench - Common Sense
DROP - Math Reasoning
FLAN - Sentiment Analysis
FLAN - Summarization
GSM8K - Math Word Problems
MATH - Advanced Math
MMLU - Geography
MMLU - History
MMLU - Science
MT-Bench - Multi-turn Chat
SuperGLUE - Reading
TruthfulQA - General Knowledge
WinoGrande - Common Sense
AGIEval
Dataset Name: AGIEval
Task: Professional Exam (Reasoning)
Sample Input (English): "A car travels 120 miles in 2 hours. What is its average speed?"
Sample Output (English): "60 miles per hour"
Sample Input (Hindi, adapted to Indian origin): "एक किसान 2 एकड़ जमीन में धान की खेती करता है। प्रति एकड़ उपज 30 क्विंटल है। यदि धान का समर्थन मूल्य 2000 रुपये प्रति क्विंटल है, तो उसकी कुल आय क्या होगी?"
Sample Output (Hindi, adapted to Indian origin): "चरण 1: कुल उपज = 2 एकड़ × 30 क्विंटल = 60 क्विंटल; चरण 2: कुल आय = 60 क्विंटल × 2000 रुपये = 1,20,000 रुपये"

Is adapting existing English benchmarks to an Indian context a value-add to the current state of Indic model benchmarking?

If you disagree with the above statement, please add a reason.

Which Indic LLM benchmarks exist?


What we can observe from existing benchmarks:
Strong reliance on translation as a methodology to create benchmark data.
Human verification is generally used as a quality-control step after automated collection/generation.
Most of the benchmarks use existing resources (Wikipedia, News, other datasets) as a primary source.
Recent benchmarks (2024) show more focus on systematic translation pipelines and verification processes.

Where can we innovate or add value?
We can collect or generate quality benchmark data which has not been built yet.
We can create a new evaluation metric to address the limitations of existing ones.

Below is a comprehensive list of Indic benchmarks and their details.
Benchmark - Task/Subset
INDIC GLUE - News Category Classification
INDIC GLUE - Headline Prediction
INDIC GLUE - Wikipedia Section-Title Prediction
INDIC GLUE - Cloze-style QA
INDIC GLUE - Named Entity Recognition
INDIC GLUE - Cross-lingual Sentence Retrieval
INDIC GLUE - Winograd NLI
INDIC GLUE - COPA
INDIC GLUE - Paraphrase Detection
INDIC GLUE - Discourse Mode Classification
IndicXTREME - IndicSentiment
IndicXTREME - IndicXNLI
IndicXTREME - IndicCOPA
IndicXTREME - IndicXPara
IndicXTREME - M-Intent
IndicXTREME - Naamapadam
IndicXTREME - M-SlotFill
IndicXTREME - IndicQA
IndicXTREME - FLORES
IndicNLG - Biography Generation
IndicNLG - Headline Generation
IndicNLG - Sentence Summarization
IndicNLG - Paraphrase Generation
IndicNLG - Question Generation
ARC-Easy - General Knowledge QA
ARC-Challenge - Advanced Knowledge QA
Hellaswag - Common Sense Reasoning
MMLU - Multi-task Language Understanding
BoolQ - Yes/No Question Answering
CROSSSUM-IN - Cross-lingual Summarization
FLORES-IN - Machine Translation
XQUAD-IN - Multilingual QA
XORQA-IN-XX - Cross-lingual QA
XORQA-IN-EN - Cross-lingual QA
RECON - Cross-lingual Evaluation
INTEL - Training Data for Cross-lingual Evaluation
MILU - Multiple Choice Questions
INDIC GLUE
creator_org: AI4Bharat & IIT Madras
release_year: 2020
task: News Category Classification
task_type: Text Classification
method_brief: Semi-automatic creation using URL components for category labels
data_source: News websites
sample_input: News content
sample_output: Entertainment/Sports/Politics/etc.
scoring_method: Accuracy
languages_covered: 9 languages
domains_covered: News
size_per_language: ~3K-30K articles
total_size: 125,630
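As a rough illustration of the kind of semi-automatic, URL-based labelling that method_brief describes, the sketch below maps a URL path segment to a category label. The URL pattern and category map are assumptions for illustration, not the actual INDIC GLUE pipeline.

```python
# Hypothetical sketch of URL-based category labelling for news articles.
# The URL structure and category map are assumed, not taken from INDIC GLUE.
from typing import Optional
from urllib.parse import urlparse

CATEGORY_MAP = {"manoranjan": "Entertainment", "khel": "Sports", "rajneeti": "Politics"}

def label_from_url(url: str) -> Optional[str]:
    """Return a category label derived from the first URL path segment, if known."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return CATEGORY_MAP.get(segments[0]) if segments else None

print(label_from_url("https://example-news.in/khel/cricket-match-report"))  # Sports
```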

Based on the data sources used by existing benchmarks, we can categorise them as follows.
Summary of categories of data
Primary Sources (3 content types): Wikipedia Articles & Biographical Content; News Media & Headlines; Government & Political Documents
Academic & Scientific (5 content types): Scientific & Mathematics Content; Environmental & Medical Science; Educational & Academic Texts; Arts & Humanities Materials; Social Science Documents
User-Generated (5 content types): Product Reviews; Consumer Feedback; User Queries; Conversational Data; Intent-based Conversations
Reference (5 content types): General Knowledge Questions; Common Sense Scenarios; Creative Content; Theory of Mind Scenarios; Counterfactual Situations

What other categories or sources of data should we consider to build a more robust benchmark?

Limitations of existing LLM evaluations


1/ Most of the current evaluations as well as training datasets are literal translations of existing English benchmarks.
Why does this matter?
People do not talk the way current datasets are written
Language in day-to-day usage is much simpler
The meaning of a translated sentence can change
e.g. in tasks where the goal is to predict what caused the premise to happen, literal translation distorts the premise itself:
e.g. Premise: “The girl ran out of energy.” > “लड़की ऊर्जा से भाग गई।” (literally, “the girl ran away from energy”)
e.g. Premise: “The bolt tightened.” > “बोल्ट कड़ा हो गया।” (literally, “the bolt became stiff”)

Can we say we need a better translation/ adaptation of benchmarks?

2/ Currently all the benchmarks operate on standardized versions of language and do not capture the dialect and regional nuances.

Can we say we need a better coverage of various regions and their usage of languages?

Most of the datasets for both training and evaluation are created from openly available newspapers, government websites, and the Wikipedia of a given language, all of which use very formal language.
Why does this matter?
These evaluations will penalize informal but valid language usages

Can we say we need our benchmarks to cover both formal and informal usage of languages?

4/ Translation of existing machine translation and natural language interpretation benchmark datasets:
One of the benchmarks is a translation of FLORES (Meta's dataset for evaluating machine translation) for assessing the capabilities of Indic models.
There is a high chance that Indian users will never use the proper nouns/entities this dataset contains, e.g. “Mr. Rudd said XYZ”, “Bucharest City Hall had XYZ”, and so on.

Can we say we need to collect information about local places, names, and objects specific to a region in order to create more robust benchmarks?

5/ Metrics used for machine translation (limitations of BLEU in the Indic context)
Why does this matter?
A valid word order in an Indic language may be penalized because it differs from the reference answer's order
Honorifics and the number of references:
e.g. “Did you eat food?” -> “Aapne khana khaya?”, “Tumne khana khaya?”
Both are valid; which one scores well depends on which reference translation is present in the dataset (see the sketch below)
Existing benchmarks primarily have only one reference translation
Similarly, other tasks also have a single reference answer against which LLM responses are compared.

Can we say that having multiple reference answers, to capture the varied valid responses of a language model, would help in accurately benchmarking the generation capabilities of LLMs?
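To make the honorifics example concrete, here is a minimal sketch (assuming the sacrebleu package) of how a perfectly valid informal translation is penalized when only the honorific form is available as a reference, and how adding a second reference fixes this:

```python
# Minimal sketch: single- vs multi-reference BLEU on the honorifics example above.
# Assumes the sacrebleu package; sentence-level scores are illustrative only.
import sacrebleu

hypothesis = "Tumne khana khaya?"                         # valid informal translation
single_ref = ["Aapne khana khaya?"]                       # only the honorific form
multi_ref = ["Aapne khana khaya?", "Tumne khana khaya?"]  # both registers accepted

print(sacrebleu.sentence_bleu(hypothesis, single_ref).score)  # penalized despite being valid
print(sacrebleu.sentence_bleu(hypothesis, multi_ref).score)   # 100.0, exact match with a reference
```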

6/ Domain knowledge understanding and benchmarks
Currently we know that LLMs (non-Indic SOTA models) can accurately generate content and answer questions across multiple domains.

Can we say that building a knowledge base across multiple domains would be a valuable asset for benchmarking Indic models on knowledge-based tasks?

Can we say that state-school-curriculum-style tests for LLMs would be a reflective measure of how smart a model is, e.g. letting us say “ABC Indic model is smarter than a 7th grader”?

7/ In order to evaluate an Indic model at scale, we currently have to rely heavily on human evaluators and their expertise in the topic, which can be unreliable and very time-consuming.
Why does this matter?
As benchmark datasets grow both in size and diversity, finding enough human evaluators will not be feasible.

Can we say we need a mechanism which will act as an “Indic - LLM Judge” to evaluate model responses against a reference?

What are some precursors that would be needed to get started with “Indic - LLM Judge”?
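As one possible starting point, below is a minimal sketch of what an “Indic - LLM Judge” call could look like. It assumes the openai Python client with an API key in the environment; the judge model, rubric, and 1-5 scale are all arbitrary assumptions, not a defined standard.

```python
# Sketch only: rubric-based judging of an Indic model response against a reference.
# Assumes the `openai` package (v1 client); the model and rubric are placeholders.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an evaluator for {language} responses.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Rate the model answer from 1-5 for (a) factual correctness against the reference and
(b) language quality (grammar, register, dialect appropriateness). Reply as JSON with
keys "correctness", "language_quality", and a one-sentence "justification"."""

def judge(question: str, reference: str, answer: str, language: str = "Hindi") -> str:
    prompt = JUDGE_PROMPT.format(language=language, question=question,
                                 reference=reference, answer=answer)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # arbitrary choice; any judge-capable model could be swapped in
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content  # caller can parse the returned JSON
```

Likely precursors for something like this would include human-verified reference answers in each target language and a study of how well such judge scores correlate with Indic-speaking human raters.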

8/ Current benchmarks do not distinguish between SLMs and LLMs. SLMs understandably lag on knowledge tasks, but they can perform language tasks on par with LLMs.

Can we say we need to have a different way of evaluating SLMs and LLMs in Indic space?



ASR models and evaluations


Introduction


This document examines Automatic Speech Recognition (ASR) systems, with special attention to their application in Indian languages. It analyzes current limitations in both Indian and non-Indian language contexts, and proposes ways to strengthen evaluation standards to create more reliable benchmarking for Indian language ASR models.
Moreover, it is designed as a collaborative platform where you can share your expertise and recommendations through an integrated feedback form. Your input will help shape better evaluation criteria for Indian language ASR systems based on practical experience and actual needs.

Why do we need Indic ASR evaluations and Benchmarks?

Vast Regional Language Diversity across India presents unique challenges with multiple speaking styles, accents, and dialects, making it crucial to evaluate ASR models against this complex linguistic landscape.
Lack of Standardized Evaluation Methods prevents reliable assessment of ASR model performance, making it difficult to quantify accuracy and effectiveness across different Indian languages and dialects.
Absence of a Centralized Comparison Platform limits developers and researchers from systematically identifying model limitations and benchmarking different ASR systems.

Which Non Indic ASR benchmarks exist?


Dataset - Size
LibriSpeech - 1,000 hours
Common Voice - 21,000 hours
VoxPopuli - 2,000 hours
TED-LIUM 3 - 452 hours
GigaSpeech - 10,000 hours
SPGISpeech - 5,000 hours
Earnings-22 - 119 hours
AMI - 100 hours
LibriSpeech
Size: 1,000 hours
Domain: Audiobooks
Input Format: 16kHz FLAC files with speaker/chapter metadata
Output Format: Text transcriptions; word-level alignments; speaker identifiers
Data Source: LibriVox Project
Key Features: Clean read speech; multiple speakers; chapter-level organization; high-quality alignments


Which Indic ASR benchmarks exist?


Dataset - Size
Kathbath - 1,527 hours
Kathbath-Hard - 1,527 hours
FLEURS - 90 hours
CommonVoice - 373 hours
IndicTTS - 192 hours
MUCS - 403 hours
Gramvaani - 100 hours
Kathbath
Size: 1,527 hours (total across multiple languages*)
Input Format: Android phone recordings
Output Format: Text transcriptions
Domain: Wikipedia & News
Key Features: Covers all 12 languages; read speech collection; mobile-app-based collection; source text from Wikipedia & news
Data Source: AI4Bharat

Though the above datasets show extensive coverage across languages, some gaps remain:
Most of the datasets are read speech and lack the spontaneity of conversational speech
Limited coverage of domains: education, religion, technical, and medical content are missing
Most of the recordings are made in controlled environments and lack real-world background noise
Not every language is assessed equally, due to the lack of data in relatively low-resource languages

Do you agree with the above points?

Current state of Leaderboards


What does the current non-Indic ASR leaderboard look like?
[Leaderboard screenshot]

What does the current Indic ASR leaderboard look like?
[Leaderboard screenshot]
Source: (reports WER scores for various datasets)

Key Questions Unaddressed by Leaderboards


Typical ASR leaderboards showcase:
Word Error Rate (WER) across various datasets
Real-Time Factor (RTFx) for processing speed (see the sketch below)
Performance on standardized datasets (e.g., LibriSpeech, AMI, TED-LIUM, GigaSpeech, etc.)
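For context, RTFx is commonly reported as audio duration divided by processing time, so higher means faster than real time; the tiny sketch below just shows that arithmetic, with made-up timings.

```python
# Minimal sketch of the RTFx (inverse real-time factor) arithmetic; timings are made up.
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    # RTFx > 1 means the model transcribes faster than real time.
    return audio_seconds / processing_seconds

print(rtfx(audio_seconds=600.0, processing_seconds=30.0))  # 20.0x real time
```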

But what are some key shortcomings of leaderboards like these?

1/ Context-Specific Performance Thresholds
Critical Question: What constitutes an acceptable WER for specific use cases?
Current limitations:
No industry-specific benchmarks
Lack of context for interpreting WER scores
No guidance on minimum acceptable performance thresholds
Examples of varying WER requirements:
Medical transcription: <5% WER typically required
Meeting transcription: <10% WER might be acceptable
Casual conversation: <15% WER could be sufficient

Can we say we need a standard metric for each domain and a better way to interpret it?
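As a concrete illustration of why a raw WER number needs context, the sketch below (assuming the jiwer package) shows a transcript with roughly 11% WER that would still be unacceptable for medical use, because the single wrong word is a dosage:

```python
# Sketch, assuming the `jiwer` package: the same WER can hide very different severities.
import jiwer

reference  = "patient should take 50 mg of atenolol twice daily"
hypothesis = "patient should take 15 mg of atenolol twice daily"  # one substituted word

print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")  # ~11% WER, yet clinically dangerous
```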

2/ Use Case Relevance
Critical Question: How well do benchmark datasets represent real-world applications?
Gaps in current evaluation:
Limited domain coverage
No industry-specific datasets (healthcare, legal, education)
Lack of real-world acoustic conditions
Missing everyday conversational scenarios
Absence of specialized vocabulary testing
No evaluation of handling domain-specific jargon

Can we say we need to benchmark ASR models on domain specific real world data?

3/ Speaker Recognition Capabilities