The Present State of Indic LLM evaluations

Authors

Sagar Sarkale (People+ai)
Last updated: 20th November, 2024

Introduction

This document primarily discusses the current state of LLM and Indic LLM evaluations, their shortcomings in both Indic and non-Indic spaces, and how they can be improved to create a robust benchmark for Indic models.
Additionally, this document serves as an interactive resource where you can contribute your insights and suggestions through the provided form, helping refine the thesis for evaluation of Indic LLMs based on real-world expertise and requirements.

Why do we need Indic LLM evaluations and Benchmarks?

There is no central platform for developers and the broader community to identify and use the current best Indic models, which in turn hampers the overall progress of Indic models.
The majority of Indic models are benchmarked on translated versions of existing English benchmarks rather than native content, which misses linguistic patterns and cultural aspects unique to Indian languages.
The absence of standardized metrics for judging whether an Indic model fits a given use case calls for assessing these models under realistic, India-specific scenarios to ensure they meet the needs of Indian end users.

Which Non Indic LLM benchmarks exist?

LLM benchmarks try to assess a wide variety of model capabilities. Along with natural language generation and interpretation tasks, these benchmarks also try to evaluate knowledge in multiple domains. While the underlying concepts in math, science, and common sense remain the same in an Indian context, the way questions are framed in India is different.

Example 1 - Math
[English] “Alice bought 15 cookies for $150. What price would she have paid for 10 cookies?”
[Hindi] "Riya ne 5 kilo tamatar 200 rupay ke bhav se kharida. Yadi woh 2 kilo tamatar khareedti toh kitne rupay kharch hote?"
[Hindi] “रिया ने 5 किलो टमाटर 200₹ के भाव से खरीदा। यदि वह 2 किलो टमाटर खरीदती तो कितने रुपए खर्च होते?”

Example 2 - Common Sense
[English] “The trophy doesn't fit in the brown suitcase because it's too _. (small/big)”
[Hindi] “Pinki ko Neetu ki Diwali ki mithai pasand nahi aayi kyonki woh bahot ___ thi। (khatti/mithi)”
[Hindi] “पिंकी को नीतू की दिवाली की मिठाई पसंद नहीं आई क्योंकि वह बहुत ___ थी। (खट्टी/मीठी)”

The table below lists various benchmarks and example adaptations of them to an Indian context.
Benchmark | Task
AGIEval | Professional Exam
ARC | Science
BBH | Logic
BIG-Bench | Common Sense
DROP | Math Reasoning
FLAN | Sentiment Analysis
FLAN | Summarization
GSM8K | Math Word Problems
MATH | Advanced Math
MMLU | Geography
MMLU | History
MMLU | Science
MT-Bench | Multi-turn Chat
SuperGLUE | Reading
TruthfulQA | General Knowledge
WinoGrande | Common Sense
Example entry: AGIEval
Dataset Name: AGIEval
Task: Professional Exam
Task Type: Reasoning
Sample Input (English): "A car travels 120 miles in 2 hours. What is its average speed?"
Sample Output (English): "60 miles per hour"
Sample Input (Hindi, adapted to an Indian context): "एक किसान 2 एकड़ जमीन में धान की खेती करता है। प्रति एकड़ उपज 30 क्विंटल है। यदि धान का समर्थन मूल्य 2000 रुपये प्रति क्विंटल है, तो उसकी कुल आय क्या होगी?"
Sample Output (Hindi, adapted to an Indian context): "चरण 1: कुल उपज = 2 एकड़ × 30 क्विंटल = 60 क्विंटल; चरण 2: कुल आय = 60 क्विंटल × 2000 रुपये = 1,20,000 रुपये"

Is adapting existing English benchmarks to an Indian context a value add to the current state of Indic model benchmarking?

If you disagree with the above statement, please add a reason.

Which Indic LLM benchmarks exist?


What we can observe from existing benchmarks:
Strong reliance on translation as a methodology to create benchmark data.
Human verification is generally used as a quality-control step after automated collection/generation.
Most of the benchmarks use existing resources (Wikipedia, News, other datasets) as a primary source.
Recent benchmarks (2024) show more focus on systematic translation pipelines and verification processes.

Where can we innovate or add value?
We can collect or generate quality benchmark data which has not been built yet.
We can create a new evaluation metric to address the limitations of existing ones.

Below is a comprehensive list of Indic benchmarks and their details.
Benchmark | Task
INDIC GLUE | News Category Classification
INDIC GLUE | Headline Prediction
INDIC GLUE | Wikipedia Section-Title Prediction
INDIC GLUE | Cloze-style QA
INDIC GLUE | Named Entity Recognition
INDIC GLUE | Cross-lingual Sentence Retrieval
INDIC GLUE | Winograd NLI
INDIC GLUE | COPA
INDIC GLUE | Paraphrase Detection
INDIC GLUE | Discourse Mode Classification
IndicXTREME | IndicSentiment
IndicXTREME | IndicXNLI
IndicXTREME | IndicCOPA
IndicXTREME | IndicXPara
IndicXTREME | M-Intent
IndicXTREME | Naamapadam
IndicXTREME | M-SlotFill
IndicXTREME | IndicQA
IndicXTREME | FLORES
IndicNLG | Biography Generation
IndicNLG | Headline Generation
IndicNLG | Sentence Summarization
IndicNLG | Paraphrase Generation
IndicNLG | Question Generation
ARC-Easy | General Knowledge QA
ARC-Challenge | Advanced Knowledge QA
Hellaswag | Common Sense Reasoning
MMLU | Multi-task Language Understanding
BoolQ | Yes/No Question Answering
CROSSSUM-IN | Cross-lingual Summarization
FLORES-IN | Machine Translation
XQUAD-IN | Multilingual QA
XORQA-IN-XX | Cross-lingual QA
XORQA-IN-EN | Cross-lingual QA
RECON | Cross-lingual Evaluation
INTEL | Training Data for Cross-lingual Evaluation
MILU | Multiple Choice Questions
Example entry: INDIC GLUE (News Category Classification)
creator_org: AI4Bharat & IIT Madras
release_year: 2020
task: News Category Classification
task_type: Text Classification
method_brief: Semi-automatic creation using URL components for category labels
data_source: News websites
sample_input: News content
sample_output: Entertainment/Sports/Politics/etc.
scoring_method: Accuracy
languages_covered: 9 languages
domains_covered: News
size_per_language: ~3K-30K articles
total_size: 125,630

Based on existing data sources, we can categorise them as follows.
Summary of categories of data
Category | Content Types
Primary Sources (3) | Wikipedia Articles & Biographical Content; News Media & Headlines; Government & Political Documents
Academic & Scientific (5) | Scientific & Mathematics Content; Environmental & Medical Science; Educational & Academic Texts; Arts & Humanities Materials; Social Science Documents
User-Generated (5) | Product Reviews; Consumer Feedback; User Queries; Conversational Data; Intent-based Conversations
Reference (5) | General Knowledge Questions; Common Sense Scenarios; Creative Content; Theory of Mind Scenarios; Counterfactual Situations

What other categories or sources of data should we consider to build a more robust benchmark?

Limitations of existing LLM evaluations


1/ Most of the current evaluations, as well as training datasets, are literal translations of existing English benchmarks.
Why does this matter?
People do not talk the way current datasets are written
Language is much simpler in day-to-day usage
The meaning of the translated sentence changes
e.g. in a task where the model must predict what caused the premise to happen:
e.g. Premise: “The girl ran out of energy.” > “लड़की ऊर्जा से भाग गई।” (the literal translation reads as “the girl ran away with energy”, losing the idiom)
e.g. Premise: “The bolt tightened.” > “बोल्ट कड़ा हो गया।”

Can we say we need a better translation/ adaptation of benchmarks?

2/ Currently all the benchmarks operate on standardized versions of language and do not capture the dialect and regional nuances.

Can we say we need a better coverage of various regions and their usage of languages?

3/ Most of the datasets for both training and evaluation are created from openly available newspapers, government websites, and each language's Wikipedia, all of which use very formal language.
Why does this matter?
These evaluations will penalize informal but valid language usages

Can we say we need our benchmarks to cover both formal and informal usage of languages?

4/ Translation of existing machine translation and natural language interpretation benchmark datasets:
One of the benchmarks is a translation of FLORES (Meta AI's dataset for evaluating machine translation) used to assess the understanding capabilities of Indic models.
The proper nouns this dataset contains are entities that Indian users are unlikely to ever refer to: e.g. “Mr. Rudd said XYZ”, “Bucharest City Hall had XYZ”, and so on.

Can we say we need to collect information about local places, names, and objects specific to a region in order to create more robust benchmarks?

5/ Metrics used for machine translation (limitations of BLEU in an Indic context)
Why does this matter?
A valid word order in an Indic language may be penalized simply because it differs from the reference answer's order.
Honorifics and the number of references:
e.g. “Did you eat food?” -> “Aapne khana khaya?”, “Tumne khana khaya?”
Both are valid; which one scores well depends on which reference translation is present in the dataset.
Existing benchmarks primarily provide only one reference translation.
Similarly, other tasks also compare LLM responses against a single reference answer. A sketch of how multiple references change the BLEU score follows below.
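To make the single-reference problem concrete, below is a minimal sketch using NLTK's sentence-level BLEU (an assumption: NLTK is not part of any benchmark named here). The honorific example above is scored once against a single reference and once against two; tokenization is naive whitespace splitting, and bigram weights are used because the sentences are short.

```python
# Minimal sketch: how the number of reference translations changes BLEU.
# Requires the open-source NLTK package; sentences are the honorific example above.
from nltk.translate.bleu_score import sentence_bleu

hypothesis = "Tumne khana khaya".split()

single_ref = ["Aapne khana khaya".split()]
multi_ref = ["Aapne khana khaya".split(), "Tumne khana khaya".split()]

# Bigram BLEU (1-gram and 2-gram weights) since the sentences are short.
weights = (0.5, 0.5)

print(sentence_bleu(single_ref, hypothesis, weights=weights))  # ~0.58: valid answer penalized
print(sentence_bleu(multi_ref, hypothesis, weights=weights))   # 1.0: second reference matches
```

The same hypothesis goes from roughly 0.58 to 1.0 purely because a second, equally valid reference was available.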

Can we say that having multiple reference answers, to capture the varied but valid responses of a language model, would help in accurately benchmarking the generation capabilities of LLMs?

6/ Domain knowledge understanding and benchmarks
Currently we know that LLMs (non-Indic SOTA models) can accurately generate content and answer questions across multiple domains.

Can we say that building a knowledge base across multiple domains would be a valuable asset for benchmarking Indic models on knowledge-based tasks?

Can we say that state-school-curriculum-style tests for LLMs would be a reflective measure of how capable a model is, allowing claims such as “ABC Indic model is smarter than a 7th grader”?
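As a rough illustration of what scoring such a curriculum-style test could look like, here is a minimal sketch. The question format and the ask_model callable are hypothetical, and the sample question is illustrative rather than drawn from a real exam paper; accuracy over a question set like this is one simple way to back a claim of the kind above.

```python
# Minimal sketch of scoring a curriculum-style multiple-choice test.
# ask_model is a hypothetical callable that returns the model's chosen option letter.
def score_mcq(questions: list[dict], ask_model) -> float:
    """Return the model's accuracy on a list of multiple-choice questions."""
    correct = 0
    for q in questions:
        prompt = q["question"] + "\n" + "\n".join(
            f"{label}. {text}" for label, text in q["options"].items()
        ) + "\nAnswer with the option letter only."
        if ask_model(prompt).strip().upper().startswith(q["answer"]):
            correct += 1
    return correct / len(questions)

sample = [{
    "question": "किस नदी को 'दक्षिण गंगा' कहा जाता है?",
    "options": {"A": "कावेरी", "B": "गोदावरी", "C": "नर्मदा", "D": "ताप्ती"},
    "answer": "B",
}]
# accuracy = score_mcq(sample, ask_model)
```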

7/ In order to evaluate an Indic model at scale, we currently have to rely heavily on human evaluators and their expertise in the topic, which can be unreliable and very time consuming.
Why does this matter?
As benchmark datasets grow in both size and diversity, finding human evaluators for them will not be feasible.

Can we say we need a mechanism which will act as “Indic - LLM Judge” to evaluate responses of models against a reference?

What are some precursors that would be needed to get started with “Indic - LLM Judge”?
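As one possible starting point, below is a minimal sketch of what an “Indic LLM Judge” loop could look like. The rubric, the 1-to-5 scale, and the judge_model callable are all assumptions for illustration, not a proposed standard; such a judge would itself need to be validated against human ratings before being trusted.

```python
# Minimal sketch of an LLM-as-judge loop for Indic responses.
# judge_model is a hypothetical wrapper around any capable instruction-tuned model.
JUDGE_PROMPT = """You are evaluating a Hindi answer produced by a language model.

Question: {question}
Reference answer: {reference}
Model answer: {candidate}

Score the model answer from 1 (wrong or unnatural Hindi) to 5 (correct and natural Hindi).
Reply with only the number."""


def judge(question: str, reference: str, candidate: str, judge_model) -> int:
    """Ask the judge model for a 1-5 score and parse it defensively."""
    reply = judge_model(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 1  # fall back to the lowest score
```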

8/ Current benchmarks do not distinguish between SLMs and LLMs. SLMs obviously lag on knowledge tasks, but they can perform language tasks on par with LLMs.

Can we say we need to have a different way of evaluating SLMs and LLMs in Indic space?



ASR models and evaluations


Introduction


This document examines Automatic Speech Recognition (ASR) systems, with special attention to their application in Indian languages. It analyzes current limitations in both Indian and non-Indian language contexts, and proposes ways to strengthen evaluation standards to create more reliable benchmarking for Indian language ASR models.
Moreover, it is designed as a collaborative platform where you can share your expertise and recommendations through an integrated feedback form. Your input will help shape better evaluation criteria for Indian language ASR systems based on practical experience and actual needs.

Why do we need Indic ASR evaluations and Benchmarks?

Vast Regional Language Diversity across India presents unique challenges with multiple speaking styles, accents, and dialects, making it crucial to evaluate ASR models against this complex linguistic landscape.
Lack of Standardized Evaluation Methods prevents reliable assessment of ASR model performance, making it difficult to quantify accuracy and effectiveness across different Indian languages and dialects.
Absence of a Centralized Comparison Platform limits developers and researchers from systematically identifying model limitations and benchmarking different ASR systems.

Which Non Indic ASR benchmarks exist?


Benchmark | Size
LibriSpeech | 1000 hours
Common Voice | 21000 hours
VoxPopuli | 2000 hours
TED-LIUM 3 | 452 hours
GigaSpeech | 10,000 hours
SPGISpeech | 5000 hours
Earnings-22 | 119 hours
AMI | 100 hours
Example entry: LibriSpeech
Size: 1000 hours
Domain: Audiobooks
Input Format: 16kHz FLAC files with speaker/chapter metadata
Output Format: Text transcriptions; word-level alignments; speaker identifiers
Data Source: LibriVox Project
Key Features: Clean read speech; multiple speakers; chapter-level organization; high-quality alignments
Reference:

Which Indic ASR benchmarks exist?


Benchmark | Size
Kathbath | 1527 hours
Kathbath-Hard | 1527 hours
FLEURS | 90 hours
CommonVoice | 373 hours
IndicTTS | 192 hours
MUCS | 403 hours
Gramvaani | 100 hours
Example entry: Kathbath
Size: 1527 hours (total across multiple languages)
Input Format: Android phone recordings
Output Format: Text transcriptions
Domain: Wikipedia & News
Key Features: Covers all 12 languages; read speech collection; mobile-app-based collection; source text from Wikipedia & news
Data Source: AI4Bharat

Though the above datasets show extensive coverage across languages, some gaps remain:
Most of the datasets are read speech and lack the spontaneity of conversational speech
Limited domain coverage: education, religion, technical, and medical domains are missing
Most recordings are made in controlled environments and lack real-world background noise
Not every language is equally assessed, owing to a lack of data for relatively low-resource languages

Do you agree with the above points?

Current state of Leaderboards


What does the current non-Indic ASR leaderboard look like?
[Leaderboard screenshot]

What does the current Indic ASR leaderboard look like?
[Leaderboard screenshot]
Source: (reports WER scores for various datasets)

Key Questions Unaddressed by Leaderboards


Typical ASR leaderboards showcase:
Word Error Rate (WER) across various datasets
Real-Time Factor (RTFx) for processing speed (see the sketch after this list)
Performance on standardized datasets (e.g., LibriSpeech, AMI, TED-LIUM, GigaSpeech)
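For context, RTFx on such leaderboards is generally reported as the inverse real-time factor: seconds of audio transcribed per second of processing, so higher is faster. A minimal sketch with made-up timings:

```python
# Minimal sketch of the inverse real-time factor (RTFx); the timings are made up.
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of processing (higher is faster)."""
    return audio_seconds / processing_seconds

print(rtfx(audio_seconds=3600.0, processing_seconds=90.0))  # 40.0 -> 40x real time
```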

But what are some key shortcomings of leaderboards like these?

1/ Context-Specific Performance Thresholds
Critical Question: What constitutes an acceptable WER for specific use cases?
Current limitations:
No industry-specific benchmarks
Lack of context for interpreting WER scores
No guidance on minimum acceptable performance thresholds
Examples of varying WER requirements (a computation sketch follows this list):
Medical transcription: <5% WER typically required
Meeting transcription: <10% WER might be acceptable
Casual conversation: <15% WER could be sufficient
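To make these thresholds operational, a benchmark would need to compute WER per domain and compare it against the relevant threshold. A minimal sketch, assuming the open-source jiwer package and reusing the illustrative thresholds above:

```python
# Minimal sketch: compare a transcript's WER against a domain threshold.
# Uses the open-source jiwer package; the example sentences are made up.
import jiwer

THRESHOLDS = {"medical": 0.05, "meeting": 0.10, "casual": 0.15}

reference = "patient reports mild chest pain since monday morning"
hypothesis = "patient reports mild chest pain since monday mourning"

wer = jiwer.wer(reference, hypothesis)        # 1 substitution over 8 words = 0.125
acceptable = wer <= THRESHOLDS["medical"]     # fails the illustrative medical threshold
print(f"WER = {wer:.2%}, acceptable for medical use: {acceptable}")
```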

Can we say we need a standard metric for each domain and a better way to interpret it?

2/ Use Case Relevance
Critical Question: How well do benchmark datasets represent real-world applications?
Gaps in current evaluation:
Limited domain coverage
No industry-specific datasets (healthcare, legal, education)
Lack of real-world acoustic conditions
Missing everyday conversational scenarios
Absence of specialized vocabulary testing
No evaluation of handling domain-specific jargon

Can we say we need to benchmark ASR models on domain specific real world data?

3/ Speaker Recognition Capabilities
Critical Question: How well can the model handle speaker-related tasks?
Missing metrics for:
Speaker diarization accuracy (see the DER sketch after this list)
Speaker identification in multi-speaker scenarios
Performance with overlapping speech
Speaker verification capabilities
Handling of speaker transitions
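One way to surface these capabilities on a leaderboard would be to report diarization error rate (DER) alongside WER. A minimal sketch, assuming pyannote.core and pyannote.metrics are installed; the timestamps and speaker labels are made up:

```python
# Minimal sketch: diarization error rate on a toy two-speaker recording.
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

reference = Annotation()
reference[Segment(0.0, 5.0)] = "speaker_A"
reference[Segment(5.0, 9.0)] = "speaker_B"

hypothesis = Annotation()
hypothesis[Segment(0.0, 6.0)] = "spk1"   # speaker A detected one second too long
hypothesis[Segment(6.0, 9.0)] = "spk2"

metric = DiarizationErrorRate()
print(f"DER = {metric(reference, hypothesis):.2%}")  # ~11% for this toy example
```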

Can we say we need to have speaker diarization metrics in ASR leaderboard?

4/ Mixed Language Handling
Critical Question: How well do systems handle code-switching and mixed language usage?
Current limitations:
No standardized metrics for code-switching accuracy
Limited evaluation of language switching fluency
Missing assessment of contextual language detection
Examples of mixed language scenarios (a code-mixing measurement sketch follows this list):
Hindi-English in business meetings
Regional language mixing in casual conversations
Technical terms in native language discourse
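One candidate metric here is the Code-Mixing Index (CMI), as proposed by Gambäck and Das, which measures how far an utterance is from being monolingual given per-token language tags. A minimal sketch, with the word-level language identifier assumed to exist and the tags supplied by hand:

```python
# Minimal sketch of a Code-Mixing Index: 100 * (N - dominant-language tokens) / N,
# ignoring language-independent tokens. The per-token tags are hand-written here;
# a real pipeline would obtain them from a word-level language identifier.
from collections import Counter

def code_mixing_index(lang_tags: list[str]) -> float:
    tokens = [t for t in lang_tags if t != "other"]   # drop numerals/punctuation
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return 100.0 * (len(tokens) - max(counts.values())) / len(tokens)

# "Meeting kal subah 10 baje schedule kar lete hain" -> mostly Hindi with English terms
tags = ["en", "hi", "hi", "other", "hi", "en", "hi", "hi", "hi"]
print(code_mixing_index(tags))  # 25.0 -> a quarter of the tagged tokens are non-dominant
```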

Can we say we need specific evaluation frameworks for code-mixing scenarios?

5/ Accent Understanding
Critical Question: How do we measure robustness across regional accents?
Current limitations:
No standardized accent diversity metrics
Limited regional accent representation
Missing evaluation of accent adaptation capabilities
Examples of accent variations:
Rural vs urban accents
Regional influences on pronunciation
Education-level impact on speech patterns

Can we say we need accent-specific evaluation criteria?

6/ NER Understanding
Critical Question: How accurately can systems identify and transcribe named entities?
Current limitations:
No specific metrics for named entity accuracy
Missing evaluation of context-based entity recognition
Limited assessment of proper noun handling
Examples of NER challenges (an entity-recall sketch follows this list):
Person names in different languages
Location names with multiple pronunciations
Organization names with mixed language elements
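A simple starting point would be to report entity-level recall on top of WER: the share of reference named entities that survive transcription verbatim. A minimal sketch; the entity lists would normally come from gold annotations or a NER tagger and are hand-written here:

```python
# Minimal sketch: fraction of reference named entities found verbatim in a transcript.
def entity_recall(reference_entities: list[str], hypothesis_text: str) -> float:
    if not reference_entities:
        return 1.0
    hyp = hypothesis_text.lower()
    found = sum(1 for ent in reference_entities if ent.lower() in hyp)
    return found / len(reference_entities)

refs = ["Thiruvananthapuram", "Indian Railways", "Lal Bahadur Shastri"]
hyp = "tickets from thiruvananthapuram were booked on indian railways yesterday"
print(entity_recall(refs, hyp))  # 2 of 3 entities recovered -> ~0.67
```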

Can we say we need dedicated NER accuracy metrics in ASR evaluation?

7/ Environment Change Handling