Authors
Sagar Sarkale [People+ai]
Last updated: 20th November, 2024
Introduction
This document primarily discusses the current state of LLM and Indic LLM evaluations, their shortcomings in both Indic and non-Indic spaces, and how they can be improved to create a robust benchmark for Indic models.
Additionally, this document serves as an interactive resource where you can contribute your insights and suggestions through the provided form, helping refine the thesis for evaluation of Indic LLMs based on real-world expertise and requirements.
Why do we need Indic LLM evaluations and Benchmarks?
There is no central platform for developers and the broader community to identify and use the current best Indic models, which in turn hampers the overall progress of Indic models. The majority of Indic models are benchmarked on translated versions of existing English benchmarks rather than on native content, which misses linguistic patterns and cultural aspects unique to Indian languages. The absence of standardized metrics for deciding whether an Indic model suits a given use case calls for assessing the capabilities of these models under realistic, India-specific scenarios, to ensure they meet the needs of Indian end users.
Which Non Indic LLM benchmarks exist?
LLM benchmarks try to assess a wide variety of model capabilities. Along with natural language generation and interpretation tasks, these benchmarks also try to evaluate knowledge across multiple domains. While concepts in Math, Science, and Common Sense remain the same in the context of Indian regions, the way questions are framed in India would be different.
Example 1 - Math
[English] “Alice bought 15 cookies for $150. What price would she have paid for 10 cookies?”
[Hindi] "Riya ne 5 kilo tamatar 200 rupay ke bhav se kharida. Yadi woh 2 kilo tamatar khareedti toh kitne rupay kharch hote?"
[Hindi] “रिया ने 5 किलो टमाटर 200₹ के भाव से खरीदा। यदि वह 2 किलो टमाटर खरीदती तो कितने रुपए खर्च होते?”
(Translation: “Riya bought 5 kg of tomatoes at the rate of ₹200. If she had bought 2 kg of tomatoes, how much would she have spent?”)
Example 2 - Common Sense
[English] “The trophy doesn't fit in the brown suitcase because it's too _. (small/big)”
[Hindi] “Pinki ko Neetu ki Diwali ki mithai pasand nahi aayi kyonki woh bahot ___ thi। (khatti/mithi)”
[Hindi] “पिंकी को नीतू की दिवाली की मिठाई पसंद नहीं आई क्योंकि वह बहुत ___ थी। (खट्टी/मीठी)”
(Translation: “Pinki did not like Neetu's Diwali sweets because they were too ___. (sour/sweet)”)
The table below lists various benchmarks and their adaptations for the Indian context.
Is adapting an existing English benchmark to the Indian context a value add to the current state of Indic model benchmarking?
If you disagree with the above statement, please add your reason.
Which Indic LLM benchmarks exist?
What we can observe from existing benchmarks:
Strong reliance on translation as a methodology for creating benchmark data.
Human verification is generally used as a quality-control step after automated collection/generation.
Most of the benchmarks use existing resources (Wikipedia, news, other datasets) as a primary source.
Recent benchmarks (2024) show more focus on systematic translation pipelines and verification processes.
Where can we innovate or add value?
We can collect or generate quality benchmark data which has not been built yet. We can create a new evaluation metric to address the limitations of existing ones.
Below is a comprehensive list of Indic benchmarks and their details.
Based on existing data sources we can categorise the sources as follows.
Summary of categories of data
What other categories or sources of data should we consider to build a more robust benchmark?
Limitations of existing LLM evaluations
1/ Most of the current evaluations, as well as training datasets, are literal translations of existing English benchmarks.
People do not talk the way current datasets are written; language is much simpler in day-to-day usage.
The meaning of a translated sentence can change entirely. For example, in a task where the model must predict what caused the premise, this happened:
e.g. Premise: “The girl ran out of energy.” > “लड़की ऊर्जा से भाग गई।” (literally, “the girl ran away from energy”)
e.g. Premise: “The bolt tightened.” > “बोल्ट कड़ा हो गया।”
Can we say we need better translation/adaptation of benchmarks?
2/ Currently all the benchmarks operate on standardized versions of language and do not capture the dialect and regional nuances.
Can we say we need better coverage of various regions and their language usage?
3/ Most of the datasets for both training and evaluation are built from openly available newspapers, government websites, and the Wikipedia of the given language, all of which use very formal language.
These evaluations will penalize informal but valid language usage.
Can we say we need our benchmarks to cover both formal and informal language usage?
4/ Translation of existing machine translation and natural language interpretation benchmark datasets:
One of the benchmarks is a translation of FLORES (Meta's dataset for evaluating machine translation) for assessing the capabilities of Indic models. This dataset contains proper nouns and entities that Indian users are highly unlikely to ever use: e.g. “Mr. Rudd said XYZ”, “Bucharest City Hall had XYZ”, and so on.
Can we say we need to collect information about local places, names, and objects specific to a given region in order to create more robust benchmarks?
5/ Metrics used for machine translation (limitations of BLEU in the Indic context)
A valid word order in an Indic language may be penalized because it differs from the reference answer's order.
Honorifics and number of references: e.g. “Did you eat food?” -> “Aapne khana khaya?”, “Tumne khana khaya?”. Both are valid; which one scores well depends on which reference translation is present in the dataset. Existing benchmarks primarily have only one reference translation.
Similarly, other tasks also have a single reference answer against which LLM responses are compared.
Can we say that having multiple reference answers, to capture the varied valid responses of a language model, helps in accurately benchmarking the generation capabilities of LLMs?
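The single-reference problem above can be made concrete. Below is a minimal sketch of BLEU-style clipped n-gram precision that supports multiple references (the clip for each n-gram is the maximum count across all references, as in BLEU); the whitespace tokenization and the Hindi example sentences are simplifications for illustration, not taken from any existing benchmark.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hypothesis, references, n):
    """Modified n-gram precision with multi-reference clipping, as in BLEU."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    # For each n-gram, the clip is its max count across ALL references,
    # so any valid reference phrasing can give credit.
    max_ref_counts = Counter()
    for ref in references:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref_counts[g] = max(max_ref_counts[g], c)
    clipped = sum(min(c, max_ref_counts[g]) for g, c in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

# "Aapne khana khaya?" and "Tumne khana khaya?" are both valid translations
# of "Did you eat food?" -- with one reference, one of them is penalized.
hyp = "Tumne khana khaya ?".split()
single_ref = ["Aapne khana khaya ?".split()]
multi_ref = ["Aapne khana khaya ?".split(), "Tumne khana khaya ?".split()]

print(clipped_precision(hyp, single_ref, 1))  # 0.75 -- "Tumne" gets no credit
print(clipped_precision(hyp, multi_ref, 1))   # 1.0
```

With a single reference the honorific variant loses a quarter of its unigram credit despite being fully correct; adding the second reference removes the penalty, which is the argument for multi-reference Indic benchmarks.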
6/ Domain knowledge understanding and benchmarks
Currently we know that LLMs (non-Indic SOTA models) can accurately generate content and answer questions across multiple domains.
Can we say that building a knowledge base across multiple domains would be a valuable asset for benchmarking Indic models on knowledge-based tasks?
Can we say that state-school-curriculum-style tests for LLMs would be a reflective measure of how smart a model is, allowing claims like “ABC Indic model is smarter than a 7th grader”?
7/ In order to evaluate an Indic model at scale, we currently have to rely heavily on human evaluators and their expertise in the topic, which can be unreliable and very time-consuming.
As benchmark datasets grow both in size and diversity, finding enough human evaluators will not be feasible.
Can we say we need a mechanism which will act as “Indic - LLM Judge” to evaluate responses of models against a reference?
What are some precursors that would be needed to get started with “Indic - LLM Judge”?
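One such precursor is the scaffolding around the judge model itself: a rubric prompt and a strict-format score parser, so that judge outputs can be aggregated automatically. The sketch below is purely hypothetical, the prompt template, 1-5 scale, and "SCORE:" reply format are assumptions for illustration, and the actual call to a judge model is deliberately left out (any LLM client could be plugged in).

```python
import re

# Hypothetical rubric prompt for an "Indic - LLM Judge". The wording, the
# 1-5 scale, and the strict reply format are illustrative assumptions.
JUDGE_PROMPT = """You are an impartial evaluator of {language} responses.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Score the model answer from 1 to 5 for correctness and fluency in {language}.
Reply strictly as: SCORE: <number>"""

def build_judge_prompt(question, reference, answer, language="Hindi"):
    """Fill the rubric template for one evaluation item."""
    return JUDGE_PROMPT.format(question=question, reference=reference,
                               answer=answer, language=language)

def parse_score(judge_reply):
    """Extract the numeric score; return None if the judge did not comply,
    so non-compliant replies can be re-queried or discarded."""
    m = re.search(r"SCORE:\s*([1-5])", judge_reply)
    return int(m.group(1)) if m else None
```

Even this small sketch surfaces the real precursors: a judge model strong enough in the target Indic language, an agreed rubric per task type, and a compliance/re-query strategy for malformed judge replies.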
8/ Current benchmarks do not distinguish between SLMs and LLMs. SLMs naturally lag behind on knowledge tasks, but can perform language tasks on par with LLMs.
Can we say we need a different way of evaluating SLMs and LLMs in the Indic space?
ASR model and evaluations
Introduction
This document examines Automatic Speech Recognition (ASR) systems, with special attention to their application in Indian languages. It analyzes current limitations in both Indian and non-Indian language contexts, and proposes ways to strengthen evaluation standards to create more reliable benchmarking for Indian language ASR models.
Moreover, it is designed as a collaborative platform where you can share your expertise and recommendations through an integrated feedback form. Your input will help shape better evaluation criteria for Indian language ASR systems based on practical experience and actual needs.
Why do we need Indic ASR evaluations and Benchmarks?
Vast Regional Language Diversity across India presents unique challenges, with multiple speaking styles, accents, and dialects, making it crucial to evaluate ASR models against this complex linguistic landscape.
Lack of Standardized Evaluation Methods prevents reliable assessment of ASR model performance, making it difficult to quantify accuracy and effectiveness across different Indian languages and dialects.
Absence of a Centralized Comparison Platform limits developers and researchers from systematically identifying model limitations and benchmarking different ASR systems.
Which Non Indic ASR benchmarks exist?
Which Indic ASR benchmarks exist?
Though the above datasets show extensive coverage across languages, some gaps remain:
Most of the datasets are read speech and lack the spontaneity of conversational speech.
Limited coverage of domains: Education, Religion, Technical, and Medical are missing.
Most of the recordings are made in controlled environments and lack real-world background noise.
Not every language is equally assessed, due to a lack of data in relatively low-resource languages.
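For context on how such ASR benchmarks typically score models: the standard metric is Word Error Rate (WER), the word-level edit distance (substitutions + insertions + deletions) normalised by reference length. A minimal self-contained implementation, with an illustrative Hindi utterance of our choosing:

```python
def word_error_rate(reference, hypothesis):
    """WER: word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One dropped word out of five reference words -> WER = 0.2
print(word_error_rate("मैं खाना खा रहा हूँ", "मैं खाना खा रहा"))  # 0.2
```

Note that plain WER can be harsh for Indic scripts, where valid spelling variants of the same word are common; character error rate (CER) is therefore often reported alongside WER for Indian languages.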