By Ishaan Watts, Vivek Seshadri, Manohar Swaminathan, Sunayana Sitaram (Microsoft Research India)
Pariksha aims to evaluate the performance of large language models (LLMs) for Indic languages in a scalable, democratic, and transparent manner.
📐 Evaluation Method
The first Pariksha Pilot compares the responses of different LLMs to prompts curated to be relevant to Indian languages, culture, and ethos. Instead of using traditional multilingual benchmarking techniques such as those in our prior work MEGA [1] and MEGAVERSE [2], Pariksha leverages Karya, an ethical data collection platform, to conduct large-scale, high-quality human evaluation. The ranks obtained from human evaluation are converted into Elo scores to create the Pariksha leaderboard. We believe that current benchmarks are not sufficient to measure progress in Indic LLMs due to problems caused by contamination, benchmark translation, and the lack of representative tasks in many traditional benchmarks. We plan to release all evaluation artifacts to enable the community to improve their models using the prompts, evaluation scores, and preference data.
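To make the rating scheme concrete, here is a minimal sketch of one standard way to turn pairwise battle outcomes into Elo ratings. The K-factor, initial rating, and sequential update order are illustrative assumptions, not Pariksha's exact procedure; the ± values in the tables below are confidence intervals on the Elo estimates, which can be obtained, for example, by bootstrapping over the set of battles.

```python
from collections import defaultdict

# Illustrative Elo update over a stream of battles. K and INITIAL_RATING
# are assumptions; Pariksha's exact parameters and anchoring may differ
# (the leaderboard appears to anchor Llama-2 7B at a fixed 800).
K = 32
INITIAL_RATING = 1000

def compute_elo(battles):
    """battles: iterable of (model_a, model_b, winner), winner in {"a", "b", "tie"}."""
    ratings = defaultdict(lambda: INITIAL_RATING)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the standard Elo logistic model.
        expected_a = 1 / (1 + 10 ** ((rb - ra) / 400))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] = ra + K * (score_a - expected_a)
        ratings[model_b] = rb + K * ((1 - score_a) - (1 - expected_a))
    return dict(ratings)

print(compute_elo([("GPT-4", "Llama-2 7B", "a"), ("Llama-3 70B", "GPT-4", "tie")]))
```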
In addition to human evaluation, we also employ LLMs as evaluators, building on recent research on multilingual evaluation, METAL [3, 4]. This has the potential to augment human evaluation and increase the overall efficiency of the evaluation pipeline. We also present leaderboards created using LLMs as evaluators for the Pariksha Pilot.
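As a rough sketch of what an LLM-as-evaluator battle looks like, assuming an OpenAI-style chat API; the judge prompt and judge model below are stand-ins, not the actual METAL or Pariksha evaluation setup:

```python
# Minimal LLM-as-judge sketch, assuming the OpenAI v1 Python client.
# The prompt template and judge model are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """You are judging two responses to the same prompt.

Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}

Which response is better? Answer with exactly one of: A, B, TIE."""

def llm_judge(prompt: str, response_a: str, response_b: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4",  # assumed judge model
        temperature=0,  # deterministic verdicts
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            prompt=prompt, response_a=response_a, response_b=response_b)}],
    )
    return completion.choices[0].message.content.strip()
```

In practice, pairwise judges of this kind are often run twice with the response order swapped, to reduce position bias in the verdicts.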
More details on the evaluation process can be found in the Pariksha paper.
The Pariksha Pilot was conducted in March 2024, and Round 1 is currently ongoing. The Round 1 leaderboard should be treated as a preview. We plan to add more models in subsequent rounds of Pariksha.
🎖️ Pariksha Round 1 Leaderboard
**Kannada** (12 models)

| Model | Human-Eval Rank | Human-Eval Elo | LLM-Eval Rank | LLM-Eval Elo |
|---|---|---|---|---|
| Llama-3 70B | 1 | 1406 ± 41.07 | 1 | 1535 ± 41.51 |
| AryaBhatta-GemmaOrca | 2 | 1389 ± 39.42 | 4 | 1433 ± 37.34 |
| AryaBhatta-GemmaUltra | 3 | 1363 ± 38.62 | 3 | 1484 ± 45.15 |
| GPT-4 | 4 | 1355 ± 41.19 | 2 | 1524 ± 35.71 |
| Kan-Llama | 5 | 1269 ± 38.81 | 7 | 1284 ± 41.86 |
| Llama-3 8B | 6 | 1264 ± 39.74 | 6 | 1310 ± 36.03 |
| Ambari | 7 | 1259 ± 43.97 | 9 | 1210 ± 33.2 |
| Navarasa | 8 | 1249 ± 41.34 | 5 | 1362 ± 40.53 |
| GPT-3.5-Turbo | 9 | 1166 ± 40.55 | 8 | 1220 ± 32.33 |
| Gemma 7B | 10 | 979 ± 41.26 | 10 | 1070 ± 38.61 |
| Mistral 7B | 11 | 927 ± 38.84 | 11 | 866 ± 30.41 |
| Llama-2 7B | 12 | 800 ± 0 | 12 | 800 ± 0 |

**Tamil** (12 models)

| Model | Human-Eval Rank | Human-Eval Elo | LLM-Eval Rank | LLM-Eval Elo |
|---|---|---|---|---|
| Llama-3 70B | 1 | 1271 ± 34.7 | 4 | 1505 ± 56.71 |
| AryaBhatta-GemmaUltra | 2 | 1178 ± 30.24 | 6 | 1470 ± 61.19 |
| AryaBhatta-GemmaOrca | 3 | 1176 ± 28.56 | 2 | 1542 ± 58.75 |
| Navarasa | 4 | 1173 ± 29.98 | 3 | 1533 ± 62.04 |
| GPT-4 | 5 | 1138 ± 31.83 | 5 | 1505 ± 55.02 |
| abhinand-Tamil | 6 | 1132 ± 28.16 | 1 | 1587 ± 60.79 |
| Llama-3 8B | 7 | 1046 ± 27.88 | 8 | 1199 ± 52.73 |
| SamwaadLLM | 8 | 1037 ± 25.02 | 7 | 1364 ± 58.4 |
| GPT-3.5-Turbo | 9 | 955 ± 27.61 | 10 | 1167 ± 47.13 |
| Gemma 7B | 10 | 941 ± 27.81 | 9 | 1168 ± 55.65 |
| Mistral 7B | 11 | 809 ± 24.11 | 12 | 717 ± 42.36 |
| Llama-2 7B | 12 | 800 ± 0 | 11 | 800 ± 0 |
📊 Summary of Data Points
The tables below summarize the number of models, languages and data points included in the Pariksha Pilot leaderboard.
What is a Data Point?
A data point is a single battle, where an evaluator is shown a prompt with responses from two LLMs and asked to pick which one is better, or declare a tie.
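For illustration, a single data point could be stored as a record like the one below; the field names are hypothetical, not the schema of Pariksha's released artifacts.

```python
from dataclasses import dataclass

@dataclass
class Battle:
    """One battle: a prompt, two model responses, and an evaluator verdict."""
    language: str
    prompt: str
    model_a: str
    model_b: str
    response_a: str
    response_b: str
    winner: str  # "a", "b", or "tie"

battle = Battle(
    language="Kannada",
    prompt="<curated prompt>",
    model_a="GPT-4",
    model_b="Llama-3 70B",
    response_a="<response>",
    response_b="<response>",
    winner="tie",
)
```

In the Elo update sketched earlier, a tie contributes a score of 0.5 to each model in the battle.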
📄 References

[1] Kabir Ahuja et al. 2023. MEGA: Multilingual Evaluation of Generative AI. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4232–4267, Singapore. Association for Computational Linguistics.
[2] Sanchit Ahuja et al. 2024. MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Mexico City, Mexico. Association for Computational Linguistics.
[3] Rishav Hada et al. 2024. Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation? In Findings of the Association for Computational Linguistics: EACL 2024, pages 1051–1070, St. Julian’s, Malta. Association for Computational Linguistics.
[4] Rishav Hada et al. 2024. METAL: Towards Multilingual Meta-Evaluation. In Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico. Association for Computational Linguistics.