Glocal Evaluation of Models

Pariksha
By Ishaan Watts, Vivek Seshadri, Manohar Swaminathan, Sunayana Sitaram (Microsoft Research India)
Pariksha aims to evaluate the performance of large language models (LLMs) for Indic languages in a scalable, democratic, and transparent manner.

📐 Evaluation Method

The first Pariksha Pilot compares the responses of different LLMs to prompts curated to be relevant to Indian languages, culture and ethos. Instead of relying on traditional multilingual benchmarking techniques such as those in our prior work MEGA [1] and MEGAVERSE [2], Pariksha leverages Karya, an ethical data collection platform, to conduct large-scale, high-quality human evaluation. The preferences obtained from human evaluation are converted into Elo scores to create the Pariksha leaderboard. We believe that current benchmarks are not sufficient to measure progress in Indic LLMs due to problems caused by contamination, benchmark translation and the lack of representative tasks in many traditional benchmarks. We plan to release all evaluation artifacts to enable the community to improve their models using the prompts, evaluation scores and preference data.
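As a rough illustration of this conversion, the sketch below computes Elo ratings from a stream of pairwise preferences. It is not Pariksha's actual implementation: the K-factor, the starting rating and the battle format are all assumptions (the fixed 800 ± 0 entries in the leaderboard suggest an anchored baseline rating, but the exact scheme is not described here).

```python
from collections import defaultdict

K = 32        # update step size; an assumed value, not Pariksha's documented one
BASE = 1000   # starting rating for every model; also an assumption

def expected_score(r_a: float, r_b: float) -> float:
    """Win probability of model A over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_ratings(battles):
    """Compute ratings from battles: (model_a, model_b, outcome) triples,
    where outcome is 1.0 (A wins), 0.0 (B wins) or 0.5 (tie)."""
    ratings = defaultdict(lambda: BASE)
    for a, b, outcome in battles:
        e_a = expected_score(ratings[a], ratings[b])
        ratings[a] += K * (outcome - e_a)
        ratings[b] += K * ((1.0 - outcome) - (1.0 - e_a))
    return dict(ratings)

# Toy usage with two hypothetical battle outcomes
print(elo_ratings([("GPT-4", "Llama-2 7B", 1.0),
                   ("Llama-3 70B", "GPT-4", 0.5)]))
```

Sequential Elo updates are order-sensitive; production leaderboards often reduce this by averaging over many random battle orders or fitting a Bradley-Terry model, refinements omitted from this sketch.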

In addition to human evaluation, we also employ LLMs as evaluators, building on recent research on multilingual evaluation with LLMs, METAL [3, 4]. This has the potential to augment human evaluation and increase the overall efficiency of the evaluation pipeline. We also present leaderboards created using LLMs as evaluators for the Pariksha Pilot.
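For a flavor of how a pairwise LLM-evaluator battle might be posed, here is a minimal, hypothetical prompt builder. The wording, the evaluation criteria and the verdict format are illustrative assumptions, not the exact rubric used by Pariksha or METAL:

```python
def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Assemble a pairwise-comparison ("battle") prompt for an LLM evaluator.

    The rubric below is a hypothetical placeholder, not Pariksha's actual one.
    """
    return (
        "You are judging two assistant responses to the same user prompt.\n\n"
        f"User prompt:\n{question}\n\n"
        f"Response A:\n{answer_a}\n\n"
        f"Response B:\n{answer_b}\n\n"
        "Compare the responses on linguistic quality and task completion, "
        "then reply with exactly one token: 'A', 'B', or 'TIE'."
    )
```

Constraining the verdict to a single token keeps parsing trivial and maps directly onto the win/loss/tie outcomes that an Elo computation consumes.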

🎖️ Leaderboard

The Pariksha Pilot was conducted in March 2024, and Round 1 is currently ongoing. The Round 1 leaderboard below should be treated as a preview; we plan to add more models in subsequent rounds of Pariksha.
🎖️ Pariksha Round 1 Leaderboard

Kannada (12 models)

| Model | Human-Eval Rank | Human-Eval Elo | LLM-Eval Rank | LLM-Eval Elo |
| --- | --- | --- | --- | --- |
| Llama-3 70B | 1 | 1,406 ± 41.07 | 1 | 1,535 ± 41.51 |
| AryaBhatta-GemmaOrca | 2 | 1,389 ± 39.42 | 4 | 1,433 ± 37.34 |
| AryaBhatta-GemmaUltra | 3 | 1,363 ± 38.62 | 3 | 1,484 ± 45.15 |
| GPT-4 | 4 | 1,355 ± 41.19 | 2 | 1,524 ± 35.71 |
| Kan-Llama | 5 | 1,269 ± 38.81 | 7 | 1,284 ± 41.86 |
| Llama-3 8B | 6 | 1,264 ± 39.74 | 6 | 1,310 ± 36.03 |
| Ambari | 7 | 1,259 ± 43.97 | 9 | 1,210 ± 33.20 |
| Navarasa | 8 | 1,249 ± 41.34 | 5 | 1,362 ± 40.53 |
| GPT-3.5-Turbo | 9 | 1,166 ± 40.55 | 8 | 1,220 ± 32.33 |
| Gemma 7B | 10 | 979 ± 41.26 | 10 | 1,070 ± 38.61 |
| Mistral 7B | 11 | 927 ± 38.84 | 11 | 866 ± 30.41 |
| Llama-2 7B | 12 | 800 ± 0 | 12 | 800 ± 0 |

Tamil (12 models)

| Model | Human-Eval Rank | Human-Eval Elo | LLM-Eval Rank | LLM-Eval Elo |
| --- | --- | --- | --- | --- |
| Llama-3 70B | 1 | 1,271 ± 34.70 | 4 | 1,505 ± 56.71 |
| AryaBhatta-GemmaUltra | 2 | 1,178 ± 30.24 | 6 | 1,470 ± 61.19 |
| AryaBhatta-GemmaOrca | 3 | 1,176 ± 28.56 | 2 | 1,542 ± 58.75 |
| Navarasa | 4 | 1,173 ± 29.98 | 3 | 1,533 ± 62.04 |
| GPT-4 | 5 | 1,138 ± 31.83 | 5 | 1,505 ± 55.02 |
| abhinand-Tamil | 6 | 1,132 ± 28.16 | 1 | 1,587 ± 60.79 |
| Llama-3 8B | 7 | 1,046 ± 27.88 | 8 | 1,199 ± 52.73 |
| SamwaadLLM | 8 | 1,037 ± 25.02 | 7 | 1,364 ± 58.40 |
| GPT-3.5-Turbo | 9 | 955 ± 27.61 | 10 | 1,167 ± 47.13 |
| Gemma 7B | 10 | 941 ± 27.81 | 9 | 1,168 ± 55.65 |
| Mistral 7B | 11 | 809 ± 24.11 | 12 | 717 ± 42.36 |
| Llama-2 7B | 12 | 800 ± 0 | 11 | 800 ± 0 |

📊 Summary of Data Points

The tables below summarize the number of models, languages and data points included in the Pariksha Pilot leaderboard.
What is a Data Point?
A data point is a single battle: an evaluator is shown a prompt together with responses from two LLMs and asked to pick the better response or declare a tie.
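Concretely, one such data point could be represented as follows; the field names and types here are illustrative, not Pariksha's released schema:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Battle:
    """One evaluation data point: a pairwise comparison ("battle")."""
    language: str                      # e.g. "Kannada"
    prompt: str                        # curated Indic-relevant prompt
    model_a: str                       # name of the first model
    model_b: str                       # name of the second model
    response_a: str                    # model_a's response to the prompt
    response_b: str                    # model_b's response to the prompt
    verdict: Literal["A", "B", "TIE"]  # evaluator's preference
```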
Summary for Human-Eval (Karya)

| Language | # Models | Total # human evaluation data points |
| --- | --- | --- |
| Hindi | 8 | 1,680 |
| Malayalam | 7 | 1,260 |
| Telugu | 7 | 1,260 |
| Kannada | 7 | 1,260 |
| Tamil | 6 | 900 |
Summary for LLM-Eval

| Language | # Models | Total # LLM evaluation data points |
| --- | --- | --- |
| English | 15 | 14,700 |
| Hindi | 8 | 1,680 |
| Malayalam | 7 | 1,260 |
| Telugu | 7 | 1,260 |
| Kannada | 7 | 1,260 |
| Tamil | 6 | 900 |
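These totals are consistent with every pair of models in a language being compared on a fixed prompt set: with m models there are m(m-1)/2 pairs, so Hindi's 8 models give 28 pairs × 60 prompts = 1,680 battles, Tamil's 6 models give 15 × 60 = 900, and English's 15 models give 105 pairs × 140 prompts = 14,700. The per-pair prompt counts (60 and 140) are inferred from the tables rather than stated explicitly.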

🧵 References

[1] Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Mohamed Ahmed, Kalika Bali, and Sunayana Sitaram. 2023. MEGA: Multilingual Evaluation of Generative AI. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4232–4267, Singapore. Association for Computational Linguistics.
[2] Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali, and Sunayana Sitaram. 2024. MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks.
[3] Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, and Sunayana Sitaram. 2024. Are Large Language Model-Based Evaluators the Solution to Scaling Up Multilingual Evaluation? In Findings of the Association for Computational Linguistics: EACL 2024, pages 1051–1070, St. Julian’s, Malta. Association for Computational Linguistics.
[4] Rishav Hada, Varun Gumma, Mohamed Ahmed, Kalika Bali, and Sunayana Sitaram. 2024. METAL: Towards Multilingual Meta-Evaluation.

Share Feedback

Reach out to the Pariksha team to share feedback.