Pariksha aims to evaluate the performance of large language models (LLMs) for Indic languages in a scalable, democratic, and transparent manner.
📐 Evaluation Method
The first Pariksha Pilot compares the responses of different LLMs to prompts curated to be relevant to Indian languages, culture, and ethos. Instead of relying on traditional multilingual benchmarking techniques, such as those in our prior work MEGA [1] and MEGAVERSE [2], Pariksha leverages Karya, an ethical data-collection platform, to conduct large-scale, high-quality human evaluation. The rankings obtained from human evaluation are converted into Elo scores to create the Pariksha leaderboard. We believe that current benchmarks are insufficient to measure progress in Indic LLMs due to contamination, benchmark translation artifacts, and the lack of representative tasks in many traditional benchmarks. We plan to release all evaluation artifacts so that the community can improve their models using the prompts, evaluation scores, and preference data.
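To make the rank-to-Elo conversion concrete, here is a minimal sketch of turning pairwise preference outcomes into Elo scores using the standard online Elo update. The K-factor, initial rating, and sample battles are illustrative assumptions, not Pariksha's actual parameters or data.

```python
# Minimal sketch: converting pairwise "battle" outcomes into Elo ratings.
# K-factor, initial rating, and the sample battles below are illustrative
# assumptions, not Pariksha's actual setup.

def expected_score(r_a, r_b):
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, model_a, model_b, outcome, k=32):
    """outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += k * (outcome - e_a)
    ratings[model_b] += k * ((1.0 - outcome) - (1.0 - e_a))

# Hypothetical battles: (model_a, model_b, outcome from the evaluator's pick)
battles = [
    ("model-x", "model-y", 1.0),   # evaluator preferred model-x
    ("model-y", "model-z", 0.5),   # tie
    ("model-x", "model-z", 1.0),   # evaluator preferred model-x
]

ratings = {"model-x": 1000.0, "model-y": 1000.0, "model-z": 1000.0}
for a, b, outcome in battles:
    update_elo(ratings, a, b, outcome)

# Sort descending by rating to produce a leaderboard.
leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
```

The Pariksha leaderboard itself uses maximum-likelihood Elo (the "MLE Elo" in the tables below), which fits all battles jointly rather than applying sequential updates, but the underlying pairwise-preference model is the same.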
In addition to human evaluation, we also employ LLMs as evaluators, building on recent research on multilingual evaluation with LLMs (METAL [3, 4]). This has the potential to augment human evaluation and increase the overall efficiency of the evaluation pipeline. We also present leaderboards for the Pariksha Pilot created using LLMs as evaluators.
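An LLM-as-evaluator battle can be sketched as follows. The prompt wording, verdict labels, and the `call_llm` stand-in are all hypothetical; any chat-completion API can be plugged in, and Pariksha's exact evaluator prompts may differ.

```python
# Hypothetical sketch of an LLM-as-evaluator "battle". The prompt template,
# verdict labels, and `call_llm` interface are assumptions for illustration.

def build_battle_prompt(question, response_a, response_b):
    """Assemble a pairwise-comparison prompt for the evaluator LLM."""
    return (
        "You are evaluating two assistant responses to a question.\n"
        f"Question: {question}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response is better? Answer with exactly one of: A, B, TIE."
    )

def parse_verdict(raw):
    """Normalize the evaluator's reply; return None if it is not a valid verdict."""
    verdict = raw.strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else None

def judge(question, response_a, response_b, call_llm):
    """call_llm: any callable that takes a prompt string and returns the model's reply."""
    return parse_verdict(call_llm(build_battle_prompt(question, response_a, response_b)))

# Usage with a dummy evaluator that always declares a tie:
verdict = judge("<prompt>", "<answer 1>", "<answer 2>", lambda p: "tie")
```

The resulting verdicts can feed the same Elo computation as human judgments, which is what makes LLM evaluators a drop-in way to scale up the pipeline.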
More details on the evaluation process can be found in the Pariksha paper.
The Pariksha Pilot was conducted in March 2024 and Round 1 is currently ongoing. The Round 1 leaderboard should be treated as a preview. We plan to add more models in subsequent rounds of Pariksha.
Pariksha Pilot Leaderboard
🎖️ Pariksha Round 1 Leaderboard
MLE Elo by Language
Pariksha - MLE Elo for Bengali
Pariksha - MLE Elo for Gujarati
Pariksha - MLE Elo for Hindi
Pariksha - MLE Elo for Kannada
Pariksha - MLE Elo for Malayalam
Pariksha - MLE Elo for Marathi
Pariksha - MLE Elo for Punjabi
Pariksha - MLE Elo for Odia
Pariksha - MLE Elo for Tamil
Pariksha - MLE Elo for Telugu
📊 Summary of Data Points
The tables below summarize the number of models, languages and data points included in the Pariksha Pilot leaderboard.
What is a Data Point?
A data point is a single battle: an evaluator is shown a prompt along with responses from two LLMs and asked to pick the better response or declare a tie.
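As a rough sense of scale: with n models, each prompt admits n·(n−1)/2 distinct pairwise battles. A quick illustration (the counts are hypothetical, not Pariksha's actual totals):

```python
# Illustrative arithmetic: with n models, each prompt admits n*(n-1)/2 distinct
# pairwise battles. The numbers below are hypothetical examples.
from math import comb

def battles_per_round(n_models, n_prompts, n_languages):
    """Total data points if every model pair battles on every prompt in every language."""
    return comb(n_models, 2) * n_prompts * n_languages

# e.g. 10 models -> comb(10, 2) = 45 pairs per prompt;
# with 100 prompts in each of 10 languages:
total = battles_per_round(10, 100, 10)  # 45 * 100 * 10 = 45000
```

This quadratic growth in the number of model pairs is one reason human evaluation alone is hard to scale, and why LLM evaluators are attractive as a complement.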
📚 References
MEGA: Multilingual Evaluation of Generative AI. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4232–4267, Singapore. Association for Computational Linguistics.
Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation? In Findings of the Association for Computational Linguistics: EACL 2024, pages 1051–1070, St. Julian's, Malta. Association for Computational Linguistics.