Pariksha aims to evaluate the performance of large language models (LLMs) for Indic languages in a scalable, democratic, and transparent manner.
📐 Evaluation Method
The first Pariksha Pilot compares the responses of different LLMs to prompts curated to be relevant to Indian languages, culture and ethos. Instead of relying on traditional multilingual benchmarking techniques, such as those in our prior work MEGA [1] and MEGAVERSE [2], Pariksha leverages Karya, an ethical data collection platform, to conduct large-scale, high-quality human evaluation. The rankings obtained from human evaluation are converted into Elo scores to create the Pariksha leaderboard. We believe that current benchmarks are insufficient to measure progress in Indic LLMs due to contamination, benchmark translation artifacts and the lack of representative tasks in many traditional benchmarks. We plan to release all evaluation artifacts so that the community can improve their models using the prompts, evaluation scores and preference data.
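The conversion from pairwise preferences to Elo scores can be sketched as a sequential Elo update over battles. This is a minimal illustration, not the exact Pariksha implementation (the leaderboard also reports MLE-fitted Elo); the function name and K-factor are conventional choices, not taken from Pariksha:

```python
def elo_update(rating_a, rating_b, outcome, k=32):
    """One Elo update for a single battle between models A and B.

    outcome: 1.0 if A's response was preferred, 0.0 if B's was, 0.5 for a tie.
    k is the update step size (the K-factor); 32 is a conventional choice.
    """
    # Expected score of A under the Elo model (logistic in the rating gap).
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    rating_a += k * (outcome - expected_a)
    rating_b += k * ((1 - outcome) - (1 - expected_a))
    return rating_a, rating_b

# Two equally rated models: a win moves the winner up by k/2.
print(elo_update(1000, 1000, 1.0))  # -> (1016.0, 984.0)
```

Because each update only moves ratings by a bounded step, a model's final score reflects the aggregate of all its battles rather than any single judgment.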
In addition to human evaluation, we also employ LLMs as evaluators, building on recent research on multilingual evaluation with LLMs, including METAL [3, 4]. This has the potential to augment human evaluation and increase the overall efficiency of the evaluation pipeline. We also present leaderboards created using LLMs as evaluators for the Pariksha Pilot.
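An LLM-as-evaluator battle follows the same pairwise pattern as the human evaluation: the judge model sees a prompt and two responses and returns a verdict that feeds into the Elo computation. The prompt text and parsing below are purely illustrative; the actual Pariksha/METAL evaluation prompts and rubrics differ:

```python
# Hypothetical judge prompt for LLM-as-evaluator battles (illustrative only).
JUDGE_PROMPT = (
    "You are given a user prompt in an Indic language and two model "
    "responses, A and B. Decide which response is better, judging "
    "helpfulness, fluency and cultural appropriateness. "
    "Answer with exactly one word: A, B, or TIE."
)

def parse_verdict(reply: str) -> float:
    """Map the judge's one-word verdict to a battle outcome for Elo."""
    return {"A": 1.0, "B": 0.0, "TIE": 0.5}[reply.strip().upper()]
```

The resulting outcomes can be aggregated exactly like human judgments, which is what makes the Human-Eval and LLM-Eval columns of the leaderboard directly comparable.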
More details on the evaluation process can be found in the Pariksha paper.
The Pariksha Pilot was conducted in March 2024 and Round 1 is currently ongoing. The Round 1 leaderboard should be treated as a preview. We plan to add more models in subsequent rounds of Pariksha.
Pariksha Pilot Leaderboard
🎖️ Pariksha Round 1 Leaderboard
Kannada (12 models)

| Model | Human-Eval Rank | Human-Eval Elo | LLM-Eval Rank | LLM-Eval Elo |
| --- | --- | --- | --- | --- |
| Llama-3 70B | 1 | 1406 ± 41.07 | 1 | 1535 ± 41.51 |
| AryaBhatta-GemmaOrca | 2 | 1389 ± 39.42 | 4 | 1433 ± 37.34 |
| AryaBhatta-GemmaUltra | 3 | 1363 ± 38.62 | 3 | 1484 ± 45.15 |
| GPT-4 | 4 | 1355 ± 41.19 | 2 | 1524 ± 35.71 |
| Kan-Llama | 5 | 1269 ± 38.81 | 7 | 1284 ± 41.86 |
| Llama-3 8B | 6 | 1264 ± 39.74 | 6 | 1310 ± 36.03 |
| Ambari | 7 | 1259 ± 43.97 | 9 | 1210 ± 33.2 |
| Navarasa | 8 | 1249 ± 41.34 | 5 | 1362 ± 40.53 |
| GPT-3.5-Turbo | 9 | 1166 ± 40.55 | 8 | 1220 ± 32.33 |
| Gemma 7B | 10 | 979 ± 41.26 | 10 | 1070 ± 38.61 |
| Mistral 7B | 11 | 927 ± 38.84 | 11 | 866 ± 30.41 |
| Llama-2 7B | 12 | 800 ± 0 | 12 | 800 ± 0 |

Tamil (12 models)

| Model | Human-Eval Rank | Human-Eval Elo | LLM-Eval Rank | LLM-Eval Elo |
| --- | --- | --- | --- | --- |
| Llama-3 70B | 1 | 1271 ± 34.7 | 4 | 1505 ± 56.71 |
| AryaBhatta-GemmaUltra | 2 | 1178 ± 30.24 | 6 | 1470 ± 61.19 |
| AryaBhatta-GemmaOrca | 3 | 1176 ± 28.56 | 2 | 1542 ± 58.75 |
| Navarasa | 4 | 1173 ± 29.98 | 3 | 1533 ± 62.04 |
| GPT-4 | 5 | 1138 ± 31.83 | 5 | 1505 ± 55.02 |
| abhinand-Tamil | 6 | 1132 ± 28.16 | 1 | 1587 ± 60.79 |
| Llama-3 8B | 7 | 1046 ± 27.88 | 8 | 1199 ± 52.73 |
| SamwaadLLM | 8 | 1037 ± 25.02 | 7 | 1364 ± 58.4 |
| GPT-3.5-Turbo | 9 | 955 ± 27.61 | 10 | 1167 ± 47.13 |
| Gemma 7B | 10 | 941 ± 27.81 | 9 | 1168 ± 55.65 |
| Mistral 7B | 11 | 809 ± 24.11 | 12 | 717 ± 42.36 |
| Llama-2 7B | 12 | 800 ± 0 | 11 | 800 ± 0 |
MLE Elo by Language
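Unlike sequential Elo, MLE Elo fits all ratings jointly by maximum likelihood under a Bradley-Terry model, so the result does not depend on the order in which battles are processed. The sketch below is a minimal illustration under two assumptions visible in the tables (ties count as half a win for each side, and an anchor model, Llama-2 7B, is pinned at 800 ± 0); function and parameter names are illustrative, and the ± confidence intervals (e.g. from bootstrapping) are not computed here:

```python
def mle_elo(battles, anchor, base=800.0, scale=400.0, lr=10.0, iters=2000):
    """Fit Elo ratings by maximum likelihood (Bradley-Terry model).

    battles: list of (model_a, model_b, outcome), outcome in {1.0, 0.0, 0.5}
             (a tie counts as half a win for each side).
    The anchor model is pinned at `base`, matching the leaderboard tables
    where Llama-2 7B sits at 800 with zero uncertainty.
    """
    models = {m for a, b, _ in battles for m in (a, b)}
    rating = {m: base for m in models}
    for _ in range(iters):
        # Gradient ascent on the Bradley-Terry log-likelihood.
        grad = {m: 0.0 for m in models}
        for a, b, outcome in battles:
            # Probability that A beats B under the current ratings.
            p_a = 1 / (1 + 10 ** ((rating[b] - rating[a]) / scale))
            grad[a] += outcome - p_a
            grad[b] += (1 - outcome) - (1 - p_a)
        for m in models:
            rating[m] += lr * grad[m]
        # Re-anchor so the reference model stays exactly at `base`.
        shift = base - rating[anchor]
        for m in models:
            rating[m] += shift
    return rating

# A beats B in 3 of 4 battles -> MLE gap is 400*log10(3), about 191 points.
r = mle_elo([("A", "B", 1.0)] * 3 + [("A", "B", 0.0)], anchor="B")
```

Because the fit is joint, adding or reordering battles reshuffles all ratings consistently, which is why these per-language tables can include models evaluated at different times.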
Pariksha - MLE Elo for Bengali

| Model | Rank (Human) | Elo Rating (Human) | Rank (LLM) | Elo Rating (LLM) |
| --- | --- | --- | --- | --- |
| GPT-4o | 1 | 1551 ± 18.95 | 3 | 1604 ± 22.73 |
| Llama-3 70B | 2 | 1444 ± 13.09 | 5 | 1538 ± 18.09 |
| Gemini-Pro 1.0 | 3 | 1444 ± 15.93 | 2 | 1672 ± 21.87 |
| GPT-4 | 4 | 1346 ± 12.59 | 4 | 1598 ± 20.44 |
| SamwaadLLM | 5 | 1247 ± 11.98 | 1 | 1688 ± 21.51 |
| Llama-3 8B | 6 | 1116 ± 12.37 | 6 | 1233 ± 16.0 |
| Navarasa | 7 | 1095 ± 12.27 | 11 | 955 ± 12.85 |
| AryaBhatta-GemmaOrca | 8 | 1067 ± 10.7 | 10 | 975 ± 12.91 |
| AryaBhatta-Llama3GenZ | 9 | 1066 ± 10.17 | 7 | 1157 ± 14.33 |
| GPT-3.5-Turbo | 10 | 1053 ± 10.71 | 8 | 1086 ± 13.49 |
| AryaBhatta-GemmaUltra | 11 | 1025 ± 10.88 | 12 | 935 ± 13.08 |
| OdiaGenAI-Bengali | 12 | 860 ± 9.39 | 15 | 719 ± 11.09 |
| Gemma 7B | 13 | 859 ± 9.29 | 9 | 1029 ± 14.42 |
| Mistral 7B | 14 | 821 ± 8.97 | 13 | 891 ± 12.85 |
| Llama-2 7B | 15 | 800 ± 0.0 | 14 | 800 ± 0.0 |
Pariksha - MLE Elo for Gujarati

| Model | Rank (Human) | Elo Rating (Human) | Rank (LLM) | Elo Rating (LLM) |
| --- | --- | --- | --- | --- |
| GPT-4o | 1 | 1399 ± 15.59 | 1 | 1787 ± 25.14 |
| Llama-3 70B | 2 | 1360 ± 13.1 | 3 | 1704 ± 22.18 |
| GPT-4 | 3 | 1286 ± 11.68 | 4 | 1675 ± 20.88 |
| SamwaadLLM | 4 | 1246 ± 11.78 | 2 | 1748 ± 23.56 |
| AryaBhatta-Llama3GenZ | 5 | 1126 ± 10.1 | 5 | 1441 ± 19.08 |
| Navarasa | 6 | 1113 ± 12.37 | 7 | 1237 ± 19.66 |
| AryaBhatta-GemmaOrca | 7 | 1108 ± 11.7 | 6 | 1285 ± 21.41 |
| AryaBhatta-GemmaUltra | 8 | 1061 ± 10.21 | 10 | 1175 ± 19.59 |
| GPT-3.5-Turbo | 9 | 1042 ± 11.21 | 9 | 1223 ± 18.16 |
| Llama-3 8B | 10 | 995 ± 9.55 | 8 | 1235 ± 18.73 |
| Gemma 7B | 11 | 815 ± 8.83 | 11 | 1028 ± 16.83 |
| Llama-2 7B | 12 | 800 ± 0.0 | 12 | 800 ± 0.0 |
| Mistral 7B | 13 | 797 ± 8.24 | 13 | 747 ± 12.67 |
Pariksha - MLE Elo for Hindi

| Model | Rank (Human) | Elo Rating (Human) | Rank (LLM) | Elo Rating (LLM) |
| --- | --- | --- | --- | --- |
| GPT-4o | 1 | 1607 ± 16.12 | 1 | 1769 ± 20.48 |
| Aya-23 35B | 2 | 1549 ± 14.69 | 3 | 1597 ± 16.51 |
| SamwaadLLM | 3 | 1521 ± 14.49 | 4 | 1575 ± 18.22 |
| Llama-3 70B | 4 | 1457 ± 10.97 | 6 | 1440 ± 14.49 |
| Gemini-Pro 1.0 | 5 | 1454 ± 12.79 | 2 | 1618 ± 18.73 |
| GPT-4 | 6 | 1407 ± 13.03 | 5 | 1446 ± 15.92 |
| AryaBhatta-GemmaOrca | 7 | 1278 ± 12.07 | 11 | 1169 ± 14.37 |
| AryaBhatta-GemmaUltra | 8 | 1260 ± 12.4 | 10 | 1172 ± 13.96 |
| Navarasa | 9 | 1259 ± 12.59 | 9 | 1192 ± 14.48 |
| AryaBhatta-Llama3GenZ | 10 | 1225 ± 10.79 | 7 | 1240 ± 13.45 |
| AryaBhatta-GemmaGenZ | 11 | 1205 ± 11.82 | 14 | 1065 ± 14.4 |
| Llama-3 8B | 12 | 1177 ± 10.64 | 12 | 1161 ± 13.64 |
| Llamavaad | 13 | 1169 ± 12.2 | 8 | 1238 ± 15.17 |
| Gajendra | 14 | 1158 ± 9.78 | 13 | 1153 ± 15.76 |
| Airavata | 15 | 1129 ± 11.95 | 17 | 996 ± 14.63 |
| Gemma 7B | 16 | 1070 ± 11.79 | 15 | 1034 ± 12.62 |
| GPT-3.5-Turbo | 17 | 1024 ± 12.76 | 16 | 996 ± 14.75 |
| Open-Aditi | 18 | 944 ± 11.24 | 18 | 939 ± 13.36 |
| Mistral 7B | 19 | 921 ± 11.98 | 19 | 830 ± 14.48 |
| Llama-2 7B | 20 | 800 ± 0.0 | 20 | 800 ± 0.0 |
Pariksha - MLE Elo for Kannada

| Model | Rank (Human) | Elo Rating (Human) | Rank (LLM) | Elo Rating (LLM) |
| --- | --- | --- | --- | --- |
| Llama-3 70B | 1 | 1420 ± 18.35 | 2 | 1571 ± 18.88 |
| AryaBhatta-GemmaOrca | 2 | 1406 ± 18.03 | 5 | 1465 ± 19.95 |
| AryaBhatta-GemmaUltra | 3 | 1395 ± 15.7 | 4 | 1520 ± 19.85 |
| GPT-4o | 4 | 1337 ± 16.62 | 1 | 1676 ± 18.78 |
| GPT-4 | 5 | 1328 ± 17.52 | 3 | 1560 ± 17.8 |
| Kan-Llama | 6 | 1286 ± 16.44 | 9 | 1298 ± 17.18 |
| Navarasa | 7 | 1285 ± 16.56 | 6 | 1379 ± 16.73 |
| AryaBhatta-Llama3GenZ | 8 | 1261 ± 15.03 | 7 | 1352 ± 16.73 |
| Ambari | 9 | 1246 ± 15.25 | 11 | 1218 ± 16.42 |
| Llama-3 8B | 10 | 1246 ± 15.34 | 8 | 1331 ± 16.02 |
| GPT-3.5-Turbo | 11 | 1162 ± 15.08 | 10 | 1223 ± 14.43 |
| Gemma 7B | 12 | 967 ± 14.35 | 12 | 1088 ± 15.25 |
| Mistral 7B | 13 | 847 ± 16.92 | 13 | 864 ± 15.21 |
| Llama-2 7B | 14 | 800 ± 0.0 | 14 | 800 ± 0.0 |
Pariksha - MLE Elo for Malayalam

| Model | Rank (Human) | Elo Rating (Human) | Rank (LLM) | Elo Rating (LLM) |
| --- | --- | --- | --- | --- |
| GPT-4o | 1 | 1332 ± 13.5 | 1 | 1777 ± 21.68 |
| Llama-3 70B | 2 | 1271 ± 11.21 | 3 | 1484 ± 16.7 |
| AryaBhatta-GemmaOrca | 3 | 1216 ± 12.55 | 4 | 1361 ± 16.75 |
| GPT-4 | 4 | 1200 ± 11.42 | 2 | 1660 ± 23.11 |
| Navarasa | 5 | 1195 ± 11.04 | 5 | 1299 ± 17.24 |
| AryaBhatta-GemmaUltra | 6 | 1150 ± 11.38 | 8 | 1246 ± 17.02 |
| abhinand-Malayalam | 7 | 1134 ± 10.64 | 7 | 1249 ± 17.1 |
| MalayaLLM | 8 | 1082 ± 9.65 | 10 | 1208 ± 15.95 |
| AryaBhatta-Llama3GenZ | 9 | 1080 ± 9.11 | 6 | 1261 ± 14.73 |
| Llama-3 8B | 10 | 991 ± 10.94 | 9 | 1209 ± 14.03 |
| GPT-3.5-Turbo | 11 | 859 ± 8.73 | 11 | 1078 ± 15.92 |
| Gemma 7B | 12 | 831 ± 8.0 | 12 | 975 ± 15.71 |
| Mistral 7B | 13 | 819 ± 7.65 | 14 | 788 ± 13.46 |
| Llama-2 7B | 14 | 800 ± 0.0 | 13 | 800 ± 0.0 |
Pariksha - MLE Elo for Marathi

| Model | Rank (Human) | Elo Rating (Human) | Rank (LLM) | Elo Rating (LLM) |
| --- | --- | --- | --- | --- |
| GPT-4o | 1 | 1416 ± 16.63 | 1 | 1845 ± 24.03 |
| Llama-3 70B | 2 | 1279 ± 15.11 | 3 | 1592 ± 22.7 |
| GPT-4 | 3 | 1138 ± 9.34 | 2 | 1628 ± 22.97 |
| SamwaadLLM | 4 | 1018 ± 9.63 | 4 | 1458 ± 22.44 |
| Navarasa | 5 | 994 ± 8.76 | 5 | 1303 ± 16.79 |
| Llama-3 8B | 6 | 929 ± 8.98 | 6 | 1199 ± 18.68 |
| Misal | 7 | 893 ± 8.2 | 9 | 988 ± 15.5 |
| GPT-3.5-Turbo | 8 | 865 ± 7.3 | 7 | 1199 ± 16.66 |
| AryaBhatta-Llama3GenZ | 9 | 828 ± 7.01 | 10 | 922 ± 17.13 |
| Mistral 7B | 10 | 808 ± 6.36 | 11 | 890 ± 14.68 |
| Llama-2 7B | 11 | 800 ± 0.0 | 12 | 800 ± 0.0 |
| Gemma 7B | 12 | 798 ± 6.69 | 8 | 1033 ± 16.48 |
Pariksha - MLE Elo for Punjabi

| Model | Rank (Human) | Elo Rating (Human) | Rank (LLM) | Elo Rating (LLM) |
| --- | --- | --- | --- | --- |
| GPT-4o | 1 | 1315 ± 13.65 | 1 | 1782 ± 25.54 |
| Llama-3 70B | 2 | 1308 ± 14.35 | 2 | 1736 ± 22.23 |
| GPT-4 | 3 | 1258 ± 11.55 | 3 | 1725 ± 21.35 |
| Navarasa | 4 | 1001 ± 7.48 | 6 | 1351 ± 17.74 |
| AryaBhatta-GemmaUltra | 5 | 996 ± 9.27 | 10 | 1272 ± 17.76 |
| AryaBhatta-GemmaOrca | 6 | 958 ± 7.82 | 7 | 1311 ± 18.26 |
| SamwaadLLM | 7 | 951 ± 9.02 | 4 | 1460 ± 19.91 |
| GPT-3.5-Turbo | 8 | 913 ± 7.13 | 8 | 1307 ± 18.67 |
| Llama-3 8B | 9 | 902 ± 7.1 | 9 | 1301 ± 18.22 |
| AryaBhatta-Llama3GenZ | 10 | 892 ± 8.46 | 5 | 1384 ± 18.88 |
| Gemma 7B | 11 | 807 ± 6.05 | 11 | 1018 ± 14.37 |
| Mistral 7B | 12 | 804 ± 6.81 | 13 | 777 ± 14.38 |
| Llama-2 7B | 13 | 800 ± 0.0 | 12 | 800 ± 0.0 |
Pariksha - MLE Elo for Odia

| Model | Rank (Human) | Elo Rating (Human) | Rank (LLM) | Elo Rating (LLM) |
| --- | --- | --- | --- | --- |
| GPT-4o | 1 | 1371 ± 14.76 | 1 | 1676 ± 18.56 |
| Llama-3 70B | 2 | 1303 ± 12.12 | 3 | 1429 ± 15.77 |
| Navarasa | 3 | 1232 ± 11.47 | 4 | 1313 ± 16.07 |
| AryaBhatta-GemmaOrca | 4 | 1221 ± 11.32 | 5 | 1312 ± 16.51 |
| AryaBhatta-GemmaUltra | 5 | 1191 ± 10.69 | 9 | 1220 ± 14.25 |
| GPT-4 | 6 | 1171 ± 11.67 | 2 | 1516 ± 14.56 |
| AryaBhatta-Llama3GenZ | 7 | 1084 ± 9.49 | 8 | 1228 ± 14.01 |
| Llama-3 8B | 8 | 1064 ± 8.78 | 7 | 1244 ± 13.12 |
| SamwaadLLM | 9 | 983 ± 9.61 | 6 | 1250 ± 14.18 |
| GPT-3.5-Turbo | 10 | 926 ± 9.71 | 10 | 1180 ± 13.17 |
| OdiaGenAI-Odia | 11 | 887 ± 8.38 | 11 | 942 ± 12.35 |
| Llama-2 7B | 12 | 800 ± 0.0 | 12 | 800 ± 0.0 |
| Mistral 7B | 13 | 796 ± 7.44 | 13 | 799 ± 10.55 |
| Gemma 7B | 14 | 780 ± 8.14 | 14 | 633 ± 12.43 |
Pariksha - MLE Elo for Tamil

| Model | Rank (Human) | Elo Rating (Human) | Rank (LLM) | Elo Rating (LLM) |
| --- | --- | --- | --- | --- |
| Llama-3 70B | 1 | 1342 ± 11.52 | 5 | 1520 ± 19.02 |
| GPT-4o | 2 | 1287 ± 12.37 | 1 | 1703 ± 21.88 |
| AryaBhatta-GemmaOrca | 3 | 1271 ± 10.5 | 4 | 1531 ± 21.17 |
| AryaBhatta-GemmaUltra | 4 | 1258 ± 12.15 | 7 | 1478 ± 21.58 |
| Navarasa | 5 | 1221 ± 9.7 | 3 | 1541 ± 22.22 |
| GPT-4 | 6 | 1176 ± 9.67 | 6 | 1519 ± 19.36 |
| AryaBhatta-Llama3GenZ | 7 | 1142 ± 11.18 | 8 | 1377 ± 19.37 |
| abhinand-Tamil | 8 | 1126 ± 10.09 | 2 | 1559 ± 21.2 |
| SamwaadLLM | 9 | 1054 ± 9.53 | 9 | 1362 ± 20.22 |
| Llama-3 8B | 10 | 1043 ± 10.39 | 10 | 1177 ± 18.64 |
| Gemma 7B | 11 | 940 ± 9.61 | 11 | 1166 ± 18.55 |
| GPT-3.5-Turbo | 12 | 932 ± 8.9 | 12 | 1126 ± 17.95 |
| Mistral 7B | 13 | 819 ± 9.41 | 14 | 697 ± 13.77 |
| Llama-2 7B | 14 | 800 ± 0.0 | 13 | 800 ± 0.0 |
Pariksha - MLE Elo for Telugu

| Model | Rank (Human) | Elo Rating (Human) | Rank (LLM) | Elo Rating (LLM) |
| --- | --- | --- | --- | --- |
| Llama-3 70B | 1 | 1313 ± 11.74 | 3 | 1565 ± 17.29 |
| GPT-4o | 2 | 1294 ± 12.35 | 2 | 1625 ± 17.26 |
| AryaBhatta-GemmaOrca | 3 | 1276 ± 12.74 | 4 | 1515 ± 15.96 |
| AryaBhatta-GemmaUltra | 4 | 1258 ± 12.96 | 6 | 1492 ± 16.83 |
| Navarasa | 5 | 1184 ± 11.61 | 5 | 1503 ± 16.93 |
| GPT-4 | 6 | 1154 ± 9.98 | 1 | 1634 ± 17.24 |
| Llama-3 8B | 7 | 1100 ± 11.59 | 10 | 1336 ± 14.95 |
| AryaBhatta-Llama3GenZ | 8 | 1089 ± 10.07 | 8 | 1383 ± 12.92 |
| SamwaadLLM | 9 | 1074 ± 10.21 | 7 | 1433 ± 15.88 |
| abhinand-Telugu | 10 | 1040 ± 10.55 | 9 | 1341 ± 17.27 |
| GPT-3.5-Turbo | 11 | 834 ± 8.12 | 12 | 1193 ± 15.14 |
| Llama-2 7B | 12 | 800 ± 0.0 | 14 | 800 ± 0.0 |
| TLL-Telugu | 13 | 798 ± 7.47 | 13 | 868 ± 10.85 |
| Mistral 7B | 14 | 784 ± 6.67 | 15 | 785 ± 10.3 |
| Gemma 7B | 15 | 784 ± 7.11 | 11 | 1261 ± 16.11 |
📊 Summary of Data Points
The tables below summarize the number of models, languages and data points included in the Pariksha Pilot leaderboard.
What is a Data Point?
A data point is a single battle: an evaluator is shown a prompt along with responses from two LLMs and asked to pick the better response or declare a tie.
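A battle of this kind can be represented as a small record; the field names below are illustrative, not the actual Pariksha data schema:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Battle:
    """One Pariksha data point (illustrative schema)."""
    prompt_id: str
    language: str
    model_a: str
    model_b: str
    verdict: str  # "A", "B", or "TIE"

battles = [
    Battle("p1", "Hindi", "GPT-4o", "Llama-3 70B", "A"),
    Battle("p2", "Hindi", "Navarasa", "Gemma 7B", "TIE"),
    Battle("p1", "Tamil", "GPT-4", "Mistral 7B", "B"),
]

# Data points per language, as summarized in the tables.
per_language = Counter(b.language for b in battles)
print(per_language)  # Counter({'Hindi': 2, 'Tamil': 1})
```

Counting records per language in this way is how the totals in the summary tables are obtained: each row of the underlying preference data is exactly one such battle.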
📚 References

Kabir Ahuja et al. 2023. MEGA: Multilingual Evaluation of Generative AI. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4232–4267, Singapore. Association for Computational Linguistics.

Rishav Hada et al. 2024. Are Large Language Model-Based Evaluators the Solution to Scaling Up Multilingual Evaluation? In Findings of the Association for Computational Linguistics: EACL 2024, pages 1051–1070, St. Julian's, Malta. Association for Computational Linguistics.