Health Pariksha provides an extensive assessment of 24 LLMs, examining their performance on data collected from Indian patients interacting with a medical chatbot in Indian English and four Indic languages.
Key Highlights and Contributions
• Multilingual Evaluation: The study evaluates LLM responses to 750 questions posed by patients using a medical chatbot, covering five languages: Indian English, Hindi, Kannada, Tamil, and Telugu. Our dataset is unique, containing code-mixed queries such as “Agar operation ke baad pain ho raha hai, to kya karna hai?”, “Can I eat before the kanna operation”, and culturally relevant queries such as “Can I eat chapati/puri/non veg after surgery?”.
• Responses validated by doctors: We utilized doctor-validated responses as the ground truth for evaluating model responses.
• Uniform RAG Framework: All models were assessed using a uniform Retrieval Augmented Generation (RAG) framework, ensuring a consistent and fair comparison (a minimal sketch of this setup follows this list).
• Uncontaminated Dataset: The dataset does not appear in the training data of the evaluated models, providing a reliable basis for assessment.
• Specialized Metrics: The evaluation was based on four metrics: factual correctness, semantic similarity, coherence, and conciseness, as well as a combined overall metric, chosen in consultation with domain experts and doctors. Both automated techniques and human evaluators were employed to ensure comprehensive assessment.
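To make the uniform RAG framework concrete, here is a minimal sketch of a retrieve-then-generate loop, assuming cosine-similarity retrieval over a pre-embedded knowledge base and three chunks per query (the coherence metric below refers to three provided chunks). The `embed` and `generate` callables are hypothetical stand-ins for whichever embedding model and LLM under test are plugged in; this is an illustration, not the paper's exact pipeline.

```python
# Sketch of a uniform RAG evaluation loop: retrieve the three most relevant
# knowledge-base chunks for a patient query, then ask the model under test to
# answer using only those chunks. `embed` and `generate` are hypothetical
# stand-ins supplied by the caller.
import numpy as np

def retrieve_top_chunks(query_vec, chunk_vecs, chunks, k=3):
    """Rank knowledge-base chunks by cosine similarity and return the top k."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

def answer_with_rag(query, chunks, chunk_vecs, embed, generate):
    """Run one patient query through the shared retrieve-then-generate pipeline."""
    context = retrieve_top_chunks(embed(query), chunk_vecs, chunks, k=3)
    prompt = (
        "Answer the patient's question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"
    )
    return generate(prompt)
```

Because every model receives the same retrieved context and prompt template, differences in the scores below reflect the models themselves rather than the retrieval step.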
Leaderboard
Summary: Qwen2.5-72B-Instruct is the best-performing model on Indian English queries, followed by Phi-3.5-MoE-Instruct and Mistral-Large-Instruct-2407. Open-Aditi-hi-v4 and Llamavaad also perform well on the task. The paper also contains leaderboards for the other languages.
| Rank | Model | AGG | COH | CON | FC | SS |
|---|---|---|---|---|---|---|
| 1 | QWEN2.5-72B-INSTRUCT | 1.46 | 1.86 | 1.96 | 1.62 | 1.43 |
| 2 | GPT-4 | 1.40 | 1.71 | 1.95 | 1.56 | 1.36 |
| 3 | PHI-3.5-MOE-INSTRUCT | 1.29 | 1.65 | 1.93 | 1.43 | 1.22 |
| 4 | MISTRAL-LARGE-INSTRUCT-2407 | 1.29 | 1.60 | 1.95 | 1.42 | 1.24 |
| 5 | OPEN-ADITI-HI-V4 | 1.27 | 1.69 | 1.85 | 1.37 | 1.22 |
| 6 | LLAMAVAAD | 1.16 | 1.34 | 0.97 | 1.36 | 1.20 |
| 7 | ARYABHATTA-GEMMAGENZ-VIKAS-MERGED | 1.12 | 1.48 | 1.65 | 1.22 | 1.07 |
| 8 | KAN-LLAMA-7B-SFT-V0.5 | 1.01 | 1.39 | 1.64 | 1.07 | 0.97 |
| 9 | GEMMA-2-27B-IT | 1.00 | 1.28 | 1.88 | 1.07 | 0.91 |
| 10 | ARYABHATTA-GEMMAORCA-MERGED | 0.97 | 1.32 | 1.62 | 1.03 | 0.92 |
| 11 | LLAMA3-GAJA-HINDI-8B-V0.1 | 0.91 | 0.63 | 1.65 | 1.09 | 0.98 |
| 12 | GPT-4O | 0.91 | 1.08 | 1.78 | 0.98 | 0.87 |
| 13 | AYA-23-35B | 0.91 | 1.09 | 1.65 | 1.00 | 0.83 |
| 14 | GAJENDRA-V0.1 | 0.88 | 1.21 | 1.38 | 0.93 | 0.85 |
| 15 | C4AI-COMMAND-R-PLUS-08-2024 | 0.82 | 1.15 | 1.48 | 0.85 | 0.74 |
| 16 | TAMIL-LLAMA-7B-INSTRUCT-V0.2 | 0.81 | 1.13 | 1.50 | 0.83 | 0.75 |
| 17 | AIRAVATA | 0.80 | 1.03 | 1.38 | 0.85 | 0.78 |
| 18 | AMBARI-7B-INSTRUCT-V0.2 | 0.73 | 0.86 | 1.11 | 0.76 | 0.82 |
| 19 | META-LLAMA-3.1-70B-INSTRUCT | 0.65 | 0.55 | 1.12 | 0.77 | 0.67 |
| 20 | TELUGU-LLAMA2-7B-V0-INSTRUCT | 0.51 | 0.60 | 1.12 | 0.53 | 0.53 |
| 21 | LLAMA38BGENZ_VIKAS-MERGED | 0.51 | 0.52 | 1.09 | 0.55 | 0.53 |
| 22 | INDIC-GEMMA-7B-FINETUNED-SFT-NAVARASA-2.0 | 0.35 | 0.32 | 0.53 | 0.40 | 0.39 |
| 23 | ARYABHATTA-GEMMAULTRA-MERGED | 0.32 | 0.38 | 1.19 | 0.31 | 0.27 |
| 24 | TELUGU-LLAMA-7B-INSTRUCT-V0.1 | 0.04 | 0.00 | 0.58 | 0.03 | 0.00 |
The leaderboard includes proprietary, open-weights, and Indic models; all Indic models are open weights.
AGGREGATE (AGG): aggregate over all tasks.
FACTUAL CORRECTNESS (FC): Since Doddapaneni et al. (2024) showed that LLM-based evaluators fail to identify subtle factual inaccuracies, we curate a separate metric to double-check facts such as dates, numbers, and procedure and medicine names.
SEMANTIC SIMILARITY (SS): Similarly, we formulate another metric to specifically analyse whether the prediction and the ground-truth response convey the same information semantically, especially when they are in different languages.
COHERENCE (COH): This metric evaluates whether the model was able to stitch together appropriate pieces of information from the three data chunks provided to yield a coherent response.
CONCISENESS (CON): Since the knowledge-base chunks extracted and provided to the model can be quite large, with important facts embedded at different positions, we build this metric to assess the model's ability to extract and compress the information relevant to the query into a crisp response.
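As an illustration of how such metrics can be scored automatically, here is a minimal LLM-as-judge sketch for the factual correctness metric. The 0-2 scale (the leaderboard scores above appear to lie roughly in that range), the rubric wording, and the `call_judge_llm` callable are all assumptions for illustration, not the paper's exact prompts or evaluator.

```python
# Sketch of an LLM-as-judge rubric for Factual Correctness (FC).
# The 0-2 scale and prompt wording are illustrative assumptions;
# `call_judge_llm` is a hypothetical stand-in for the evaluator model.
import re

FC_RUBRIC = """You are evaluating a medical chatbot answer.
Question: {question}
Ground-truth answer (doctor-validated): {reference}
Model answer: {prediction}

Check only verifiable facts (dates, numbers, procedure and medicine names).
Score 2 if all facts match the ground truth, 1 if there are minor factual
slips, and 0 if key facts are wrong or missing. Reply with a single number."""

def score_factual_correctness(question: str, reference: str, prediction: str,
                              call_judge_llm) -> int:
    """Build the rubric prompt, query the judge model, and parse a 0-2 score."""
    prompt = FC_RUBRIC.format(question=question, reference=reference,
                              prediction=prediction)
    raw = call_judge_llm(prompt)                 # e.g. a GPT-4 completion call
    match = re.search(r"[0-2]", raw)
    return int(match.group()) if match else 0    # default to 0 if unparseable
```

The same pattern, with a different rubric, would apply to semantic similarity, coherence, and conciseness; human evaluators complement these automated judgments in the study.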
Key Findings
• Performance Variability: The study finds significant performance variability among models, with some smaller models outperforming larger ones.
• Language-Specific Performance: Indic models do not consistently perform well on Indic-language queries, and factual correctness is generally lower for non-English queries. This shows that there is still work to be done to build models that can reliably answer questions in Indian languages.
• Locally-grounded, non-translated datasets: Our dataset includes many instances of code-switching, Indian English colloquialisms, and culturally specific questions that cannot be obtained by translating existing datasets, particularly with automated translation. While models handled code-switching to a certain extent, responses to culturally relevant questions varied greatly. This underscores the importance of collecting datasets from target populations while building solutions.
Future Directions
We plan to expand this research by incorporating more models and languages, broadening human evaluation, and continuing the development of realistic, culturally grounded benchmarks for multilingual model evaluation.