Health Pariksha provides an extensive assessment of 24 LLMs, examining their performance on data collected from Indian patients interacting with a medical chatbot in Indian English and four Indic languages.
Key Highlights and Contributions
• Multilingual Evaluation: The study evaluates LLM responses to 750 questions posed by patients using a medical chatbot, covering five languages: Indian English, Hindi, Kannada, Tamil, and Telugu. Our dataset is unique, containing code-mixed queries such as “Agar operation ke baad pain ho raha hai, to kya karna hai?”, “Can I eat before the kanna operation”, and culturally relevant queries such as “Can I eat chapati/puri/non veg after surgery?”.
• Responses validated by doctors: We utilized doctor-validated responses as the ground truth for evaluating model responses.
• Uniform RAG Framework: All models were assessed using a uniform Retrieval Augmented Generation (RAG) framework, ensuring a consistent and fair comparison (a minimal sketch of this setup follows this list).
• Uncontaminated Dataset: The dataset does not appear in the training data of the evaluated models, providing a reliable basis for assessment.
• Specialized Metrics: The evaluation was based on four metrics: factual correctness, semantic similarity, coherence, and conciseness, as well as a combined overall metric, chosen in consultation with domain experts and doctors. Both automated techniques and human evaluators were employed to ensure comprehensive assessment.
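To make the uniform RAG framework concrete, here is a minimal sketch of a retrieve-then-generate loop, assuming cosine-similarity retrieval over a pre-embedded knowledge base and three chunks per query (the coherence metric below refers to three provided chunks). The `embed` and `generate` callables are hypothetical stand-ins for whichever embedding model and LLM under test are plugged in; this is an illustration, not the paper's exact pipeline.

```python
# Sketch of a uniform RAG evaluation loop: retrieve the three most relevant
# knowledge-base chunks for a patient query, then ask the model under test to
# answer using only those chunks. `embed` and `generate` are hypothetical
# stand-ins supplied by the caller.
import numpy as np

def retrieve_top_chunks(query_vec, chunk_vecs, chunks, k=3):
    """Rank knowledge-base chunks by cosine similarity and return the top k."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

def answer_with_rag(query, chunks, chunk_vecs, embed, generate):
    """Run one patient query through the shared retrieve-then-generate pipeline."""
    context = retrieve_top_chunks(embed(query), chunk_vecs, chunks, k=3)
    prompt = (
        "Answer the patient's question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"
    )
    return generate(prompt)
```

Because every model receives the same retrieved context and prompt template, differences in the scores below reflect the models themselves rather than the retrieval step.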
Leaderboard
Summary: Qwen2.5-72B-Instruct is the best-performing model on Indian English queries, followed by Phi-3.5-MoE-Instruct and Mistral-Large-Instruct-2407. Open-Aditi-hi-v4 and Llamavaad also perform well on the task. The paper also contains leaderboards for the other languages.
| Rank | Model | AGG | COH | CON | FC | SS |
|---|---|---|---|---|---|---|
| 1 | QWEN2.5-72B-INSTRUCT | 1.46 | 1.86 | 1.96 | 1.62 | 1.43 |
| 2 | GPT-4 | 1.40 | 1.71 | 1.95 | 1.56 | 1.36 |
| 3 | PHI-3.5-MOE-INSTRUCT | 1.29 | 1.65 | 1.93 | 1.43 | 1.22 |
| 4 | MISTRAL-LARGE-INSTRUCT-2407 | 1.29 | 1.60 | 1.95 | 1.42 | 1.24 |
| 5 | OPEN-ADITI-HI-V4 | 1.27 | 1.69 | 1.85 | 1.37 | 1.22 |
| 6 | LLAMAVAAD | 1.16 | 1.34 | 0.97 | 1.36 | 1.20 |
| 7 | ARYABHATTA-GEMMAGENZ-VIKAS-MERGED | 1.12 | 1.48 | 1.65 | 1.22 | 1.07 |
| 8 | KAN-LLAMA-7B-SFT-V0.5 | 1.01 | 1.39 | 1.64 | 1.07 | 0.97 |
| 9 | GEMMA-2-27B-IT | 1.00 | 1.28 | 1.88 | 1.07 | 0.91 |
| 10 | ARYABHATTA-GEMMAORCA-MERGED | 0.97 | 1.32 | 1.62 | 1.03 | 0.92 |
| 11 | LLAMA3-GAJA-HINDI-8B-V0.1 | 0.91 | 0.63 | 1.65 | 1.09 | 0.98 |
| 12 | GPT-4O | 0.91 | 1.08 | 1.78 | 0.98 | 0.87 |
| 13 | AYA-23-35B | 0.91 | 1.09 | 1.65 | 1.00 | 0.83 |
| 14 | GAJENDRA-V0.1 | 0.88 | 1.21 | 1.38 | 0.93 | 0.85 |
| 15 | C4AI-COMMAND-R-PLUS-08-2024 | 0.82 | 1.15 | 1.48 | 0.85 | 0.74 |
| 16 | TAMIL-LLAMA-7B-INSTRUCT-V0.2 | 0.81 | 1.13 | 1.50 | 0.83 | 0.75 |
| 17 | AIRAVATA | 0.80 | 1.03 | 1.38 | 0.85 | 0.78 |
| 18 | AMBARI-7B-INSTRUCT-V0.2 | 0.73 | 0.86 | 1.11 | 0.76 | 0.82 |
| 19 | META-LLAMA-3.1-70B-INSTRUCT | 0.65 | 0.55 | 1.12 | 0.77 | 0.67 |
| 20 | TELUGU-LLAMA2-7B-V0-INSTRUCT | 0.51 | 0.60 | 1.12 | 0.53 | 0.53 |
| 21 | LLAMA38BGENZ_VIKAS-MERGED | 0.51 | 0.52 | 1.09 | 0.55 | 0.53 |
| 22 | INDIC-GEMMA-7B-FINETUNED-SFT-NAVARASA-2.0 | 0.35 | 0.32 | 0.53 | 0.40 | 0.39 |
| 23 | ARYABHATTA-GEMMAULTRA-MERGED | 0.32 | 0.38 | 1.19 | 0.31 | 0.27 |
| 24 | TELUGU-LLAMA-7B-INSTRUCT-V0.1 | 0.04 | 0.00 | 0.58 | 0.03 | 0.00 |
The leaderboard includes proprietary, open-weights, and Indic models; all Indic models are open weights.
AGGREGATE (AGG): aggregate over all tasks.
FACTUAL CORRECTNESS (FC): Since Doddapaneni et al. (2024) showed that LLM-based evaluators fail to identify subtle factual inaccuracies, we curate a separate metric to double-check facts such as dates, numbers, and procedure and medicine names.
SEMANTIC SIMILARITY (SS): Similarly, we formulate another metric to specifically analyse whether the prediction and the ground-truth response convey the same information semantically, especially when they are in different languages.
COHERENCE (COH): This metric evaluates whether the model was able to stitch together appropriate pieces of information from the three data chunks provided to yield a coherent response.
CONCISENESS (CON): Since the knowledge-base chunks extracted and provided to the model can be quite large, with important facts embedded at different positions, we build this metric to assess the model's ability to extract and compress the information relevant to the query into a crisp response.
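As an illustration of how such metrics can be scored automatically, here is a minimal LLM-as-judge sketch for the factual correctness metric. The 0-2 scale (the leaderboard scores above appear to lie roughly in that range), the rubric wording, and the `call_judge_llm` callable are all assumptions for illustration, not the paper's exact prompts or evaluator.

```python
# Sketch of an LLM-as-judge rubric for Factual Correctness (FC).
# The 0-2 scale and prompt wording are illustrative assumptions;
# `call_judge_llm` is a hypothetical stand-in for the evaluator model.
import re

FC_RUBRIC = """You are evaluating a medical chatbot answer.
Question: {question}
Ground-truth answer (doctor-validated): {reference}
Model answer: {prediction}

Check only verifiable facts (dates, numbers, procedure and medicine names).
Score 2 if all facts match the ground truth, 1 if there are minor factual
slips, and 0 if key facts are wrong or missing. Reply with a single number."""

def score_factual_correctness(question: str, reference: str, prediction: str,
                              call_judge_llm) -> int:
    """Build the rubric prompt, query the judge model, and parse a 0-2 score."""
    prompt = FC_RUBRIC.format(question=question, reference=reference,
                              prediction=prediction)
    raw = call_judge_llm(prompt)                 # e.g. a GPT-4 completion call
    match = re.search(r"[0-2]", raw)
    return int(match.group()) if match else 0    # default to 0 if unparseable
```

The same pattern, with a different rubric, would apply to semantic similarity, coherence, and conciseness; human evaluators complement these automated judgments in the study.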
Key Findings
• Performance Variability: The study finds significant performance variability among models, with some smaller models outperforming larger ones.
• Language-Specific Performance: Indic models do not consistently perform well on Indic-language queries, and factual correctness is generally lower for non-English queries. This shows that there is still work to be done to build models that can reliably answer questions in Indian languages.
• Locally-grounded, non-translated datasets: Our dataset includes many instances of code-switching, Indian English colloquialisms, and culturally specific questions that cannot be obtained by translating existing datasets, particularly with automated translation. While models handled code-switching to a certain extent, responses to culturally relevant questions varied greatly. This underscores the importance of collecting datasets from target populations while building solutions.
Future Directions
We plan to expand this research by incorporating more models and languages, broadening human evaluation, and continuing the development of realistic, culturally grounded benchmarks for multilingual model evaluation.