Through conversations with model users, we are trying to answer the following questions:
1. Where are language models being used today?
2. How effective are they? How are you measuring that effectiveness?
3. How are you ranking or choosing the best model for a task?
Way forward
1. We need to identify the tasks that require evaluation.
2. Evaluation should grow beyond academic benchmarks to reflect real user-group requirements; in practice, this means identifying the right metric for each task (see the sketch after this list).
3. Evaluation quality depends largely on the evaluation datasets and on how well they are maintained.
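
Point 2 above, on choosing the right metric per task, can be made concrete with a minimal sketch. The task names, metric choices, and example data below are illustrative assumptions rather than anything prescribed here; the point is only that each task is scored by the metric that matches what "good" means for that task.

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """Strict normalized string equality; a common choice for short-answer QA."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1; more forgiving for free-form generation tasks."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical task-to-metric mapping: the right metric differs by task.
TASK_METRICS = {
    "short_answer_qa": exact_match,  # answers are short and unambiguous
    "summarization": token_f1,       # many valid phrasings, so reward overlap
}

def evaluate(task: str, predictions: list[str], references: list[str]) -> float:
    """Average the task's metric over a (prediction, reference) dataset."""
    metric = TASK_METRICS[task]
    scores = [metric(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    print(evaluate("short_answer_qa", ["New Delhi"], ["new delhi"]))  # -> 1.0
    print(evaluate("summarization",
                   ["rain expected across north india"],
                   ["heavy rain expected in north india"]))           # -> ~0.73
```

Note how the same pair of strings can score very differently under the two metrics; picking the metric is itself an evaluation decision, which is why it has to be made per task.
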
Through conversations with model builders, we are trying to answer the following questions:
1. Where can we improve Indic models to make the greatest population-scale impact in India?
2. What are the capabilities of Indic models today, and where do we need to improve? How do we build better-performing models?
3. How are we measuring model performance today? How effective are models for Indian languages, multimodality, and other use-case needs?
4. How can we make these evaluations more driven by applications, use cases, and real-world value?