Leaderboard: Better evaluations for Indic Models

Glocal Evaluation of Models (GEM): Building a better evaluation tool - a benchmark and leaderboard

This will help us understand where Indic models stand today.
It will give us a better understanding of the gaps to improve models: on which benchmarks do models score lower than others?
It will create a fair, trustworthy and useful evaluation that helps model users pick the right models for their use-cases.

We will build the leaderboard by combining two workstreams: research and engineering.

The leaderboard needs multiple parts to function, as listed below.

Research team that identifies what goes into the benchmarks

A team that deliberates on what goes into the evaluations, what the dimensions are, and what the datasets should look like.

Engineering team that builds and maintains leaderboards

Builds a system that computes the evaluations, maintains the datasets and the APIs, and creates the dashboards that present the evaluations.
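As an illustration of the scoring step such a system would perform, here is a minimal sketch of aggregating per-benchmark scores into a ranked leaderboard. The model names, benchmark names, and scores below are hypothetical, and a real system would likely use weighted or per-dimension aggregation rather than a plain average:

```python
from statistics import mean


def rank_models(scores: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Average each model's per-benchmark scores and sort descending."""
    averaged = {model: mean(benchmarks.values()) for model, benchmarks in scores.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores on illustrative benchmarks (higher is better).
scores = {
    "model-a": {"hindi-asr": 0.82, "tamil-asr": 0.74},
    "model-b": {"hindi-asr": 0.79, "tamil-asr": 0.81},
}
leaderboard = rank_models(scores)  # model-b averages 0.80, model-a 0.78
```

A flat average is only one design choice; surfacing per-benchmark scores alongside the aggregate is what lets users pick models for their specific use-case.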

Human evaluation efforts

Human evaluation efforts are the most reliable source of annotation and dataset curation. Beyond automated evaluations, we believe the right incentive framework and data workers will create the best outcomes.

Funding

We will fund all the engineering and research of this effort ourselves to maintain trust and neutrality. We are open to external funding for the human evaluation efforts or for the compute that powers GEM.

[WG] Realistic ASR Evaluations

ASR is part of the voice-to-voice toolchain that we believe will be the primary use case for Indian language models. ASR evaluations have not been reliable or connected to the real-world usability of Indic models; that is the problem we are trying to solve.
To solve it, we are connecting the work of identifying the right metrics to an engineered product.
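The standard metric that such evaluations build on is word error rate (WER): the edit distance between a reference transcript and the ASR hypothesis, divided by the reference length. A minimal sketch of its computation follows; note that for many Indic scripts, character error rate is often reported instead, because word segmentation conventions vary:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance via dynamic programming.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
    return dist[len(ref)][len(hyp)] / len(ref)
```

Part of what "realistic" evaluation means here is that a low WER on clean read speech does not guarantee usability on the noisy, code-mixed audio of real use-cases, which is why the domain and use-case research below matters.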

Research on use cases

As a first step, we are listing the domains and use-cases we have identified where ASR systems are in demand.
We are then finding partners who can provide this data and agreeing on a data-sharing pipeline.

Engineering an ASR leaderboard

We are starting by writing a PRD based on existing evaluations.
Timeline: ASR (Aug 2024 to Apr 2025)

Chart a budget
Publish LLM evals and ASR evals research for feedback
Upgrade UI of interactions

Contribute: Volunteer and Open roles

We are open to contributions and feedback. Feel free to leave a comment and let us know you are interested; we will reach out to you.
