We will build a leaderboard by combining two workstreams: research and engineering.
The leaderboard needs several components to function, as listed below.
Research team that identifies what goes into the benchmarks
A team that deliberates on what goes into the evaluations, which dimensions to measure, and what the datasets should look like.
Engineering team that builds and maintains leaderboards
Builds a system that computes the evaluations, maintains the datasets and the APIs, and creates the dashboards that present the results.
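As a rough sketch of what the scoring layer behind such a dashboard could look like, the snippet below aggregates per-dataset scores into a ranked leaderboard. The `EvalResult` class and the dataset names are hypothetical placeholders for illustration, not the actual system design.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class EvalResult:
    model: str
    dataset: str
    score: float  # higher is better in this sketch

def build_leaderboard(results: list[EvalResult]) -> list[tuple[str, float]]:
    """Average each model's score across datasets and rank descending."""
    by_model: dict[str, list[float]] = {}
    for r in results:
        by_model.setdefault(r.model, []).append(r.score)
    ranked = [(model, mean(scores)) for model, scores in by_model.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    demo = [
        EvalResult("model-a", "hindi-news", 0.71),
        EvalResult("model-a", "tamil-calls", 0.64),
        EvalResult("model-b", "hindi-news", 0.69),
        EvalResult("model-b", "tamil-calls", 0.73),
    ]
    for rank, (model, score) in enumerate(build_leaderboard(demo), start=1):
        print(rank, model, round(score, 3))
```

In practice the aggregation would likely weight datasets and account for lower-is-better metrics such as error rates, but the ranking logic keeps this general shape.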
Human evaluation efforts
Human evaluation is the most reliable source of annotation and dataset curation. Beyond automated evaluations, we believe the right incentive framework and skilled data workers will create the best outcomes.
Funding
We will fund all the engineering and research ourselves to maintain trust and neutrality. We are open to external funding for the human evaluation efforts or for the compute that powers GEM.
[WG] Realistic ASR Evaluations
ASR is part of the voice-to-voice toolchain that we believe will be the primary use case for Indian language models. ASR evaluations have not been reliable or connected to the real-world usability of Indic models; that is the problem we are trying to solve.
To solve this, we are connecting the work of identifying the metrics that matter with an engineered product that computes and reports them.
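To make "the metrics needed" concrete, here is a minimal sketch of the two most common ASR metrics, word error rate (WER) and character error rate (CER), computed with a plain Levenshtein distance. CER is shown alongside WER because word-level scoring alone can be misleading for Indic scripts; this is only an illustration, not the working group's agreed metric set.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / max(len(ref_chars), 1)

if __name__ == "__main__":
    ref = "नमस्ते आप कैसे हैं"
    hyp = "नमस्ते आप कैसा है"
    print(f"WER: {wer(ref, hyp):.2f}, CER: {cer(ref, hyp):.2f}")
```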
Research on use cases
As a first step, we are listing the domains and use cases we have identified where ASR systems are in demand.
We are then finding partners who can provide this data and agreeing with them on a data-sharing pipeline.
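As an illustration of what such a data-sharing agreement might start from, the sketch below shows a hypothetical contribution manifest a partner could fill in. Every field name and value here is a placeholder, not an agreed schema.

```python
from dataclasses import dataclass

@dataclass
class DatasetContribution:
    """Hypothetical manifest a partner fills in before sharing ASR data."""
    partner: str          # organisation contributing the data
    domain: str           # e.g. "agriculture helpline", "telemedicine"
    languages: list[str]  # ISO 639-1 codes, e.g. ["hi", "ta"]
    hours_of_audio: float
    has_transcripts: bool
    license: str          # usage terms agreed with the partner
    notes: str = ""

contribution = DatasetContribution(
    partner="example-partner",
    domain="agriculture helpline",
    languages=["hi", "mr"],
    hours_of_audio=120.0,
    has_transcripts=True,
    license="CC-BY-4.0 (illustrative)",
)
print(contribution)
```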