We will build a leaderboard by combining two workstreams: research and engineering.
The leaderboard needs several components to function, as listed below.
Research team that identifies what goes into the benchmarks
A team that deliberates on what goes into the evaluations, which dimensions to measure, and what the datasets should look like.
Engineering team that builds and maintains leaderboards
Builds a system that computes the evaluations, maintains the datasets and the APIs, and creates the dashboards that present the results.
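As a rough sketch of what the scoring layer behind such a dashboard could look like, the snippet below aggregates per-dataset scores into a ranked leaderboard. The `EvalResult` class and the dataset names are hypothetical placeholders for illustration, not the actual system design.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class EvalResult:
    model: str
    dataset: str
    score: float  # higher is better in this sketch

def build_leaderboard(results: list[EvalResult]) -> list[tuple[str, float]]:
    """Average each model's score across datasets and rank descending."""
    by_model: dict[str, list[float]] = {}
    for r in results:
        by_model.setdefault(r.model, []).append(r.score)
    ranked = [(model, mean(scores)) for model, scores in by_model.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    demo = [
        EvalResult("model-a", "hindi-news", 0.71),
        EvalResult("model-a", "tamil-calls", 0.64),
        EvalResult("model-b", "hindi-news", 0.69),
        EvalResult("model-b", "tamil-calls", 0.73),
    ]
    for rank, (model, score) in enumerate(build_leaderboard(demo), start=1):
        print(rank, model, round(score, 3))
```

In practice the aggregation would likely weight datasets and account for lower-is-better metrics such as error rates, but the ranking logic keeps this general shape.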
Human evaluation efforts
Human evaluation is the most reliable source of annotation and dataset curation. Beyond automated evaluations, we believe the right incentive framework and skilled data workers will create the best outcomes.
Funding
We will fund all the engineering and research ourselves to maintain trust and neutrality. We are open to external funding for the human evaluation efforts or for the compute that powers GEM.
[WG] Realistic ASR Evaluations
ASR is part of the voice-to-voice toolchain that we believe will be the primary use case for Indian language models. ASR evaluations have not been reliable or connected to the real-world usability of Indic models; that is the problem we are trying to solve.
To solve this, we are connecting the work of identifying the metrics that matter with an engineered product that computes and reports them.
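To make "the metrics needed" concrete, here is a minimal sketch of the two most common ASR metrics, word error rate (WER) and character error rate (CER), computed with a plain Levenshtein distance. CER is shown alongside WER because word-level scoring alone can be misleading for Indic scripts; this is only an illustration, not the working group's agreed metric set.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / max(len(ref_chars), 1)

if __name__ == "__main__":
    ref = "नमस्ते आप कैसे हैं"
    hyp = "नमस्ते आप कैसा है"
    print(f"WER: {wer(ref, hyp):.2f}, CER: {cer(ref, hyp):.2f}")
```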
Research on use cases
As a first step, we are listing the domains and use cases we have identified where ASR systems are in demand.
We are then finding partners who can provide this data and agreeing with them on a data-sharing pipeline.
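As an illustration of what such a data-sharing agreement might start from, the sketch below shows a hypothetical contribution manifest a partner could fill in. Every field name and value here is a placeholder, not an agreed schema.

```python
from dataclasses import dataclass

@dataclass
class DatasetContribution:
    """Hypothetical manifest a partner fills in before sharing ASR data."""
    partner: str          # organisation contributing the data
    domain: str           # e.g. "agriculture helpline", "telemedicine"
    languages: list[str]  # ISO 639-1 codes, e.g. ["hi", "ta"]
    hours_of_audio: float
    has_transcripts: bool
    license: str          # usage terms agreed with the partner
    notes: str = ""

contribution = DatasetContribution(
    partner="example-partner",
    domain="agriculture helpline",
    languages=["hi", "mr"],
    hours_of_audio=120.0,
    has_transcripts=True,
    license="CC-BY-4.0 (illustrative)",
)
print(contribution)
```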