Through conversations with model users, we are trying to answer the following questions:
1. Where are language models being used today?
2. How effective are they? How are you measuring that effectiveness?
3. How are you ranking or choosing the best model for a task?
Way forward
1. We need to identify the tasks that require evaluation.
2. Evaluation should grow beyond academic benchmarks to reflect real user-group requirements; in practice, this means identifying the right metric for each task (see the sketch after this list).
3. Evaluation quality depends largely on the evaluation datasets and on how well they are maintained.
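
Point 2 above, on choosing the right metric per task, can be made concrete with a minimal sketch. The task names, metric choices, and example data below are illustrative assumptions rather than anything prescribed here; the point is only that each task is scored by the metric that matches what "good" means for that task.

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """Strict normalized string equality; a common choice for short-answer QA."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1; more forgiving for free-form generation tasks."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical task-to-metric mapping: the right metric differs by task.
TASK_METRICS = {
    "short_answer_qa": exact_match,  # answers are short and unambiguous
    "summarization": token_f1,       # many valid phrasings, so reward overlap
}

def evaluate(task: str, predictions: list[str], references: list[str]) -> float:
    """Average the task's metric over a (prediction, reference) dataset."""
    metric = TASK_METRICS[task]
    scores = [metric(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    print(evaluate("short_answer_qa", ["New Delhi"], ["new delhi"]))  # -> 1.0
    print(evaluate("summarization",
                   ["rain expected across north india"],
                   ["heavy rain expected in north india"]))           # -> ~0.73
```

Note how the same pair of strings can score very differently under the two metrics; picking the metric is itself an evaluation decision, which is why it has to be made per task.
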
Through conversations with model builders, we are trying to answer the following questions:
1. Where can we improve Indic models to make the greatest population-scale impact in India?
2. What are the capabilities of Indic models today, and where do we need to improve? How do we build better-performing models?
3. How are we measuring model performance today? How effective are models for Indian languages, multimodality, and other use-case needs?
4. How can we make these evaluations more driven by applications, use cases, and real-world value?