12 Phase 1 languages (P1): Assamese, Bengali, Gujarati, Hindi, Kannada, Maithili, Malayalam, Manipuri, Marathi, Sanskrit, Tamil, Urdu.
10 Phase 2 languages (P2): Bodo, Dogri, Kashmiri, Konkani, Nepali, Odia, Punjabi, Santali, Sindhi, Telugu.
Create an MT benchmark containing 10K En-X parallel sentences for P1 languages. Create an ASR benchmark of 50 hours for P1 languages containing (a) read speech, (b) voice commands, (c) transcribed extempore conversations, (d) transcribed news content, (e) transcribed education content, and (f) transcribed entertainment content. Create 10 hours of TTS data for P1 languages. Release synthetic training data containing 100K images each for document layout detection, document text recognition, and scene text recognition for all 22 languages.