In this section, we first provide a quick summary of the current state of data for four important tasks in language technology, viz., machine translation (MT), automatic speech recognition (ASR), text-to-speech (TTS) and optical character recognition (OCR). We then define the desired characteristics of data for each of these tasks, followed by an outline of our plan for collecting data at scale for these tasks. Lastly, we summarise the tasks completed in the last quarter and the tasks that will be taken up in the next quarter.
3.1 Types of data
The aim of the project is to collect the following types of data:
- Pre-training data: This would include raw monolingual corpora for MT and NLU, raw audio data for ASR, and raw (simulated) document/scene images for OCR. This data would largely be scraped from online sources and would require limited manual intervention, following which a small sample of the data would be manually verified.
- (Noisy/mined) training data: This would include (i) mined translation pairs for training MT models, (ii) ASR data scraped from YouTube videos as well as government sources (News On AIR, Prasar Bharati, etc.), (iii) machine-translated training data for Sentiment Analysis, QA and NER, and (iv) simulated document images for OCR. A small portion of this data would be verified by humans to estimate its quality.
- Fine-tuning data: Unfortunately, for some of the very low resource languages it would be infeasible to mine noisy training data from the web (for example, we were able to scrape only 70,000 sentences for Santali from all news sources and Wikipedia). Hence, a small amount of fine-tuning data would be manually created for these languages (as no free training data would be available).
- Benchmark data: This will be clean, high quality, human-created data which will be used for evaluating the models trained using the above data.

The above data would be collected using five different modes: (i) curated from government sources on the web (e.g., News on AIR), (ii) curated from non-government sources on the web (e.g., Times of India), (iii) curated from sources which are free of any copyright (e.g., books whose copyright period has expired), (iv) collected manually using crowdsourcing platforms with the explicit consent of the participants, and (v) collected manually using in-house or outsourced annotators who are explicitly paid for the content. We will contribute all of this data to NLTM's Hundi and NLTM, in turn, can release this data according to its data policy.
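To make this taxonomy concrete, the sketch below shows one way a collected item could be tagged by data type and collection mode. This is purely illustrative (the class and field names are our own, not an actual project API):

```python
from dataclasses import dataclass
from enum import Enum

class DataType(Enum):
    PRE_TRAINING = "pre-training"      # raw monolingual text / audio / images
    MINED_TRAINING = "mined-training"  # noisy mined or scraped training data
    FINE_TUNING = "fine-tuning"        # small, manually created
    BENCHMARK = "benchmark"            # clean, human created, for evaluation

class CollectionMode(Enum):
    GOV_WEB = 1          # e.g., News on AIR
    NON_GOV_WEB = 2      # e.g., Times of India
    COPYRIGHT_FREE = 3   # e.g., books whose copyright has expired
    CROWDSOURCED = 4     # with explicit participant consent
    PAID_ANNOTATORS = 5  # in-house or outsourced, paid for the content

@dataclass
class Record:
    language: str
    task: str                     # "MT", "ASR", "TTS", "OCR", ...
    data_type: DataType
    mode: CollectionMode
    human_verified: bool = False  # True for the verified sample

# Example: a mined MT sentence pair for Santali scraped from a government site
r = Record("Santali", "MT", DataType.MINED_TRAINING, CollectionMode.GOV_WEB)
```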
3.2 Current state of data
The Table below captures the state of data for all the tasks of interest for the 22 constitutionally recognised languages. We list only those sources which are publicly available and have a sizeable amount of data. For MT, there are multiple sources which have already been collated, so we do not list these individual sources separately.

State of Data: March 2022
3.3 Desired characteristics of the data
For each of the four tasks, viz., Machine Translation, Automatic Speech Recognition, Text-to-Speech and Optical Character Recognition, we list the desired characteristics of the data in the Table below.
3.4 Our goals
For each of the four tasks, we describe our goals as well as the details of the data that will be collected.
3.4.1 Machine Translation
We will collect 100K parallel sentences between English and each of the 22 languages. The distribution of these 100K sentences would be as follows:
- 50K English sentences taken from Wikipedia and government sources, spanning 13 different domains, viz., Legal, Government, History, Geography, Tourism, STEM, Religion, Business, Sports, Entertainment, Health, Culture and News. These sentences would be translated into all the 22 languages to create n-way parallel data. This will ensure that the parallel data has diversity in domains and contains formally written content.
- 30K English sentences from daily conversations in the Indian context across 20 different domains (e.g., railway stations, Indian tourist spots, etc.). These sentences would be translated into all the 22 languages to create n-way parallel data. This will ensure that the parallel data has diversity in domains and contains informally written content with a focus on everyday conversations (a primary use case of speech-to-speech translation systems).
- 5K English sentences corresponding to reviews of 500 popular products, which will be translated into all the 22 languages to create n-way parallel data. This will ensure that the parallel data has some commercial content and diversity in writing style.
- 10K English sentences taken from government acts and policies, which will be translated into all the 22 languages to create n-way parallel data. This will ensure representation of content that is typically translated by government bodies.
- 5K sentences per language taken from books which were originally written in that regional language. For each of the 22 languages, these 5K sentences will be translated into English (this data will not be n-way parallel).

10% of the above data will be reserved as benchmark data and the rest will be used as training data; these totals are worked out in the sketch below.
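As a sanity check on the numbers above, a small back-of-the-envelope computation (a sketch; we assume the 10% benchmark fraction is applied uniformly across all five buckets):

```python
# Per-language-pair sentence counts from the plan above
wiki_gov = 50_000        # formal content, 13 domains, n-way parallel
conversations = 30_000   # everyday conversations, 20 domains, n-way parallel
reviews = 5_000          # reviews of 500 popular products, n-way parallel
acts_policies = 10_000   # government acts and policies, n-way parallel
books = 5_000            # regional-language books -> English (not n-way)

total_per_pair = wiki_gov + conversations + reviews + acts_policies + books
assert total_per_pair == 100_000   # matches the stated 100K goal

benchmark = int(0.10 * total_per_pair)  # 10K reserved for evaluation
training = total_per_pair - benchmark   # 90K available for training

# Across all 22 English-X language pairs
print(total_per_pair * 22)  # 2,200,000 translated sentences overall
```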
Summary of MT data collection goals
3.4.2 Automatic Speech Recognition
For collecting labeled data for training ASR models, we will adopt two strategies: (i) collect data from the field to ensure speaker diversity and coverage of specific content which is hard to obtain elsewhere (e.g., voice commands) (ii) label existing data from news, entertainment and educational content.
Collecting data from the field: We will collect data from 600 speakers spread across districts wherein each speaker will:
- Read 100 sentences (~10 minutes)
- Speak 200 voice commands (~10 minutes)
- Participate in an extempore get-to-know-me interview (~10 minutes)
- Read 100 English sentences (only for the subset of speakers who also speak English; this will ensure that we also collect Indian-accented English data in the field)

This will ensure that we collect data which has (i) high speaker diversity (number and variety), (ii) high content diversity (the 100 sentences will come from a larger pool of 50,000 diverse sentences from different domains), and (iii) high downstream applicability (voice commands catering to a variety of use cases). The collection volume implied by these numbers is sketched below.
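A rough estimate of the field-collection volume (a sketch; we assume the 600 speakers are recruited per language, that every speaker completes the three core activities, and we ignore the optional English readings):

```python
speakers = 600
minutes_per_speaker = 10 + 10 + 10  # read sentences + voice commands + interview

total_hours = speakers * minutes_per_speaker / 60
print(total_hours)  # 300.0 hours of field-collected audio per language
```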
Labelling existing audio/video data: We will label existing data from YouTube and the content/media industry. This data can be further split into the following types:
- News: This will be primarily sourced from news channels and can be further categorised into the following types:
  - Headlines: Content of the type "Top 20 headlines of the hour", which does not have high speaker diversity but has peculiar characteristics like jarring background music.
  - On-field reporting: Content of the type "cameraman Prakash ke saath…" ("with cameraman Prakash…"), which is extempore, has background noise and involves common people on the ground.
  - Debates: Such content would have diversity in topics (CAA, killing of X, foreign policy of India, etc.) and will also have peculiar characteristics like emotional outbursts, overlapping chatter, etc.
  - Interviews: Such content would involve a news anchor and 1-2 experts and caters to a variety of topics. The experts do not follow a script, so the content has the flavour of natural speech.
  - Special reports: Such content involves people on the ground and has good vocabulary spanning multiple domains.
- Entertainment: This will be primarily sourced from entertainment channels and would include content from different genres: family shows, comedy shows, crime shows, reality shows, cooking shows, travel shows and songs.
- Education: This will be primarily sourced from education channels and would contain content from STEM, Health and How-to videos.
- Call-centre: This will be primarily sourced from call centres catering to one or more of the following domains: agriculture, legal, banking, insurance and health.
Our ASR data collection goals are summarised in the Table below.
Summary of ASR data collection goals
3.4.3 Text-to-speech
The Text-to-speech data will be collected with the help of professional voice artists hired through a production studio. For each language, we will collect 20 hours of data from a male artist and 20 hours of data from a female artist. The artists will be given prompts from multiple domains. The prompts will be derived from the following two sources:
- Source original content from books: As mentioned earlier (Section 3.4.1), we will be sourcing around 5,000 sentences from books which were originally written in the native language. In addition to translating these sentences into English, we will also use them as prompts for the voice artists. Such content taken from books is typically rich in sentence structure, vocabulary and emotion, and is hence ideal for recording by professional artists.
- Translations from multiple domains: As mentioned earlier (Section 3.4.1), we will be translating English sentences taken from multiple domains into the regional languages. These translated sentences will contain diverse content from multiple domains and will be provided as prompts for the voice artists. This will ensure that the recorded content has a good representation of domains and broader coverage of vocabulary.

The total recording volume implied by these goals is sketched below.

Summary of TTS Data Collection Goals
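As a quick tally of the TTS recording goal (a sketch; we assume 22 languages, each with one male and one female artist recording 20 hours each, as stated above):

```python
languages = 22
hours_per_artist = 20
artists_per_language = 2  # one male, one female

total_hours = languages * artists_per_language * hours_per_artist
print(total_hours)  # 880 hours of studio-quality TTS recordings
```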
3.4.4 Optical Character Recognition
Our focus will be on creating datasets for two tasks:
- Scene Text Recognition: We will create a benchmark of 10,000 images for each language. These would be natural images having text written in (i) a wide variety of fonts, colours, designs and sizes, (ii) different orientations (straight, angular, circular, etc.), and (iii) multiple languages within the same image.
- Layout Detection: For this, we will create simulated data wherein different layout templates are created and regional language content is inserted into these layouts (a generation sketch follows the goal tables below). These layouts would (i) have a wide variety of fonts, (ii) have content which is italicised/bold, (iii) have document structure such as sections, sub-sections, indentation, paragraphs and bullet points, (iv) contain figures (the text in the figures should also be recognised), and (v) contain tables with multiple columns, where some columns have multi-column headings and some cells span multiple rows. We will focus only on machine-generated PDFs and not scanned PDFs.

Summary of OCR (scene) goals
Viewpoint of Photographer
Summary of OCR (layout detection) goals
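To illustrate how such simulated layout data could be generated, below is a minimal sketch using Pillow (our illustrative choice of library; the font path, template and text are hypothetical placeholders, and a real pipeline would add tables, figures and richer templates):

```python
from PIL import Image, ImageDraw, ImageFont

def render_page(text_blocks, font_path, page_size=(1240, 1754)):
    """Render labelled text blocks onto a blank page and return the image
    along with bounding-box ground truth for layout detection."""
    page = Image.new("RGB", page_size, "white")
    draw = ImageDraw.Draw(page)
    # Complex-script shaping may require Pillow built with libraqm
    font = ImageFont.truetype(font_path, size=28)

    ground_truth = []
    y = 60
    for label, text in text_blocks:  # e.g., ("heading", "..."), ("paragraph", "...")
        x = 60
        bbox = draw.textbbox((x, y), text, font=font)  # rendered extent of the block
        draw.text((x, y), text, font=font, fill="black")
        ground_truth.append({"label": label, "bbox": bbox})
        y = bbox[3] + 30  # move below the block just drawn

    return page, ground_truth

# Hypothetical usage with a Devanagari-capable font
page, gt = render_page(
    [("heading", "अनुभाग १"), ("paragraph", "यह एक उदाहरण वाक्य है।")],
    font_path="NotoSansDevanagari-Regular.ttf",
)
page.save("simulated_page.png")
```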
3.4.5 Summary
The chart below summarises our goals as well as the flow of data collection.
3.5 How will we achieve these goals?
To achieve these goals, we are taking a 4-pronged approach.
Recruiting in-house translators: For all the 22 languages, we are hiring a team of 5 junior language experts (translators/annotators/transcribers) and 2 senior language experts. These experts will be directly on the payroll of AI4Bharat.
Partnering with universities: For Kashmiri, Urdu, Konkani and Marathi, we will partner with specific academic institutes (Goa University, Mumbai University and Kashmir University).
Partnering with the social sector: For 8 languages (Assamese, Bodo, Dogri, Maithili, Manipuri, Nepali, Sanskrit, Santali) we have partnered with entities or individuals working in the social sector.
Outsourcing to data collection agencies: For TTS, where we need to collect studio-quality data from professional voice artists, we will be outsourcing the activity to 3S Studio (recommended by our colleagues at IITM, who have done a fair amount of data collection with them in the past). Similarly, for voice collection for 11 languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu, Urdu), we will be partnering with an external data collection agency, whose help will be needed in collecting voice samples from every district in the country. In addition, for some languages where it is difficult to find language experts, we may partner with data collection agencies to collect translations as well.
The table below summarises our plan for data collection.
Summary of Plan for Data collection
3.6 What are our timelines?
Our timelines for data collection are summarised in the Table below.
Appendix
A.1 Guidelines for MT data collection
We have requested Prof. Pushpak to share guidelines from past projects.
A.2 Guidelines for ASR data collection
We have requested Prof. Umesh to share guidelines from past projects.
A.3 Guidelines for TTS data collection
We have requested Prof. Hema to share guidelines from past projects.
A.4 Guidelines for OCR data collection