3. Data

In this section, we first provide a quick summary of the current state of data for four important tasks in language technology, viz., machine translation (MT), automatic speech recognition (ASR), text-to-speech (TTS) and optical character recognition (OCR). We then define the desired characteristics of data for each of these tasks, followed by an outline of our plan for collecting data at scale. Lastly, we summarise the work completed in the last quarter and the work planned for the next quarter.

3.1 Types of data

The aim of the project is to collect the following types of data:
Pre-training data: This would include raw monolingual corpora for MT and NLU, raw audio data for ASR, and raw (simulated) document/scene images for OCR. This data would largely be scraped from online sources and would require limited manual intervention, following which a small sample of the data would be manually verified.
(Noisy/Mined) Training data: This would include (i) mined translation pairs for training MT models (a minimal mining sketch is given at the end of this subsection), (ii) ASR data scraped from YouTube videos as well as government sources (News On AIR, Prasar Bharati, etc.), (iii) machine-translated training data for Sentiment Analysis, QA and NER, and (iv) simulated document images for OCR. A small portion of this data would be verified by humans to estimate quality.
Fine-tuning data: Unfortunately, for some of the very low resource languages it would be infeasible to mine noisy training data from the web (for example, we were able to scrape only 70,000 sentences for Santali from all news sources and Wikipedia). Hence, a small amount of fine-tuning data would be manually created for these languages (as no free training data would be available).
Benchmark data: This will be clean, high quality, human-created data which will be used for evaluating the models trained using the above data.
The above data would be collected using five different modes: (i) curated from government sources on the web (e.g., News On AIR); (ii) curated from non-government sources on the web (e.g., Times of India); (iii) curated from sources which are free of any copyright (e.g., books whose copyright period has expired); (iv) collected manually using crowdsourcing platforms with the explicit consent of the participants; (v) collected manually using in-house or outsourced annotators who are explicitly paid for the content. We will contribute all of this data to NLTM's Hundi, and NLTM, in turn, can release this data according to its data policy.
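To make the mining recipe above concrete, here is a minimal sketch of threshold-based bitext mining with multilingual sentence embeddings. This is an illustration of the general technique, not our production pipeline: the LaBSE model choice, the 0.8 threshold and the toy sentences are assumptions for the example.

```python
# Minimal sketch of embedding-based bitext mining (illustrative only).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

en_sents = ["The train leaves at nine.", "Water boils at 100 degrees."]
hi_sents = ["पानी 100 डिग्री पर उबलता है।", "ट्रेन नौ बजे छूटती है।"]

# L2-normalised embeddings make cosine similarity a plain dot product.
en_emb = model.encode(en_sents, normalize_embeddings=True)
hi_emb = model.encode(hi_sents, normalize_embeddings=True)
sim = en_emb @ hi_emb.T

# Keep mutual best matches above an (assumed) similarity threshold.
THRESHOLD = 0.8
for i in range(len(en_sents)):
    j = int(np.argmax(sim[i]))
    if sim[i, j] > THRESHOLD and int(np.argmax(sim[:, j])) == i:
        print(f"{en_sents[i]} <-> {hi_sents[j]} (score={sim[i, j]:.2f})")
```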

3.2 Current state of data

The Table below captures the state of data for all the tasks of interest for the 22 constitutionally recognised languages. We list only those sources which are publicly available and have a sizeable amount of data. For MT, there are multiple sources which have already been collated (listed as a single AI4Bharat entry in the table); hence, we do not list these individual sources separately.
State of Data: March 2022

| # | Task | Type of Data | Source | Unit | as | bn | brx | doi | gu | hi | kn | ks | gom | mai | ml | mni | mr | ne | or | pa | san | sat | sd | ta | te | ur |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | MT | Noisy training | AI4Bharat | Sentences | 141227 | 8604580 | 0 | 0 | 3067790 | 10125706 | 4093524 | 0 | 0 | 0 | 5924426 | 0 | 3627480 | 0 | 998228 | 2980383 | 0 | 0 | 0 | 5264867 | 4946035 | 0 |
| 2 | MT | Benchmark | FLORES | Sentences | 1012 | 1012 | 0 | 0 | 1012 | 1012 | 1012 | 0 | 0 | 0 | 1012 | 0 | 1012 | 1012 | 1012 | 1012 | 0 | 0 | 0 | 1012 | 1012 | 1012 |
| 3 | MT | Benchmark | WAT 21 | Sentences | 2390 | 2390 | 0 | 0 | 2390 | 2390 | 2390 | 0 | 0 | 0 | 2390 | 0 | 2390 | 0 | 2390 | 2390 | 0 | 0 | 0 | 2390 | 2390 | 0 |
| 4 | ASR | Pre-training | AI4Bharat | Hours | 843 | 1035 | 64 | 614 | 1061 | 1075 | 1012 | 436 | 499 | 38 | 857 | 464 | 1054 | 707 | 1018 | 863 | 500 | 9 | 107 | 1012 | 1052 | 721 |
| 5 | ASR | Fine-tuning | MUCS + MSR | Hours | 0 | 0 | 0 | 0 | 40 | 95.05 | 0 | 0 | 0 | 0 | 0 | 0 | 93.89 | 0 | 94.54 | 0 | 0 | 0 | 0 | 40 | 40 | 0 |
| 6 | ASR | Fine-tuning | OpenSLR | Hours | 0 | 70.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 70.2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | ASR | Fine-tuning | IITM | Hours | 0 | 0 | 0 | 0 | 0 | 178.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 112.2 | 0 | 0 |
| 8 | ASR | Benchmark | MUCS | Hours | 0 | 0 | 0 | 0 | 5.26 | 5.49 | 0 | 0 | 0 | 0 | 0 | 0 | 0.67 | 0 | 4.66 | 0 | 0 | 0 | 0 | 4.41 | 4.39 | 0 |
| 9 | ASR | Benchmark | OpenSLR | Hours | 0 | 4.8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10 | ASR | Benchmark | MSR | Hours | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.2 | 4.2 | 0 |
| 11 | ASR | Benchmark | IITM | Hours | 0 | 0 | 0 | 0 | 0 | 4.9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.8 | 0 | 0 |
| 12 | TTS | Fine-tuning | IITM | Hours | 27.39 | 20.07 | 9.78 | 0 | 31.69 | 41.86 | 19.16 | 0 | 0 | 0 | 20.89 | 20.75 | 16.4 | 0 | 9.66 | 27 | 0 | 0 | 0 | 53.59 | 36.71 | 0 |
| 13 | TTS | Fine-tuning | IIITH | Hours | 0 | 0 | 0 | 0 | 0 | 21.47 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14 | OCR (Scene Recognition) | Noisy training | Synthetically generated | Images | 0 | 5500000 | 0 | 0 | 5500000 | 6000000 | 4500000 | 0 | 0 | 0 | 6000000 | 0 | 0 | 0 | 2400000 | 5800000 | 0 | 0 | 0 | 6000000 | 6000000 | 5300000 |
| 15 | OCR (Scene Recognition) | Fine-tuning | AI4Bharat | Images | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 16 | OCR (Scene Recognition) | Benchmark | IIITH | Images | 0 | 0 | 0 | 0 | 920 | 3023 | 0 | 0 | 0 | 0 | 807 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2536 | 1211 | 1413 |
| 17 | OCR (Scene Recognition) | Benchmark | ICDAR | Images | 0 | 3766 | 0 | 0 | 0 | 3884 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4700 |
| 18 | OCR (Scene Recognition) | Benchmark | Kaggle | Images | 0 | 3882 | 0 | 0 | 0 | 3863 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 19 | OCR (Scene Recognition) | Benchmark | AI4Bharat | Images | 0 | 0 | 0 | 0 | 0 | 2044 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 20 | OCR (Scene Detection) | Noisy training | Synthetically Generated (AI4Bharat) | Images | - | 444759 | 0 | 0 | 404343 | 554038 | 315698 | 0 | 0 | 0 | 493711 | 0 | 0 | 0 | 425730 | 482149 | 0 | 0 | 0 | 481446 | 394680 | 500118 |
| 21 | OCR (Scene Detection) | Noisy training | Synthetically Generated (MLT-19) | Images | 0 | 48490 | 0 | 0 | 0 | 32540 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 51099 |
| 22 | OCR (Scene Detection) | Fine-tuning | IITM (Ours) | Images | - | 0 | 0 | 0 | 0 | 2031 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 23 | OCR (Scene Detection) | Fine-tuning | ICDAR-19 Train Set | Images | - | 1000 | 0 | 0 | 0 | 1000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1000 |
| 24 | OCR (Scene Detection) | Benchmark | IIIT-ILST | Images | - | 0 | 0 | 0 | 105 | 176 | 0 | 0 | 0 | 0 | 256 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 348 | 312 | 299 |

For document OCR, the available sources report counts for a subset of languages only; the reported counts are listed below.

| # | Task | Type of Data | Source | Unit | Reported Counts |
| --- | --- | --- | --- | --- | --- |
| 25 | OCR (Document) | Noisy training | Synthetically Generated (IndicOCR-v2) | Line Images | 335000 |
| 26 | OCR (Document) | Fine-tuning | Multilingual OCR (IIIT-H) | Pages | 2800; 5000; 5000; 5000; 5000; 3500; 5000; 5000; 5000; 5000 |
| 27 | OCR (Document) | Fine-tuning | iiit-indic-hw-words | Word Images | 113000; 116000; 95000; 103000; 116000; 101000; 24000; 103000; 120000; 100000 |
| 28 | OCR (Document) | Fine-tuning | IndicOCR-v2 | Line Images | 24000 |
| 29 | OCR (Document) | Benchmark | Indian Language Benchmark Portal-Offline (IIIT-H) | Line Images | 400 per language (10 languages) |

3.3 Desired characteristics of the data

For each task, viz., Machine Translation, Automatic Speech Recognition, Text-to-Speech and Optical Character Recognition (document and scene), we list the desired characteristics of the data below.
Desired characteristics

Machine Translation
Both the training and benchmark data should have the following characteristics:
- diversity in domains: The parallel sentences should cover a wide variety of domains such as Legal/Govt, History, Geography, Tourism, STEM, Religion, Business, Sports, Entertainment, Health, Culture and News.
- diversity in lengths: For every domain of interest, the following ranges of sentence lengths should have good representation: 6-10 words, 11-17 words, 18-25 words, >25 words.
- n-way parallel: A large fraction of the data should contain the same English sentences translated to all the 22 constitutionally recognised languages.
- source original: For each language, the data should contain some sentences which were originally written in that language and then translated to English.
- discourse-level translations: Instead of collecting translations of isolated sentences, it is preferable to translate entire paragraphs or collections of contiguous sentences so that the data can also be used for training/evaluating discourse-level translation models.
- downstream applicability: While collecting training data from a variety of domains is useful, one should also focus on collecting data for building practical applications, such as translation for everyday usage/conversations.

Automatic Speech Recognition
Both the training and benchmark data should have the following characteristics:
- diversity in speakers: For every language, the audio data should be collected from a wide variety of speakers having different accents (e.g., Surati vs. Vadodara), different ages (18-30, 30-45, 45-60, >60), different educational backgrounds (school level, graduate, post-graduate) and different genders.
- diversity in collection method: The data should contain a mix of read speech, extempore conversations and broadcast content such as news, educational videos and entertainment videos.
- diversity in vocabulary: The audio should contain words from a wide variety of domains.
- diversity in genres: The audio sourced from broadcast content should come from a wide variety of genres (e.g., news debates, on-field news reports, comedy shows, reality shows, how-to videos, STEM videos, etc.).
- downstream applicability: While collecting training data from a variety of domains, genres and speakers is useful, one should also focus on collecting data for building practical applications, such as voice commands for everyday usage in digital payments, e-commerce, etc.

Text-to-Speech
- high quality recording: The data should be collected in a studio setup.
- high quality voice: The data should be collected from a professional voice artist.
- diversity in content: The spoken content should contain words from a wide variety of domains such as Legal/Govt, History, Geography, Tourism, STEM, Religion, Business, Sports, Entertainment, Health, Culture and News.

Document Optical Character Recognition
- diversity in font sizes: The scanned pages should contain text written in a wide variety of font sizes.
- diversity in font types: The scanned pages should contain text written in a wide variety of font types.
- diversity in layouts: The scanned pages should have a variety of layouts (one column, two column, magazine style, newspaper style, etc.).
- diversity in background effects: The scanned pages should have a variety of background effects such as crumpling, lighting effects, scan marks, page fold marks, etc.

Scene Optical Character Recognition
- diversity in background: The images containing text should have very diverse backgrounds such as buildings, sky, trees, etc.
- diversity in angles: The images should be taken from different angles (left, right, top, bottom, etc.).
- diversity in font types: The images should contain text written in a wide variety of fonts.
- diversity in font sizes: The images should contain text written in a wide variety of sizes.
- diversity in orientation: The images should contain text with different orientations (horizontal, slanting, circular, etc.).
- diversity in ambient light: The images should be taken under different lighting conditions.

3.4 Our goals

For each of the four tasks, we describe our goals as well as the details of the data that will be collected.

3.4.1 Machine Translation

We will collect 100K parallel sentences between English and each of the 22 languages. The distribution of these 100K sentences would be as follows:
50K English sentences taken from Wikipedia and government sources from 12 different domains, viz., Legal/Government, History, Geography, Tourism, STEM, Religion, Business, Sports, Entertainment, Health, Culture and News. These sentences would be translated to all the 22 languages to create n-way parallel data. This will ensure that the parallel data has diversity in domains and contains formally written content.
30K English sentences from daily conversations in the Indian context in 20 different domains (e.g., railway stations, Indian tourist spots, etc.). These sentences would be translated to all the 22 languages to create n-way parallel data. This will ensure that the parallel data has diversity in domains and contains informally written content with a focus on everyday conversations (a primary use case of speech-to-speech translation systems).
5K English sentences corresponding to reviews of 500 popular products which will be translated to all the 22 languages to create n-way parallel data. This will again ensure that the parallel data has some commercial content and diversity in writing style.
10K English sentences taken from government acts and policies which will be translated to all the 22 languages to create n-way parallel data. This will ensure representation of content that is typically translated by government bodies.
5K sentences per language taken from books which were originally written in that regional language. For each of the 22 languages, these 5K sentences will be translated to English (this data will not be n-way parallel).
10% of the above data will be reserved as benchmark data and the rest will be used as training data.
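As a quick sanity check of the distribution above, the snippet below verifies that the components add up to 100K sentences per language and computes the 90/10 train/benchmark split (the 10% figure is from the plan; everything else is simple arithmetic).

```python
# Sanity check on the per-language MT data distribution described above.
components = {
    "wiki_and_govt_domains": 50_000,
    "everyday_conversations": 30_000,
    "product_reviews": 5_000,
    "govt_acts_and_policies": 10_000,
    "source_original_books": 5_000,
}
total = sum(components.values())
assert total == 100_000, total

benchmark = int(0.10 * total)   # 10% reserved for benchmarking
training = total - benchmark
print(f"total={total}, training={training}, benchmark={benchmark}")
# -> total=100000, training=90000, benchmark=10000
```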
Summary of MT data collection goals

| # | Domain | Source | Number of sentences | Direction | n-way parallel | Potential Challenges |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Legal | Government | 5000 | En-X | Yes | |
| 2 | Government Policies | Government | 5000 | En-X | Yes | |
| 3 | History | Wikipedia | 5000 | En-X | Yes | |
| 4 | Geography | Wikipedia | 5000 | En-X | Yes | |
| 5 | Tourism | Wikipedia | 5000 | En-X | Yes | |
| 6 | STEM | Wikipedia | 5000 | En-X | Yes | |
| 7 | Business | Wikipedia | 5000 | En-X | Yes | |
| 8 | Sports | Wikipedia | 5000 | En-X | Yes | |
| 9 | Entertainment | Wikipedia | 5000 | En-X | Yes | |
| 10 | Health | Wikipedia | 5000 | En-X | Yes | |
| 11 | Culture | Wikipedia | 5000 | En-X | Yes | |
| 12 | News | Wikipedia | 5000 | En-X | Yes | |
| 13 | Everyday conversations | AI4Bharat | 30000 | En-X | Yes | |
| 14 | Product Reviews | AI4Bharat | 5000 | En-X | Yes | |
| 15 | Literature | Books | 5000 | X-En | No | Source original content is mainly derived from books written in that language; taking content from such books may have copyright issues. |

3.4.2 Automatic Speech Recognition

For collecting labeled data for training ASR models, we will adopt two strategies: (i) collect data from the field to ensure speaker diversity and coverage of specific content which is hard to obtain elsewhere (e.g., voice commands), and (ii) label existing data from news, entertainment and educational content.
Collecting data from the field: We will collect data from 600 speakers spread across districts, wherein each speaker will:
Read 100 sentences (~10 minutes)
Speak 200 voice commands (~10 minutes)
Participate in an extempore get-to-know-me interview (~10 minutes)
Read 100 English sentences (only for the subset of speakers who also speak English; this will ensure that we also collect Indian-accent English data in the field).
This will ensure that we collect data which has (i) high speaker diversity (number and variety), (ii) high content diversity (the 100 sentences will come from a larger pool of 50,000 diverse sentences from different domains) and (iii) high downstream applicability (voice commands catering to a variety of use cases).
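The per-speaker targets above imply roughly 30 minutes of audio per speaker, which is where the 300-hour on-field figure in the summary table below comes from. A quick back-of-the-envelope check (English read speech excluded, since only a subset of speakers record it):

```python
# Back-of-the-envelope estimate for on-field ASR collection.
speakers = 600
minutes_per_speaker = 10 + 10 + 10  # read speech + voice commands + interview
total_hours = speakers * minutes_per_speaker / 60
print(total_hours)  # 300.0 hours, matching the on-field target below
```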
Labelling existing audio/video data: We will label existing data from YouTube and the content/media industry. This data can be further split into the following types:
News: This will be primarily sourced from news channels and can be further categorised into the following types:
Headlines: This is content of the type “Top 20 headlines of the hour”, which does not have high speaker diversity but has peculiar characteristics like jarring background music.
On-field reporting: This is content of the type “…with cameraman Prakash” (“cameraman Prakash ke saath…”), which is extempore, has background noise and involves common people on the ground.
Debates: Such content would have diversity in topics (CAA, killing of X, foreign policy of India, etc.) and would also have peculiar characteristics like emotional outbursts, overlapping chatter, etc.
Interviews: Such content would involve a news anchor and 1-2 experts and would cater to a variety of topics. The experts do not follow a script, so the content has the flavour of natural speech.
Special reports: Such content involves people on the ground and has good vocabulary spanning multiple domains.
Entertainment: This will be primarily sourced from entertainment channels and would include content from different genres: family shows, comedy shows, crime shows, reality shows, cooking shows, travel shows and songs.
Education: This will be primarily sourced from education channels and would contain content from STEM, Health and How-to videos.
Call-centre: This will be primarily sourced from call centres catering to one or more of the following domains: agriculture, legal, banking, insurance and health.
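Before long-form broadcast or call-centre audio can be transcribed, it typically has to be segmented into utterance-sized chunks. Below is a minimal sketch using silence-based splitting; the pydub thresholds and the input file name are assumptions that would need tuning per channel (and in practice a VAD model may work better for noisy debate audio).

```python
# Minimal sketch: split long audio into utterance-sized chunks before
# sending them for transcription. Thresholds are illustrative only.
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_file("news_bulletin.mp3")  # hypothetical input
chunks = split_on_silence(
    audio,
    min_silence_len=500,             # ms of silence that marks a boundary
    silence_thresh=audio.dBFS - 16,  # relative threshold, channel-dependent
    keep_silence=200,                # keep some padding around each chunk
)
for i, chunk in enumerate(chunks):
    chunk.export(f"chunk_{i:04d}.wav", format="wav")
```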

Our ASR data collection goals are summarised in the Table below.
Summary of ASR data collection goals

| # | Type of Data | Number of Hours (medium-resource/low-resource languages) | Potential Challenges |
| --- | --- | --- | --- |
| 1 | On-field data collection | 300 | |
| 2 | Read speech | 100 | The clean sentences to be read by a speaker may either come from native-language books or from the translations of English sentences collected by AI4Bharat. If it is the former, we will have copyright issues; if it is the latter, we will have to delay this activity until we have enough sentences translated. |
| 3 | Voice commands | 100 | |
| 4 | Extempore conversation | 100 | |
| 5 | Labelling existing audio data | 700/200 | For low resource languages such as Assamese, Bodo, Dogri, Kashmiri, Konkani, Maithili, Manipuri, Nepali, Odia, Sanskrit and Santali, we will label only 200 hours of existing audio data. |
| 6 | News | 200/100 | We will depend on content providers such as Prasar Bharati for the raw audio content. As of now it is not clear whether such content will be shared with us and whether it will have enough diversity. We may have to revise these goals based on the availability of such data. |
| 7 | Entertainment | 200/50 | Same as for News. |
| 8 | Education | 200/50 | Same as for News. |
| 9 | Call centre data | 100/0 | We will depend on government/private call centres to share the raw audio data with us. Given the general privacy concerns around such data, we are not sure if we will get access to it. We may have to revise these goals based on the availability of such data. |

3.4.3 Text-to-speech

The text-to-speech data will be collected with the help of professional voice artists hired through a production studio. For each language, we will collect 20 hours of data from a male artist and 20 hours from a female artist. The artists will be given prompts from multiple domains, derived from the following two sources:
Source original content from books: As mentioned earlier (section 3.4.1), we will be sourcing around 5,000 sentences from books which were originally written in the native language. In addition to translating these sentences to English, we will also use them as prompts for the voice artists. Such content taken from books is typically rich in sentence structure, vocabulary and emotion, and is hence ideal for recording by professional artists.
Translations from multiple domains: As mentioned earlier (section 3.4.1), we will be translating English sentences taken from multiple domains into regional languages. These translated sentences will contain diverse content from multiple domains and will be provided as prompts for the voice artists. This will ensure that the recorded content has a good representation of domains and broader coverage of vocabulary.
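The hour figures in the table below follow from the prompt counts under an assumed average utterance length: at roughly 3.6 seconds per spoken sentence (an assumption for illustration; real durations vary with language and content), 20,000 prompts yield about 20 hours per voice.

```python
# Rough mapping from prompt counts to recorded hours (per voice artist).
# The 3.6 s average utterance length is an assumption for illustration.
avg_seconds = 3.6
for prompts, label in [(5_000, "source-original"), (15_000, "translations")]:
    hours = prompts * avg_seconds / 3600
    print(f"{label}: {prompts} prompts ~= {hours:.0f} hours")
# source-original: 5000 prompts ~= 5 hours
# translations: 15000 prompts ~= 15 hours
```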
Summary of TTS data collection goals

| # | Prompts | Number of sentences | Characteristics | Hours (approx.) | Potential challenges |
| --- | --- | --- | --- | --- | --- |
| 1 | Source original content | 5000 | rich in native language structure, vocabulary and emotions | 5 | Source original content is mainly derived from books written in that language; taking content from such books may have copyright issues. |
| 2 | Translations from multiple domains | 15000 | good representation of domains and broader coverage of vocabulary | 15 | Will depend on the translations collected for MT data, and hence there might be some delay before this activity can be started. |

3.4.4 Optical Character Recognition

Our focus will be on creating datasets for two tasks:
Scene Text Recognition: We will create a benchmark of 10,000 images for each language. These would be natural images having text written in (i) a wide variety of fonts, colours, designs and sizes, (ii) different orientations (straight, angular, circular, etc.) and (iii) multiple languages in the same image.
Layout Detection: For this, we will create simulated data where different layout templates are created and regional language content is inserted into these layouts. These layouts would (i) use a wide variety of fonts, (ii) contain content which is italicised/bold, (iii) have document structure such as sections, sub-sections, indentation, paragraphs and bullet points, (iv) contain figures (the text in the figures should also be recognised) and (v) contain tables with multiple columns, where some columns have multi-column headings and some cells span multiple rows. We will focus only on machine-generated PDFs and not scanned PDFs.
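To illustrate the simulation step for layout detection, here is a toy sketch that renders regional-language text into a simple two-column template and records ground-truth boxes. The font path, page geometry and placeholder text are assumptions; a real pipeline would randomise templates and add degradations, figures and tables.

```python
# Toy sketch of synthetic page generation for layout detection.
# Font path and page geometry are assumptions; a real pipeline would
# randomise templates and add degradations (noise, folds, scan marks).
from PIL import Image, ImageDraw, ImageFont

PAGE_W, PAGE_H, MARGIN, GUTTER = 1240, 1754, 100, 40  # ~A4 at 150 dpi
font = ImageFont.truetype("NotoSansDevanagari-Regular.ttf", 28)  # assumed font

page = Image.new("RGB", (PAGE_W, PAGE_H), "white")
draw = ImageDraw.Draw(page)

col_w = (PAGE_W - 2 * MARGIN - GUTTER) // 2
lines = ["यह एक उदाहरण पंक्ति है।"] * 20  # placeholder body text
boxes = []  # ground-truth (x0, y0, x1, y1, label) annotations

for col in range(2):
    x = MARGIN + col * (col_w + GUTTER)
    y = MARGIN
    for line in lines:
        draw.text((x, y), line, font=font, fill="black")
        boxes.append((*draw.textbbox((x, y), line, font=font), "text_line"))
        y += 40

page.save("synthetic_page.png")
```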
Summary of OCR (scene) goals

Every image source below will be covered with the same diversity axes:
- Text orientation: straight horizontal, slanting horizontal, straight vertical, slanting vertical, curved
- Zoom: zoomed in, zoomed out, normal
- Image quality: high resolution, low resolution
- Lighting condition: daylight, night, dim light, artificial light
- Viewpoint of photographer: top, below, normal
- Distance of photographer: far, close, normal

| # | Image Source | Number of images |
| --- | --- | --- |
| 1 | Signboards | 1250 |
| 2 | Billboards | 1250 |
| 3 | Movie Posters | 1250 |
| 4 | Shop Banners | 1250 |
| 5 | Political Banners | 1250 |
| 6 | Railway Station Boards | 1250 |
| 7 | Advertisements | 1250 |
| 8 | Highway Milestones/Mile Markers | 1250 |
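Since every image source covers the same diversity axes, a capture checklist can be enumerated mechanically. The sketch below counts the attribute combinations: 1,080 per source against a 1,250-image target, so most combinations would appear roughly once.

```python
# Enumerate the diversity axes from the scene-OCR goals into a capture
# checklist; counts are per image source.
from itertools import product

axes = {
    "orientation": ["straight horizontal", "slanting horizontal",
                    "straight vertical", "slanting vertical", "curved"],
    "zoom": ["zoomed in", "zoomed out", "normal"],
    "quality": ["high resolution", "low resolution"],
    "lighting": ["daylight", "night", "dim light", "artificial light"],
    "viewpoint": ["top", "below", "normal"],
    "distance": ["far", "close", "normal"],
}
combinations = list(product(*axes.values()))
print(len(combinations))  # 1080 combinations vs. a 1250-image target/source
```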
Summary of OCR (layout detection) goals

Every document source below is synthetic, with horizontal text orientation, zoom levels of [0.5x, 1x, 2x], image quality in [Clean, 2D Degraded, Botched] and lighting in [Normal, Modified]:

| # | Image Source | Number of images |
| --- | --- | --- |
| 1 | Books [Synthetic] | 2000 |
| 2 | Question Papers [Synthetic] | 2000 |
| 3 | Application Forms [Synthetic] | 2000 |
| 4 | Receipts & Invoices [Synthetic] | 2000 |
| 5 | Letters/Orders/Notices/Prescriptions [Synthetic] | 2000 |
| 6 | Legal Documents & Court Judgements [Synthetic] | 2000 |

3.4.5 Summary

The chart below summarises our goals as well as the flow of data collection.
[Chart: summary of data collection goals and flow]

3.5 How will we achieve these goals?

To achieve these goals, we are taking a four-pronged approach.
Recruiting in-house translators: For all the 22 languages, we are hiring a team of 5 junior language experts (translators/annotators/transcribers) and 2 senior language experts. These experts will be directly on the payroll of AI4Bharat.
Partnering with universities: For Kashmiri, Urdu, Konkani and Marathi, we will partner with specific academic institutes (Goa University, Mumbai University, University of Kashmir).
Partnering with the social sector: For 8 languages (Assamese, Bodo, Dogri, Maithili, Manipuri, Nepali, Sanskrit, Santali), we have partnered with entities or individuals working in the social sector.
Outsourcing to data collection agencies: For TTS, where we need to collect studio-quality data from professional voice artists, we will outsource the activity to 3S Studio (recommended by our colleagues at IITM, who have done a fair amount of data collection with them in the past). Similarly, for voice collection for 11 languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu, Urdu), we will partner with an external data collection agency, whose help will be needed in collecting voice samples from every district in the country. In addition, for some languages where it is difficult to find language experts, we may partner with data collection agencies to collect translations as well.
The table below summarises our plan for data collection.
Summary of Plan for Data Collection

| # | Language | Machine Translation | ASR | TTS | OCR |
| --- | --- | --- | --- | --- | --- |
| 1 | Assamese | in-house translators | | 3S Studio | |
| 2 | Bengali | in-house translators | Outsourced | 3S Studio | Outsourced |
| 3 | Bodo | in-house translators | | 3S Studio | |
| 4 | Dogri | in-house translators | | 3S Studio | |
| 5 | Gujarati | in-house translators | Outsourced | 3S Studio | Outsourced |
| 6 | Hindi | in-house translators | Outsourced | 3S Studio | Outsourced |
| 7 | Kannada | in-house translators | Outsourced | 3S Studio | Outsourced |
| 8 | Kashmiri | in-house translators | | 3S Studio | |
| 9 | Konkani | in-house translators | | 3S Studio | |
| 10 | Maithili | in-house translators | | 3S Studio | |
| 11 | Malayalam | in-house translators | Outsourced | 3S Studio | Outsourced |
| 12 | Manipuri | in-house translators | | 3S Studio | |
| 13 | Marathi | in-house translators | Mumbai University | 3S Studio | Mumbai University |
| 14 | Nepali | in-house translators | | 3S Studio | |
| 15 | Odia | in-house translators | TBD | 3S Studio | TBD |
| 16 | Punjabi | in-house translators + Outsourced | Outsourced | 3S Studio | Outsourced |
| 17 | Sanskrit | in-house translators | | 3S Studio | |
| 18 | Santali | in-house translators | | 3S Studio | |
| 19 | Sindhi | in-house translators + Outsourced | TBD | 3S Studio | TBD |
| 20 | Tamil | in-house translators | Outsourced | 3S Studio | Outsourced |
| 21 | Telugu | in-house translators | Outsourced | 3S Studio | Outsourced |
| 22 | Urdu | in-house translators | Outsourced | 3S Studio | |

3.6 What are our timelines?

Our timelines for data collection are summarised in the Table below.
DMU Timelines

Y1-Q1 (Apr-Jun 2022)
- Develop Shoonya v1, an open-source tool for collecting MT, ASR and NLU datasets for all the 22 languages.
- Set up teams of language experts (annotators, translators, transcribers) for all the 22 languages.
- Run a pilot for on-field collection of 100 hours of voice data for Tamil.
- Collect a total of 50K English sentences from diverse domains, which will subsequently be translated to the 22 Indian languages.
- Collect a total of 50K sentences of everyday conversational content in English, which will subsequently be translated to the 22 Indian languages.
- Release 1M mined English-X parallel sentences for 11 languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Punjabi, Tamil, Telugu, Urdu.
- Release 500 hours of mined ASR data for 11 languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu, Urdu.

Y1-Q2
12 Phase 1 languages (P1): Assamese, Bengali, Gujarati, Hindi, Kannada, Maithili, Malayalam, Manipuri, Marathi, Sanskrit, Tamil, Urdu.
10 Phase 2 languages (P2): Bodo, Dogri, Kashmiri, Konkani, Nepali, Odia, Punjabi, Santali, Sindhi, Telugu.
- Continue developing Shoonya, the open-source tool for collecting MT, ASR and NLU datasets for all the 22 languages.
- Create an MT benchmark containing 10K En-X parallel sentences for P1 languages.
- Create an ASR benchmark of 25 hours for P1 languages containing (a) read speech, (b) voice commands, (c) transcribed extempore conversations, (d) transcribed news content, (e) transcribed education content and (f) transcribed entertainment content.
- Create 10 hours of TTS data for P1 languages.
- Release synthetic training data containing 100K images each for document layout detection, document text recognition and scene text recognition for all the 22 languages.

Y1-Q3
- Continue developing Shoonya, the open-source tool for collecting MT, ASR and NLU datasets for all the 22 languages.
- Create an MT benchmark containing 10K En-X parallel sentences for P2 languages.
- Create an ASR benchmark of 50 hours for P2 languages containing (a) read speech, (b) voice commands, (c) transcribed extempore conversations, (d) transcribed news content, (e) transcribed education content and (f) transcribed entertainment content.
- Create 10 hours of TTS data for P2 languages.

Y1-Q4
- Create 30K En-X parallel sentences (fine-tuning data) for all 22 languages.
- Create 100 hours of ASR data for all 22 languages.
- Create 10 hours of TTS data for all 22 languages.
- Create a benchmark for Scene Text Recognition containing 500 images for all 22 languages (13 scripts).
- Create a benchmark for document OCR containing 500 scanned pages for all 22 languages (13 scripts).

Y2-Q1
- Create 30K En-X parallel sentences (fine-tuning data) for all 22 languages.
- Create 100 hours of ASR data for all 22 languages.
- Create 10 hours of TTS data for all 22 languages.
- Create a benchmark for Scene Text Recognition containing an additional 500 images for all 22 languages (13 scripts).
- Create a benchmark for document OCR containing 500 scanned pages for all 22 languages (13 scripts).

Y2-Q2
- Create 40K En-X parallel sentences (fine-tuning data) for all 22 languages.
- Create 100 hours of ASR data for all 22 languages.
- Create 10 hours of TTS data for all 22 languages.

Y2-Q3
- Create 100 hours of ASR data for all 22 languages.
- Create 5K QA pairs for all 22 languages.
- Create 5K NER-tagged sentences for all 22 languages.
- Create 5K sentiment-labeled sentences for all 22 languages.

Y2-Q4
- Create 100 hours of ASR data for all 22 languages.
- Create 5K QA pairs for all 22 languages.
- Create 5K NER-tagged sentences for all 22 languages.
- Create 5K sentiment-labeled sentences for all 22 languages.
- Create 100K translated QA pairs (noisy training data) for all 22 languages.
- Create 100K noisy NER sentences (translation + projection) for all 22 languages.
- Create 100K translated SA sentences for all 22 languages.

Y3-Q1
- Build and release version 1 of ASR, TTS, MT and NLU models for P1 languages.

Y3-Q2
- Build and release version 1 of ASR, TTS, MT and NLU models for P2 languages.

Y3-Q3
- Build and release version 2 of ASR, TTS, MT and NLU models for P1 languages.

Y3-Q4
- Build and release version 2 of ASR, TTS, MT and NLU models for P2 languages.

Appendix

A.1 Guidelines for MT data collection

Have requested Prof. Pushpak to share guidelines from past projects.

A.2 Guidelines for ASR data collection

Have requested Prof. Umesh to share guidelines from past projects.

A.3 Guidelines for TTS data collection

Have requested Prof. Hema to share guidelines from past projects.

A.4 Guidelines for OCR data collection














