In this section, we first provide a quick summary of the current state of data for four important tasks in language technology, viz., machine translation (MT), automatic speech recognition (ASR), text-to-speech (TTS) and optical character recognition (OCR). We then define the desired characteristics of data for each of these tasks, followed by an outline of our plan for collecting data at scale for these tasks. Lastly, we summarise the tasks completed in the last quarter and the tasks that will be taken up in the next quarter.
3.1 Types of data
The aim of the project is to collect the following types of data:
- Pre-training data: This would include raw monolingual corpora for MT and NLU, raw audio data for ASR, and raw (simulated) document/scene images for OCR. This data would largely be scraped from online sources and would require limited manual intervention, following which a small sample of the data would be manually verified.
- (Noisy/mined) training data: This would include (i) mined translation pairs for training MT models, (ii) ASR data scraped from YouTube videos as well as government sources (News On AIR, Prasar Bharati, etc.), (iii) machine-translated training data for Sentiment Analysis, QA and NER, and (iv) simulated document images for OCR. A small portion of this data would be verified by humans to estimate its quality.
- Fine-tuning data: Unfortunately, for some of the very low resource languages it would be infeasible to mine noisy training data from the web (for example, we were able to scrape only 70,000 sentences for Santali from all news sources and Wikipedia). Hence, a small amount of fine-tuning data would be manually created for these languages (as no free training data would be available).
- Benchmark data: This will be clean, high quality, human-created data which will be used for evaluating the models trained using the above data.

The above data would be collected using five different modes: (i) curated from government sources on the web (e.g., News on AIR), (ii) curated from non-government sources on the web (e.g., Times of India), (iii) curated from sources which are free of any copyright (e.g., books whose copyright period has expired), (iv) collected manually using crowdsourcing platforms with the explicit consent of the participants, and (v) collected manually using in-house or outsourced annotators who are explicitly paid for the content. We will contribute all of this data to NLTM's Hundi and NLTM, in turn, can release this data according to its data policy.
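To make this taxonomy concrete, the sketch below shows one way a collected item could be tagged by data type and collection mode. This is purely illustrative (the class and field names are our own, not an actual project API):

```python
from dataclasses import dataclass
from enum import Enum

class DataType(Enum):
    PRE_TRAINING = "pre-training"      # raw monolingual text / audio / images
    MINED_TRAINING = "mined-training"  # noisy mined or scraped training data
    FINE_TUNING = "fine-tuning"        # small, manually created
    BENCHMARK = "benchmark"            # clean, human created, for evaluation

class CollectionMode(Enum):
    GOV_WEB = 1          # e.g., News on AIR
    NON_GOV_WEB = 2      # e.g., Times of India
    COPYRIGHT_FREE = 3   # e.g., books whose copyright has expired
    CROWDSOURCED = 4     # with explicit participant consent
    PAID_ANNOTATORS = 5  # in-house or outsourced, paid for the content

@dataclass
class Record:
    language: str
    task: str                     # "MT", "ASR", "TTS", "OCR", ...
    data_type: DataType
    mode: CollectionMode
    human_verified: bool = False  # True for the verified sample

# Example: a mined MT sentence pair for Santali scraped from a government site
r = Record("Santali", "MT", DataType.MINED_TRAINING, CollectionMode.GOV_WEB)
```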
3.2 Current state of data
The Table below captures the state of data for all the tasks of interest for the 22 constitutionally recognised languages. We list only those sources which are publicly available and have a sizeable amount of data. For MT, there are multiple sources which have already been collated, so we do not list these individual sources separately.

State of Data: March 2022
3.3 Desired characteristics of the data
For each of the four tasks, viz., Machine Translation, Automatic Speech Recognition, Text-to-Speech and Optical Character Recognition, we list the desired characteristics of the data in the Table below.
3.4 Our goals
For each of the four tasks, we describe our goals as well as the details of the data that will be collected.
3.4.1 Machine Translation
We will collect 100K parallel sentences between English and each of the 22 languages. The distribution of these 100K sentences would be as follows:
- 50K English sentences taken from Wikipedia and government sources, spanning 13 different domains, viz., Legal, Government, History, Geography, Tourism, STEM, Religion, Business, Sports, Entertainment, Health, Culture and News. These sentences would be translated into all the 22 languages to create n-way parallel data. This will ensure that the parallel data has diversity in domains and contains formally written content.
- 30K English sentences from daily conversations in the Indian context across 20 different domains (e.g., railway stations, Indian tourist spots, etc.). These sentences would be translated into all the 22 languages to create n-way parallel data. This will ensure that the parallel data has diversity in domains and contains informally written content with a focus on everyday conversations (a primary use case of speech-to-speech translation systems).
- 5K English sentences corresponding to reviews of 500 popular products, which will be translated into all the 22 languages to create n-way parallel data. This will ensure that the parallel data has some commercial content and diversity in writing style.
- 10K English sentences taken from government acts and policies, which will be translated into all the 22 languages to create n-way parallel data. This will ensure representation of content that is typically translated by government bodies.
- 5K sentences per language taken from books which were originally written in that regional language. For each of the 22 languages, these 5K sentences will be translated into English (this data will not be n-way parallel).

10% of the above data will be reserved as benchmark data and the rest will be used as training data; these totals are worked out in the sketch below.
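As a sanity check on the numbers above, a small back-of-the-envelope computation (a sketch; we assume the 10% benchmark fraction is applied uniformly across all five buckets):

```python
# Per-language-pair sentence counts from the plan above
wiki_gov = 50_000        # formal content, 13 domains, n-way parallel
conversations = 30_000   # everyday conversations, 20 domains, n-way parallel
reviews = 5_000          # reviews of 500 popular products, n-way parallel
acts_policies = 10_000   # government acts and policies, n-way parallel
books = 5_000            # regional-language books -> English (not n-way)

total_per_pair = wiki_gov + conversations + reviews + acts_policies + books
assert total_per_pair == 100_000   # matches the stated 100K goal

benchmark = int(0.10 * total_per_pair)  # 10K reserved for evaluation
training = total_per_pair - benchmark   # 90K available for training

# Across all 22 English-X language pairs
print(total_per_pair * 22)  # 2,200,000 translated sentences overall
```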
Summary of MT data collection goals
3.4.2 Automatic Speech Recognition
For collecting labeled data for training ASR models, we will adopt two strategies: (i) collect data from the field to ensure speaker diversity and coverage of specific content which is hard to obtain elsewhere (e.g., voice commands) (ii) label existing data from news, entertainment and educational content.
Collecting data from the field: We will collect data from 600 speakers spread across districts wherein each speaker will:
- Read 100 sentences (~10 minutes)
- Speak 200 voice commands (~10 minutes)
- Participate in an extempore get-to-know-me interview (~10 minutes)
- Read 100 English sentences (only for the subset of speakers who also speak English; this will ensure that we also collect Indian-accented English data in the field)

This will ensure that we collect data which has (i) high speaker diversity (number and variety), (ii) high content diversity (the 100 sentences will come from a larger pool of 50,000 diverse sentences from different domains), and (iii) high downstream applicability (voice commands catering to a variety of use cases). The collection volume implied by these numbers is sketched below.
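A rough estimate of the field-collection volume (a sketch; we assume the 600 speakers are recruited per language, that every speaker completes the three core activities, and we ignore the optional English readings):

```python
speakers = 600
minutes_per_speaker = 10 + 10 + 10  # read sentences + voice commands + interview

total_hours = speakers * minutes_per_speaker / 60
print(total_hours)  # 300.0 hours of field-collected audio per language
```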
Labelling existing audio/video data: We will label existing data from YouTube and the content/media industry. This data can be further split into the following types:
- News: This will be primarily sourced from news channels and can be further categorised into the following types:
  - Headlines: Content of the type "Top 20 headlines of the hour", which does not have high speaker diversity but has peculiar characteristics like jarring background music.
  - On-field reporting: Content of the type "cameraman Prakash ke saath…" ("with cameraman Prakash…"), which is extempore, has background noise and involves common people on the ground.
  - Debates: Such content would have diversity in topics (CAA, killing of X, foreign policy of India, etc.) and will also have peculiar characteristics like emotional outbursts, overlapping chatter, etc.
  - Interviews: Such content would involve a news anchor and 1-2 experts and caters to a variety of topics. The experts do not follow a script, so the content has the flavour of natural speech.
  - Special reports: Such content involves people on the ground and has good vocabulary spanning multiple domains.
- Entertainment: This will be primarily sourced from entertainment channels and would include content from different genres: family shows, comedy shows, crime shows, reality shows, cooking shows, travel shows and songs.
- Education: This will be primarily sourced from education channels and would contain content from STEM, Health and How-to videos.
- Call-centre: This will be primarily sourced from call centres catering to one or more of the following domains: agriculture, legal, banking, insurance and health.
Our ASR data collection goals are summarised in the Table below.
Summary of ASR data collection goals
3.4.3 Text-to-speech
The Text-to-speech data will be collected with the help of professional voice artists hired through a production studio. For each language, we will collect 20 hours of data from a male artist and 20 hours of data from a female artist. The artists will be given prompts from multiple domains. The prompts will be derived from the following two sources:
- Source original content from books: As mentioned earlier (Section 3.4.1), we will be sourcing around 5,000 sentences from books which were originally written in the native language. In addition to translating these sentences into English, we will also use them as prompts for the voice artists. Such content taken from books is typically rich in sentence structure, vocabulary and emotion, and is hence ideal for recording by professional artists.
- Translations from multiple domains: As mentioned earlier (Section 3.4.1), we will be translating English sentences taken from multiple domains into the regional languages. These translated sentences will contain diverse content from multiple domains and will be provided as prompts for the voice artists. This will ensure that the recorded content has a good representation of domains and broader coverage of vocabulary.

The total recording volume implied by these goals is sketched below.

Summary of TTS Data Collection Goals
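As a quick tally of the TTS recording goal (a sketch; we assume 22 languages, each with one male and one female artist recording 20 hours each, as stated above):

```python
languages = 22
hours_per_artist = 20
artists_per_language = 2  # one male, one female

total_hours = languages * artists_per_language * hours_per_artist
print(total_hours)  # 880 hours of studio-quality TTS recordings
```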
3.4.4 Optical Character Recognition
Our focus will be on creating datasets for two tasks:
- Scene Text Recognition: We will create a benchmark of 10,000 images for each language. These would be natural images having text written in (i) a wide variety of fonts, colours, designs and sizes, (ii) different orientations (straight, angular, circular, etc.), and (iii) multiple languages within the same image.
- Layout Detection: For this, we will create simulated data wherein different layout templates are created and regional language content is inserted into these layouts (a generation sketch follows the goal tables below). These layouts would (i) have a wide variety of fonts, (ii) have content which is italicised/bold, (iii) have document structure such as sections, sub-sections, indentation, paragraphs and bullet points, (iv) contain figures (the text in the figures should also be recognised), and (v) contain tables with multiple columns, where some columns have multi-column headings and some cells span multiple rows. We will focus only on machine-generated PDFs and not scanned PDFs.

Summary of OCR (scene) goals
Viewpoint of Photographer
Summary of OCR (layout detection) goals
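To illustrate how such simulated layout data could be generated, below is a minimal sketch using Pillow (our illustrative choice of library; the font path, template and text are hypothetical placeholders, and a real pipeline would add tables, figures and richer templates):

```python
from PIL import Image, ImageDraw, ImageFont

def render_page(text_blocks, font_path, page_size=(1240, 1754)):
    """Render labelled text blocks onto a blank page and return the image
    along with bounding-box ground truth for layout detection."""
    page = Image.new("RGB", page_size, "white")
    draw = ImageDraw.Draw(page)
    # Complex-script shaping may require Pillow built with libraqm
    font = ImageFont.truetype(font_path, size=28)

    ground_truth = []
    y = 60
    for label, text in text_blocks:  # e.g., ("heading", "..."), ("paragraph", "...")
        x = 60
        bbox = draw.textbbox((x, y), text, font=font)  # rendered extent of the block
        draw.text((x, y), text, font=font, fill="black")
        ground_truth.append({"label": label, "bbox": bbox})
        y = bbox[3] + 30  # move below the block just drawn

    return page, ground_truth

# Hypothetical usage with a Devanagari-capable font
page, gt = render_page(
    [("heading", "अनुभाग १"), ("paragraph", "यह एक उदाहरण वाक्य है।")],
    font_path="NotoSansDevanagari-Regular.ttf",
)
page.save("simulated_page.png")
```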
3.4.5 Summary
The chart below summarises our goals as well as the flow of data collection.
3.5 How will we achieve these goals?
To achieve these goals, we are taking a 4-pronged approach.
Recruiting in-house translators: For all the 22 languages, we are hiring a team of 5 junior language experts (translators/annotators/transcribers) and 2 senior language experts. These experts will be directly on the payroll of AI4Bharat.
Partnering with universities: For Kashmiri, Urdu, Konkani and Marathi, we will partner with specific academic institutes (Goa University, Mumbai University and Kashmir University).
Partnering with the social sector: For 8 languages (Assamese, Bodo, Dogri, Maithili, Manipuri, Nepali, Sanskrit, Santali) we have partnered with entities or individuals working in the social sector.
Outsourcing to data collection agencies: For TTS, where we need to collect studio-quality data from professional voice artists, we will be outsourcing the activity to 3S Studio (recommended by our colleagues at IITM, who have done a fair amount of data collection with them in the past). Similarly, for voice collection for 11 languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu, Urdu), we will be partnering with an external data collection agency, whose help will be needed in collecting voice samples from every district in the country. In addition, for some languages where it is difficult to find language experts, we may partner with data collection agencies to collect translations as well.
The table below summarises our plan for data collection.
Summary of Plan for Data collection
3.6 What are our timelines?
Our timelines for data collection are summarised in the Table below.
Appendix
A.1 Guidelines for MT data collection
We have requested Prof. Pushpak to share guidelines from past projects.
A.2 Guidelines for ASR data collection
We have requested Prof. Umesh to share guidelines from past projects.
A.3 Guidelines for TTS data collection
We have requested Prof. Hema to share guidelines from past projects.
A.4 Guidelines for OCR data collection