2022 Q1 Report

6. Solutions

In the first year, we will focus on developing two solutions: Anuvaad (translation) and Chitralekha (same-language subtitling).

6.1 Chitralekha - Indian Language Subtitling Tool

6.1.1 Background

Same Language Subtitling (SLS) is an evidence-based Indian innovation that started in 1996 as a research project at the Indian Institute of Management, Ahmedabad (IIM-A). SLS is globally the first initiative that guarantees daily, lifelong reading practice to the 100 crore (1 billion) TV viewers in India who watch 4 hours of TV every day. SLS is the idea of subtitling Audio-Visual (AV) content in the ‘same’ language as the audio: what you hear is what you read. In other words, SLS asks for Hindi subtitles on Hindi content, Tamil subtitles on Tamil content, and likewise for all existing and popularly watched Indian-language content.
In India, some 1,500 films are released every year. Close to a billion Indians currently watch nearly four and a half hours of TV every day. Per capita TV viewing is staggering and has grown 6.7% annually since 2018, with seventy-seven percent of TV time spent on entertainment. The TV screen is likely to command lifelong attention for BIRD’s primary target group of 600 million weak readers and its secondary target group of 250 million non-readers who could improve but will still be weak readers in the future.
A universal application of SLS to the dialogue and songs of films, serials, and similar entertainment content in the media system of the future is the equivalent of switching on reading practice for a billion viewers in their own language, conservatively for 2+ hours per day. Over an average Indian lifespan of 70 years, no other system, not even schools, can match the quantum of reading time that SLS delivers.

6.1.2 Proposed solution

We propose to design Chitralekha as a cloud-powered tool for accurate and scalable subtitling in Indian languages. Chitralekha will allow a user to import a video, either from local file storage or from an online platform like YouTube. It will then run AI4Bharat’s ASR models to generate subtitles along with timestamps. The UI will visualize the audio waveform and mark on it the timestamps and text for each entry in the subtitle file, optionally also visualizing the ASR engine’s confidence in the transcription. The transcriber can manually adjust the timestamps on the audio timeline, and can edit each subtitle entry either by choosing an alternative phrase suggested by the ASR engine or by typing out content using the transliteration engine. The application will auto-save all changes made by the transcriber, and then allow a second human pass, if required, to go through the saved content and make further corrections.
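To make this flow concrete, below is a minimal sketch in Python of the per-segment data Chitralekha would carry and its export to the standard SRT subtitle format. The SubtitleEntry structure is our own illustration; the actual schema and the ASR model interfaces are not finalized.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SubtitleEntry:
    start: float                 # segment start time in seconds
    end: float                   # segment end time in seconds
    text: str                    # transcribed text in the source language
    confidence: float = 1.0      # ASR confidence, optionally shown in the UI
    alternatives: List[str] = field(default_factory=list)  # alternative ASR phrases

def format_srt_time(t: float) -> str:
    """Render seconds as an SRT timestamp, HH:MM:SS,mmm."""
    h, rem = divmod(t, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{int((s % 1) * 1000):03d}"

def to_srt(entries: List[SubtitleEntry]) -> str:
    """Serialize subtitle entries into the standard SRT format."""
    blocks = []
    for i, e in enumerate(entries, start=1):
        blocks.append(
            f"{i}\n{format_srt_time(e.start)} --> {format_srt_time(e.end)}\n{e.text}\n"
        )
    return "\n".join(blocks)

# Example: one Hindi segment, as ASR output might populate it.
print(to_srt([SubtitleEntry(0.0, 2.5, "नमस्ते दुनिया", confidence=0.91)]))
```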
One of the significant advantages of AI engines is that they improve as more quality data is provided to them, and we want to architect this as a feature in Chitralekha. Specifically, as transcribers interact with the tool and correct errors in the ASR engine’s transcription, or as they use the transliteration options, data can be logged to re-train and improve the accuracy of ASR and transliteration. The system should thus be designed from the start to evolve as transcribers use it.
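As a sketch of what such feedback logging might look like, the JSONL record below pairs each ASR hypothesis with the transcriber’s correction; the schema and the log_correction helper are assumptions for illustration, not a committed design.

```python
import json
import time

def log_correction(log_path: str, asr_hypothesis: str, corrected_text: str,
                   audio_segment_id: str, language: str) -> None:
    """Append one (hypothesis, correction) pair to a JSONL log.
    These pairs can later be joined with the stored audio segments and
    used to fine-tune the ASR and transliteration models."""
    record = {
        "timestamp": time.time(),
        "language": language,
        "segment_id": audio_segment_id,
        "hypothesis": asr_hypothesis,   # what the ASR engine produced
        "correction": corrected_text,   # what the transcriber saved
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```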
[Figure: Chitralekha system architecture]
In sum, the architecture of Chitralekha can be summarized as follows. The application runs in a web browser and supports a maker-checker flow through two human passes. The app itself is hosted on a public cloud and maintains all content there. It also interfaces with the ASR and transliteration engines, which likewise run on cloud instances. The interactions between transcribers and Chitralekha are captured in logs and sent to a GPU cluster, which can periodically retrain the models and update the engines that Chitralekha accesses.
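For illustration, the two-pass maker-checker flow could be modeled as a simple status progression; the states and transitions below are an assumption about how the workflow might be encoded, not a committed schema.

```python
from enum import Enum

class SubtitleStatus(Enum):
    """Lifecycle of a subtitle file in the maker-checker flow."""
    MACHINE_GENERATED = "machine_generated"  # raw ASR output
    IN_TRANSCRIPTION = "in_transcription"    # first human pass (maker)
    IN_REVIEW = "in_review"                  # second human pass (checker)
    APPROVED = "approved"                    # ready for publishing

# Allowed moves between stages; the checker may send work back to the maker.
ALLOWED_TRANSITIONS = {
    SubtitleStatus.MACHINE_GENERATED: {SubtitleStatus.IN_TRANSCRIPTION},
    SubtitleStatus.IN_TRANSCRIPTION: {SubtitleStatus.IN_REVIEW},
    SubtitleStatus.IN_REVIEW: {SubtitleStatus.IN_TRANSCRIPTION,
                               SubtitleStatus.APPROVED},
    SubtitleStatus.APPROVED: set(),
}

def advance(current: SubtitleStatus, target: SubtitleStatus) -> SubtitleStatus:
    """Move a subtitle file to the next stage, rejecting invalid jumps."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Cannot move from {current.value} to {target.value}")
    return target
```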

6.1.3 Open-source commitment

All software artifacts created in the project will be open-sourced, allowing adoption in the movie industry and within government and NGOs to support SLS more widely. Thus, the ASR and transliteration engines as well as the Chitralekha interface and backend will all be open-sourced. In addition, we aim to open-source the media content and the generated subtitles as far as possible. Specifically, we expect content owners to agree to open-source a small fraction (about 5%) of the subtitles and the corresponding audio, sampled randomly from the content. The primary aim is to ensure that external efforts to improve AI engines for Indian languages, for applications ranging from healthcare to finance, can benefit from the data collected in this project.
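As one way such a random 5% sample could be drawn reproducibly (the segment-ID structure and the seed choice here are illustrative assumptions):

```python
import random

def sample_open_release(segment_ids: list, fraction: float = 0.05,
                        seed: int = 42) -> list:
    """Randomly sample a fixed fraction of subtitle/audio segment IDs
    for open release. A fixed seed keeps the sample reproducible and
    auditable by content owners."""
    if not segment_ids:
        return []
    rng = random.Random(seed)
    k = max(1, round(len(segment_ids) * fraction))
    return sorted(rng.sample(segment_ids, k))
```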

6.1.4 Activities Planned


The technology part of the project can be split into three components:
Building ASR engines that have a word error rate (WER) of around 15% for each supported language: prior work on ASR-assisted subtitling identifies a WER of 25% as the upper bound beyond which ASR is not useful during subtitling, and shows that bringing down the WER increases transcriber productivity. We thus aim to bring the WER for all supported languages down to around 15%, i.e., out of 100 transcribed words a maximum of 15 would require editing (a sketch of how WER is computed appears after this list). This activity requires collecting more sources of speech data in each supported language and training more accurate AI engines with them.
Building transliteration engines that have an accuracy of 85% in the top-5 recommendations in each language: as discussed, the transliteration engine shows multiple variants of a word typed in Roman characters by a transcriber, and the transcriber then chooses one among these suggestions. We aim to improve the transliteration engines so that the word intended by the transcriber is among the list of 5 suggestions 85% of the time (see the metric sketch after this list). This will require collecting more transliteration pairs and training more accurate transliteration models.
Building the Chitralekha user interface and system architecture: as discussed, the value of the Chitralekha application depends sensitively on the efficiency and ease with which transcribers are able to subtitle audio/video. The user interface therefore needs to be usable, intuitive, efficient, and accessible; we will use modern web technologies and composable backend solutions. The system architecture will also support iteratively improving the quality of the ASR and transliteration engines based on logs of data from the usage of Chitralekha, which will require setting up efficient model retraining on GPU clusters.
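To make the two quality targets above measurable, here is a minimal sketch of the corresponding evaluation metrics in Python. The function names are ours, and a real evaluation would use language-appropriate tokenization rather than whitespace splitting.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def top_k_accuracy(pairs: list, k: int = 5) -> float:
    """Fraction of (intended_word, suggestions) pairs where the intended
    word appears among the top-k transliteration suggestions."""
    hits = sum(1 for intended, suggestions in pairs if intended in suggestions[:k])
    return hits / max(len(pairs), 1)
```

Under the targets above, a language would pass when word_error_rate averages around 0.15 over a held-out test set and top_k_accuracy with k=5 reaches 0.85.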

6.1.5 Implementation Plan


We plan to roll out our tech solution for SLS in two phases, mainly to ensure an error-free, high-quality implementation with smooth processes.
Phase 1: We plan to start with the top 5 languages, namely Hindi, Telugu, Tamil, Bengali, and Marathi. These five languages have the most speakers as well as the largest movie industries in India. Timeframe: 18 months.
Phase 2 & Future Scale Up: In this phase, we will add the next 7 most popular languages in terms of speakers and size of the movie industry namely Gujarati, Urdu, Kannada, Odia, Malayalam, Punjabi, and Assamese. As per the 2011 census, the 12 languages we cover in phases 1 and 2 will reach about in India.
For each of the 12 languages in phases 1 and 2, our aim is to subtitle 1,000 hours of content, i.e., 12,000 hours in total. At this scale, ours will be the largest effort in Indian-language subtitling and one of the largest efforts in human-computer interaction with AI tools for Indian language technologies.
Our long-term goal is to ensure the entire TV and film industry has robust technology for large-scale implementation of SLS in all languages in India. In phases 1 and 2 alone, we will already reach in the range of 350 million weak readers and people with hearing disabilities. By expanding SLS to all languages in the country, we can turn on reading practice, language learning, and media accessibility for 600 million weak readers and another 63 million deaf and hard-of-hearing people.