Memory problems: Suppose we have T=1000, U=100, L=1000 labels, and batch size B=32. Then to store h^(t,u) for all (t,u) to run the forward-backward algorithm, we need a tensor with B×T×U×L = 3,200,000,000 entries, or 12.8 GB if we're using single-precision floats. ^ CTC and LAS have two-dimensional lattices (i.e., dependent on T and L only), but RNN-T's lattice probabilities are three-dimensional (dependent on T, U, and L).
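As a quick sanity check of the numbers above, a minimal Python calculation (variable names mirror the symbols in the text):

```python
# Sanity check of the lattice memory estimate: B x T x U x L single-precision
# floats for the joint output h^(t,u) over the whole batch.
B, T, U, L = 32, 1000, 100, 1000
num_elements = B * T * U * L          # 3,200,000,000 entries
bytes_fp32 = num_elements * 4         # 4 bytes per float32
print(f"{num_elements:,} elements -> {bytes_fp32 / 1e9:.1f} GB in fp32")
# -> 3,200,000,000 elements -> 12.8 GB in fp32
```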
Choice of subwords vs characters as output units: The number of labels L increases if we use subwords as output units, which further inflates the memory estimate above. Hence characters are more popular.
Can a pretrained LM be directly used for the text predictor?
Looks like this doesn't work well, as the LM is not trained to produce blank symbols. A proposed approach splits the text predictor into two parts: one for predicting the blank symbol and another for the output labels (which is similar to an LM). Tools: a very recent paper (Jan 2022) covers some details on RNN-T architecture, losses, and a note on recent toolkits.
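A minimal PyTorch-style sketch of the split-predictor idea, not the exact architecture from the paper; the class and layer names (SplitPredictor, label_rnn, blank_rnn) are hypothetical:

```python
import torch.nn as nn

class SplitPredictor(nn.Module):
    """Sketch: split the RNN-T prediction network into a label branch
    (LM-like, so it could be initialized from a pretrained LM) and a
    separate branch that only scores the blank symbol."""

    def __init__(self, vocab_size: int, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.label_rnn = nn.LSTM(hidden, hidden, batch_first=True)  # LM-like part
        self.blank_rnn = nn.LSTM(hidden, hidden, batch_first=True)  # blank-only part

    def forward(self, prev_labels):
        x = self.embed(prev_labels)
        label_h, _ = self.label_rnn(x)  # feeds the joint net's label logits
        blank_h, _ = self.blank_rnn(x)  # feeds the joint net's blank logit
        return label_h, blank_h
```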
Since the key output is letters, Indian languages may have an advantage: each letter is mapped to a sound.
Is there a way to pretrain the text encoder?
What are the sequence length and batch size for training? Both will significantly increase the GPU memory required.
Can we do checkpointing when training RNN-T models to reduce the memory requirement?
Personalizing models the way Mahaveer presented seems to be deep fusion; we need to ensure that we train with this module in place. An alternative is shallow fusion, where we bias the probabilities in the beam-search decoder by using a trie built from a contextual dictionary; we can even bias the terms in the contextual dictionary based on probability (see the sketch after this list).
What are the ways of restricting the lattice complexity?
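A minimal sketch of the shallow-fusion-style contextual biasing mentioned above (plain Python, not any specific toolkit's API): tokens that continue a phrase in a trie built from the contextual dictionary receive a log-probability bonus during beam search. The function names and the bonus_scale parameter are illustrative assumptions.

```python
import math

def build_trie(phrases, weights=None):
    """phrases: list of token-id sequences; weights: optional per-phrase bonus."""
    root = {}
    for i, phrase in enumerate(phrases):
        node = root
        for tok in phrase:
            node = node.setdefault(tok, {})
        node["weight"] = weights[i] if weights else 1.0
    return root

def biased_log_probs(log_probs, trie_node, bonus_scale=2.0):
    """Add a bonus to tokens that extend a contextual phrase from trie_node.

    log_probs: dict mapping token id -> log-probability from the decoder.
    """
    biased = dict(log_probs)
    for tok, child in trie_node.items():
        if tok == "weight":
            continue
        w = child.get("weight", 1.0)
        biased[tok] = biased.get(tok, -math.inf) + bonus_scale * w
    return biased

# Usage idea: each beam hypothesis tracks its current trie node; when it emits
# a token present in that node it advances into the child, otherwise it resets
# to the root. Phrase weights let us bias dictionary terms by probability.
```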
Plan to train an Indic RNN-T ASR model
List the steps required for training a model:
Choice of framework - ESPnet (see the comparison of features among different toolkits)
Choice of hyperparameters such as sequence length, etc.
Multilingual or monolingual
Identify hardware / training time required
Literature survey on RNN-T for Indian languages:
Languages (utts or num hours)
Architecture (hidden units)