IndicASR

RNN-T


Memory problems: Suppose we have T=1000 frames, U=100 label positions, L=1000 labels, and batch size B=32. Then to store h^(t,u) for all (t,u) when running the forward-backward algorithm, we need a tensor with B×T×U×L = 3,200,000,000 entries, or 12.8 GB if we're using single-precision floats.
^ CTC and LAS have two-dimensional lattices (i.e. dependent on T and L only), but RNN-T's lattice probabilities are three-dimensional (dependent on T, U and L).
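A quick back-of-the-envelope check of this number (a minimal sketch; the values simply mirror the example above):

```python
# Rough memory estimate for the RNN-T output lattice h^(t,u).
# B, T, U, L mirror the example above and are illustrative only.
B, T, U, L = 32, 1000, 100, 1000     # batch, frames, label positions, vocab size
bytes_per_float = 4                  # single precision (fp32)

elements = B * T * U * L             # 3,200,000,000 entries
print(f"{elements:,} entries -> {elements * bytes_per_float / 1e9:.1f} GB in fp32")
# 3,200,000,000 entries -> 12.8 GB in fp32

# For comparison, a CTC/LAS-style 2-D lattice over (T, L) only:
ctc_elements = B * T * L
print(f"2-D lattice: {ctc_elements * bytes_per_float / 1e9:.2f} GB")
# 2-D lattice: 0.13 GB
```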
Choice of subwords vs characters as output units: the number of labels L increases if we use subwords as output units. Hence characters are more popular.
Can a pretrained LM be directly used for the text predictor?
Looks like this doesn't work well, as the LM is not trained to produce blank symbols. One paper proposes splitting the text predictor into 2 parts - one for predicting blank and another for the output labels (the latter being similar to an LM).
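A minimal PyTorch sketch of that split (layer types, sizes, and how the blank score is combined are assumptions for illustration, not the paper's exact design): the label branch looks like an ordinary LM and could in principle be initialized from a pretrained one, while a small separate branch scores the blank symbol.

```python
import torch
import torch.nn as nn


class SplitPredictionNetwork(nn.Module):
    """Text predictor split into an LM-like label branch and a small blank branch.

    Illustrative sketch only; the exact way these outputs are combined with the
    encoder output in the joint network varies across papers.
    """

    def __init__(self, vocab_size: int, embed_dim: int = 512, hidden_dim: int = 1024):
        super().__init__()
        # LM-like branch over previous non-blank labels; this is the part that
        # could be initialized from a pretrained language model.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.label_lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.label_out = nn.Linear(hidden_dim, vocab_size)
        # Separate, much smaller branch that only scores the blank symbol.
        self.blank_lstm = nn.LSTM(embed_dim, hidden_dim // 4, batch_first=True)
        self.blank_out = nn.Linear(hidden_dim // 4, 1)

    def forward(self, prev_labels: torch.Tensor) -> torch.Tensor:
        emb = self.embed(prev_labels)              # (B, U, E)
        label_h, _ = self.label_lstm(emb)
        blank_h, _ = self.blank_lstm(emb)
        label_logits = self.label_out(label_h)     # (B, U, V): scores for real labels
        blank_logit = self.blank_out(blank_h)      # (B, U, 1): score for blank
        return torch.cat([blank_logit, label_logits], dim=-1)


# Usage: the returned logits would feed the joint network with the encoder output.
pred = SplitPredictionNetwork(vocab_size=64)
dummy = torch.randint(0, 64, (2, 10))              # (batch, U) previous labels
print(pred(dummy).shape)                           # torch.Size([2, 10, 65])
```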
Tools: a very recent paper (Jan 2022) covers some details on RNN-T architecture, losses, and a note on recent toolkits.

Pratyush’s thoughts
Since the key output is letters, Indian languages may have an advantage - each letter is mapped to a sound
Is there a way to pretrain the text encoder?
What are the sequence length and batch size for training? Both will significantly increase the GPU memory required.
Can we do checkpointing when training RNN-T models to reduce the memory requirement? (see the first sketch after this list)
Personalizing models the way Mahaveer presented seems to be deep fusion - we need to ensure that we train with this module in place. An alternative is shallow fusion, where we bias the probabilities in the beam-search decoder using a trie built from a contextual dictionary; we can even weight the terms in the contextual dictionary by their probability (see the second sketch after this list).
What are the ways of restricting the lattice complexity?
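On the checkpointing question above, a minimal sketch using PyTorch's torch.utils.checkpoint (the layer stack and sizes are assumed for illustration): the wrapped layers drop their activations after the forward pass and recompute them during backward, trading compute for memory.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedLSTMEncoder(nn.Module):
    """LSTM encoder stack with activation (gradient) checkpointing per layer.

    Illustrative sketch only; a real RNN-T encoder would also handle padded
    sequence lengths, subsampling, etc.
    """

    def __init__(self, input_dim: int = 80, hidden_dim: int = 1024, num_layers: int = 6):
        super().__init__()
        dims = [input_dim] + [hidden_dim] * num_layers
        self.layers = nn.ModuleList(
            nn.LSTM(dims[i], dims[i + 1], batch_first=True) for i in range(num_layers)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        x = feats                                   # (B, T, input_dim)
        for layer in self.layers:
            if self.training:
                # Drop this layer's activations after the forward pass and
                # recompute them during backward, trading compute for memory.
                x = checkpoint(lambda inp, l=layer: l(inp)[0], x, use_reentrant=False)
            else:
                x, _ = layer(x)
        return x                                    # (B, T, hidden_dim)


# Usage: the memory saved grows with the sequence length T and the number of layers.
enc = CheckpointedLSTMEncoder()
out = enc(torch.randn(4, 500, 80))                  # batch of 4, 500 frames
print(out.shape)                                    # torch.Size([4, 500, 1024])
```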
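And on the shallow-fusion alternative, a toy sketch of biasing hypothesis scores with a trie built from a contextual dictionary (the entries, bonus weights, and the way the bonus is applied are assumptions for illustration; production systems typically apply partial, per-token bonuses during beam search):

```python
# Toy trie over label sequences for contextual biasing in beam search.
# Entries and weights below are hypothetical examples.

class BiasTrie:
    def __init__(self):
        self.children = {}
        self.bonus = 0.0        # log-prob bonus granted when a full entry is matched

    def insert(self, tokens, bonus):
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, BiasTrie())
        node.bonus = bonus

    def score(self, tokens):
        """Return the bonus if `tokens` exactly matches a dictionary entry, else 0."""
        node = self
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                return 0.0
        return node.bonus


# Build a trie from a (hypothetical) contextual dictionary, weighting entries
# by how likely we believe they are.
trie = BiasTrie()
trie.insert(list("mahaveer"), bonus=2.0)
trie.insert(list("gowtham"), bonus=1.5)

# Inside the beam-search decoder, each hypothesis score would be adjusted as:
hyp_log_prob = -7.3                       # score from the RNN-T model (made up)
hyp_tokens = list("mahaveer")             # characters emitted so far
print(hyp_log_prob + trie.score(hyp_tokens))   # ≈ -5.3
```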

Plan to train an Indic RNN-T ASR model

@Gowtham Ramesh
List down the steps required for training a model
Choice of framework - ESPnet (see the toolkit comparison referenced under Tools above)
Choice of hyperparams like sequence length, etc.
Choice of dataset
Multilingual or monolingual
Identify hardware / training time required

Literature survey on RNN-T for Indian languages:
RNN-T approaches
| Paper | Organization | Dataset | Languages (utts or hours) | Architecture (hidden units) | Params |
| --- | --- | --- | --- | --- | --- |
| 1 | Microsoft India | Internal | en (10K h), hi (10K h), ta (1K h), gu (1.3K h) | enc: 6-layer LSTM (1024), dec: 2-layer LSTM (1024) | 70-110M |
| 2 | Microsoft India | Internal | en-US (65K h), hi (4M u) | enc: 6-layer LSTM (1600), dec: 2-layer LSTM (1600) | 105M |
| 3 | Google | Internal | hi (16M u), mr (4M), bn (3.9M), te (2.4M), gu (2.2M), ta (1.8M), ml (1.5M), kn (1.2M), ur (0.44M) | enc: 8-layer LSTM (2048), dec: 2-layer LSTM (2048) | 120M-140M (2.5M params per language adapter) |
