Memory problems: Suppose we have T=1000, U=100, L=1000 labels, and batch size B=32. Then to store h^(t,u) for all (t,u) to run the forward-backward algorithm, we need a tensor with B×T×U×L = 3,200,000,000 entries, or 12.8 GB if we're using single-precision floats. ^ CTC and LAS have two-dimensional lattices (i.e., dependent on T and L only), but RNN-T's lattice probabilities are three-dimensional (dependent on T, U, and L).
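As a quick sanity check of the numbers above, a minimal Python calculation (variable names mirror the symbols in the text):

```python
# Sanity check of the lattice memory estimate: B x T x U x L single-precision
# floats for the joint output h^(t,u) over the whole batch.
B, T, U, L = 32, 1000, 100, 1000
num_elements = B * T * U * L          # 3,200,000,000 entries
bytes_fp32 = num_elements * 4         # 4 bytes per float32
print(f"{num_elements:,} elements -> {bytes_fp32 / 1e9:.1f} GB in fp32")
# -> 3,200,000,000 elements -> 12.8 GB in fp32
```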
Choice of subwords vs characters as output units: The number of labels L increases if we use subwords as output units, which further inflates the memory estimate above. Hence characters are more popular.
Can a pretrained LM be directly used for the text predictor?
Looks like this doesn't work well, as the LM is not trained to produce blank symbols. A proposed approach splits the text predictor into two parts: one for predicting the blank symbol and another for the output labels (which is similar to an LM). Tools: a very recent paper (Jan 2022) covers some details on RNN-T architecture, losses, and a note on recent toolkits.
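A minimal PyTorch-style sketch of the split-predictor idea, not the exact architecture from the paper; the class and layer names (SplitPredictor, label_rnn, blank_rnn) are hypothetical:

```python
import torch.nn as nn

class SplitPredictor(nn.Module):
    """Sketch: split the RNN-T prediction network into a label branch
    (LM-like, so it could be initialized from a pretrained LM) and a
    separate branch that only scores the blank symbol."""

    def __init__(self, vocab_size: int, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.label_rnn = nn.LSTM(hidden, hidden, batch_first=True)  # LM-like part
        self.blank_rnn = nn.LSTM(hidden, hidden, batch_first=True)  # blank-only part

    def forward(self, prev_labels):
        x = self.embed(prev_labels)
        label_h, _ = self.label_rnn(x)  # feeds the joint net's label logits
        blank_h, _ = self.blank_rnn(x)  # feeds the joint net's blank logit
        return label_h, blank_h
```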
Since the key output is letters, Indian languages may have an advantage: each letter is mapped to a sound.
Is there a way to pretrain the text encoder?
What are the sequence length and batch size for training? Both will significantly increase the GPU memory required.
Can we do checkpointing when training RNN-T models to reduce the memory requirement?
Personalizing models the way Mahaveer presented seems to be deep fusion; we need to ensure that we train with this module in place. An alternative is shallow fusion, where we bias the probabilities in the beam-search decoder by using a trie built from a contextual dictionary; we can even bias the terms in the contextual dictionary based on probability (see the sketch after this list).
What are the ways of restricting the lattice complexity?
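A minimal sketch of the shallow-fusion-style contextual biasing mentioned above (plain Python, not any specific toolkit's API): tokens that continue a phrase in a trie built from the contextual dictionary receive a log-probability bonus during beam search. The function names and the bonus_scale parameter are illustrative assumptions.

```python
import math

def build_trie(phrases, weights=None):
    """phrases: list of token-id sequences; weights: optional per-phrase bonus."""
    root = {}
    for i, phrase in enumerate(phrases):
        node = root
        for tok in phrase:
            node = node.setdefault(tok, {})
        node["weight"] = weights[i] if weights else 1.0
    return root

def biased_log_probs(log_probs, trie_node, bonus_scale=2.0):
    """Add a bonus to tokens that extend a contextual phrase from trie_node.

    log_probs: dict mapping token id -> log-probability from the decoder.
    """
    biased = dict(log_probs)
    for tok, child in trie_node.items():
        if tok == "weight":
            continue
        w = child.get("weight", 1.0)
        biased[tok] = biased.get(tok, -math.inf) + bonus_scale * w
    return biased

# Usage idea: each beam hypothesis tracks its current trie node; when it emits
# a token present in that node it advances into the child, otherwise it resets
# to the root. Phrase weights let us bias dictionary terms by probability.
```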
Plan to train an Indic RNN-T ASR model
List the steps required for training a model:
Choice of framework - ESPnet (see the comparison of features among different toolkits)
Choice of hyperparameters such as sequence length, etc.
Multilingual or monolingual
Identify hardware / training time required
Literature survey on RNN-T for Indian languages:
Languages (utts or num hours)
Architecture (hidden units)