The performance of ASR systems can degrade significantly when the test conditions differ from training.
Domain Adaptation: Adapt ASR models to a target domain whose content differs from the source domain on which the models were trained.
Customization / Contextualization: Leverage a specific user's context, such as contacts, location, and music playlists, to significantly boost ASR accuracy for that user.
Speaker Adaptation: Adapt ASR models to better recognize a target speaker’s speech.
Domain Adaptation
Challenges
E2E models tend to memorize the training data well, so their performance usually degrades substantially in a new domain.
A big challenge in adapting E2E models to a new domain is that it is not easy to get enough paired speech-text data there; text-only data in the new domain, however, is much easier to obtain.
Approaches
Shallow Fusion of LM: The external LM and the ASR model remain separate, and only their scores are interpolated during decoding, similar to an ensemble (most popular; see the sketch after this list)
Deep Fusion of LM: The external LM is fused directly into the ASR model by combining their hidden states, resulting in a single, tightly integrated model (assumes a neural LM)
Cold Fusion of LM: Deep fusion uses late integration, where the ASR model and the LM are trained separately and then combined, whereas cold fusion uses the external pretrained LM from the very start of ASR model training. Note that such early-integration approaches are computationally costly if either of the two models changes frequently.
TTS-based Adaptation: LM fusion methods require interpolation with an external LM, which increases both computational cost and model footprint. With the advance of TTS technologies, a new trend is to adapt E2E models with synthesized speech generated from new-domain text.
This is especially useful for adapting RNN-T models.
Drawbacks:
TTS speech differs from real speech; adapting on it sometimes even degrades recognition accuracy on real speech.
Speaker variation in TTS data is far less than in real data.
The cost of training a multi-speaker TTS model and generating synthesized speech from it is large.
Recent works address these drawbacks
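As referenced in the approach list above, here is a minimal sketch of shallow fusion at a single decoding step. It assumes per-token log-probabilities are available from both models; the names shallow_fusion_step and lm_weight are illustrative, not from any of the papers below.

    import numpy as np

    def shallow_fusion_step(asr_log_probs, lm_log_probs, lm_weight=0.3):
        """Interpolate per-token scores of a frozen ASR model and an
        external LM for one beam-search step; the two models stay
        entirely separate."""
        return asr_log_probs + lm_weight * lm_log_probs

    # Toy usage: a uniform ASR distribution is tipped toward the
    # in-domain token preferred by the external LM.
    asr = np.log(np.full(5, 0.2))                        # uniform ASR scores
    lm = np.log(np.array([0.05, 0.6, 0.2, 0.1, 0.05]))   # LM prefers token 1
    print(int(np.argmax(shallow_fusion_step(asr, lm))))  # -> 1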
Papers
A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition (SLT 2018; Toyota Technological Institute at Chicago, Google Inc)
Results show that almost all of the LM integration approaches improve over the baseline encoder-decoder model for all data sets, confirming the benefit of utilizing unpaired text
The rather simple approach of shallow fusion works best for first-pass decoding on all of their data sets
Deep fusion does not scale well, obtaining no or negligible gains over the baseline on the large-scale Google data sets
Internal language model estimation for domain-adaptive end-to-end speech recognition (SLT 2021; Microsoft Corporation, Redmond) [ source code unavailable ]
Propose an internal LM estimation (ILME) method to facilitate more effective integration of an external LM with pre-existing E2E models, including the most popular RNN-T and AED models, with no additional model training.
Trained with audio-transcript pairs, an E2E model implicitly learns an internal LM that characterizes the training data in the source domain.
The internal LM scores of an E2E model are estimated and subtracted from the log-linear interpolation between the scores of the E2E model and the external LM
ILME reduces the WER of shallow fusion by 8.1%-15.5% relative in cross-domain evaluations and by 2.4%-6.8% relative in intra-domain evaluations.
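A sketch of the decoding objective in our own notation, where the interpolation weights lambda_LM and lambda_ILM are tunable:

    \[
    \hat{Y} = \arg\max_{Y} \Big[ \log P_{\mathrm{E2E}}(Y \mid X)
            + \lambda_{\mathrm{LM}} \log P_{\mathrm{LM}}(Y)
            - \lambda_{\mathrm{ILM}} \log P_{\mathrm{ILM}}(Y) \Big]
    \]

Here P_ILM is the internal LM implicitly learned by the E2E model; the paper estimates it by evaluating the E2E decoder with the acoustic context removed (e.g., zeroed out).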
Customization
Challenges
E.g., an English ASR system usually cannot recognize an Indian user's contact names well, since such names are rare or absent in its training data. However, if the system is presented with this user's contact list, the ASR output can be biased toward those contact names.
Approaches / Papers
Provide the context to the model as a biasing list (a sketch follows the papers below)
Deep context: end-to-end contextual speech recognition (SLT 2018; Google Inc)
Contextual RNN-T for open domain ASR (Interspeech 2020; Facebook AI)
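As referenced above, a minimal sketch of attention-based contextual biasing in the spirit of these papers, assuming the bias phrases (e.g., contact names) have already been embedded into fixed-size vectors; the module name BiasAttention and all dimensions are illustrative.

    import torch
    import torch.nn as nn

    class BiasAttention(nn.Module):
        """Attend from the decoder state over bias-phrase embeddings and
        return a context vector to concatenate with the decoder input."""

        def __init__(self, state_dim, bias_dim, attn_dim):
            super().__init__()
            self.query = nn.Linear(state_dim, attn_dim)
            self.key = nn.Linear(bias_dim, attn_dim)
            self.value = nn.Linear(bias_dim, attn_dim)

        def forward(self, dec_state, bias_embs):
            # dec_state: (batch, state_dim); bias_embs: (batch, n_bias, bias_dim)
            q = self.query(dec_state).unsqueeze(1)           # (batch, 1, attn_dim)
            k, v = self.key(bias_embs), self.value(bias_embs)
            scores = torch.bmm(q, k.transpose(1, 2)) / k.size(-1) ** 0.5
            weights = torch.softmax(scores, dim=-1)          # over bias phrases
            return torch.bmm(weights, v).squeeze(1)          # (batch, attn_dim)

    # Toy usage: 8 bias phrases embedded into 64-dim vectors.
    attn = BiasAttention(state_dim=256, bias_dim=64, attn_dim=128)
    ctx = attn(torch.randn(2, 256), torch.randn(2, 8, 64))
    print(ctx.shape)  # torch.Size([2, 128])

Deep Context, for example, also adds a learnable no-bias entry to the list so the mechanism can be ignored when no phrase is relevant.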
Speaker Adaptation
Challenges
The biggest challenge of speaker adaptation is that the amount of adaptation data available from the target speaker is usually very small
Approaches
Finetuning approach: Finetune on the target speaker's data with additional regularization techniques such as weight decay, KL-divergence regularization, etc. (see the sketch after this list)
TTS-based approach: Use additional synthesized personalized speech to relieve the data-sparsity issue in rapid adaptation and to convert error-prone unsupervised adaptation into pseudo-supervised adaptation (detailed in the second paper below)
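As referenced above, a minimal sketch of KLD-regularized finetuning on frame-level output distributions; kld_regularized_loss and rho are illustrative names, and this loss-level interpolation is one common formulation (equivalent, up to a constant, to interpolating the target distributions).

    import torch
    import torch.nn.functional as F

    def kld_regularized_loss(adapted_log_probs, si_probs, targets, rho=0.5):
        """CE on the target speaker's labels, interpolated with a KL term
        that keeps the adapted model close to the frozen speaker-independent
        (SI) model's outputs.

        adapted_log_probs: (batch, time, vocab) log-probs being adapted
        si_probs:          (batch, time, vocab) probs of the frozen SI model
        targets:           (batch, time) label indices
        """
        ce = F.nll_loss(adapted_log_probs.flatten(0, 1), targets.flatten())
        kl = F.kl_div(adapted_log_probs.flatten(0, 1),
                      si_probs.flatten(0, 1),
                      reduction='batchmean')  # KL(p_SI || p_adapted)
        return (1.0 - rho) * ce + rho * kl

    # Toy usage with random tensors.
    logp = torch.log_softmax(torch.randn(2, 10, 30), dim=-1)
    si = torch.softmax(torch.randn(2, 10, 30), dim=-1)
    tgt = torch.randint(0, 30, (2, 10))
    print(kld_regularized_loss(logp, si, tgt).item())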
Papers
Speaker Adaptation for end-to-end CTC Models (SLT 2018; Microsoft AI and Research, JHU)
Propose two approaches for speaker adaptation
Kullback-Leibler divergence (KLD) regularization
Multi-task learning (MTL)
Using personalized speech synthesis and neural language generator for rapid speaker adaptation (ICASSP 2020; Microsoft Corporation)
Relieves the general data-sparsity issue in rapid adaptation by making use of additional synthesized personalized speech
Circumvents explicit labeling errors in unsupervised adaptation by converting it to pseudo-supervised adaptation