The performance of ASR systems can degrade significantly when the test conditions differ from training.
Domain Adaptation: Adapt ASR models to a target domain whose content differs from the source domain on which the models were trained.
Customization / Contextualization: Leverage a specific user's context, such as contacts, location, and music playlists, to significantly boost ASR accuracy for that user.
Speaker Adaptation: Adapt ASR models to better recognize a target speaker’s speech.
Domain Adaptation
Challenges
E2E models tend to memorize the training data well, so their performance usually degrades substantially in a new domain.
A big challenge in adapting E2E models to a new domain is that it is not easy to get enough paired speech-text data there; text-only data in the new domain, however, is much easier to obtain.
Approaches
Shallow Fusion of LM: The external LM and the ASR model remain separate, and only their scores are interpolated during decoding, similar to an ensemble (most popular; see the sketch after this list)
Deep Fusion of LM: The external LM is fused directly into the ASR model by combining their hidden states, resulting in a single, tightly integrated model (assumes a neural LM)
Cold Fusion of LM: Deep fusion uses late integration, where the ASR model and the LM are trained separately and then combined, whereas cold fusion uses the external pretrained LM from the very start of ASR model training. Note that such early-integration approaches are computationally costly if either of the two models changes frequently.
TTS-based Adaptation: LM fusion methods require interpolation with an external LM, which increases both computational cost and model footprint. With the advance of TTS technologies, a new trend is to adapt E2E models with synthesized speech generated from new-domain text.
This is especially useful for adapting RNN-T models.
Drawbacks:
TTS speech differs from real speech; adapting on it sometimes even degrades recognition accuracy on real speech.
Speaker variation in TTS data is far less than in real data.
The cost of training a multi-speaker TTS model and generating synthesized speech from it is large.
Recent works address these drawbacks
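As referenced in the approach list above, here is a minimal sketch of shallow fusion at a single decoding step. It assumes per-token log-probabilities are available from both models; the names shallow_fusion_step and lm_weight are illustrative, not from any of the papers below.

    import numpy as np

    def shallow_fusion_step(asr_log_probs, lm_log_probs, lm_weight=0.3):
        """Interpolate per-token scores of a frozen ASR model and an
        external LM for one beam-search step; the two models stay
        entirely separate."""
        return asr_log_probs + lm_weight * lm_log_probs

    # Toy usage: a uniform ASR distribution is tipped toward the
    # in-domain token preferred by the external LM.
    asr = np.log(np.full(5, 0.2))                        # uniform ASR scores
    lm = np.log(np.array([0.05, 0.6, 0.2, 0.1, 0.05]))   # LM prefers token 1
    print(int(np.argmax(shallow_fusion_step(asr, lm))))  # -> 1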
Papers
A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition (SLT 2018; Toyota Technological Institute at Chicago, Google Inc)
Results show that almost all of the LM integration approaches improve over the baseline encoder-decoder model for all data sets, confirming the benefit of utilizing unpaired text
The rather simple approach of shallow fusion works best for first-pass decoding on all of their data sets
Deep fusion does not scale well, obtaining no or negligible gains over the baseline on the large-scale Google data sets
Internal language model estimation for domain-adaptive end-to-end speech recognition (SLT 2021; Microsoft Corporation, Redmond) [ source code unavailable ]
Propose an internal LM estimation (ILME) method to facilitate more effective integration of an external LM with pre-existing E2E models, including the most popular RNN-T and AED models, with no additional model training.
Trained with audio-transcript pairs, an E2E model implicitly learns an internal LM that characterizes the training data in the source domain.
The internal LM scores of an E2E model are estimated and subtracted from the log-linear interpolation between the scores of the E2E model and the external LM
ILME reduces the WER of shallow fusion by 8.1%-15.5% relative in cross-domain evaluations and by 2.4%-6.8% relative in intra-domain evaluations.
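A sketch of the decoding objective in our own notation, where the interpolation weights lambda_LM and lambda_ILM are tunable:

    \[
    \hat{Y} = \arg\max_{Y} \Big[ \log P_{\mathrm{E2E}}(Y \mid X)
            + \lambda_{\mathrm{LM}} \log P_{\mathrm{LM}}(Y)
            - \lambda_{\mathrm{ILM}} \log P_{\mathrm{ILM}}(Y) \Big]
    \]

Here P_ILM is the internal LM implicitly learned by the E2E model; the paper estimates it by evaluating the E2E decoder with the acoustic context removed (e.g., zeroed out).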
Customization
Challenges
E.g., an English ASR system usually cannot recognize an Indian user's contact names well, since such names are rare or absent in its training data. However, if the system is presented with this user's contact list, the ASR output can be biased toward those contact names.
Approaches / Papers
Provide the context to the model as a biasing list (a sketch follows the papers below)
Deep context: end-to-end contextual speech recognition (SLT 2018; Google Inc)
Contextual RNN-T for open domain ASR (Interspeech 2020; Facebook AI)
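As referenced above, a minimal sketch of attention-based contextual biasing in the spirit of these papers, assuming the bias phrases (e.g., contact names) have already been embedded into fixed-size vectors; the module name BiasAttention and all dimensions are illustrative.

    import torch
    import torch.nn as nn

    class BiasAttention(nn.Module):
        """Attend from the decoder state over bias-phrase embeddings and
        return a context vector to concatenate with the decoder input."""

        def __init__(self, state_dim, bias_dim, attn_dim):
            super().__init__()
            self.query = nn.Linear(state_dim, attn_dim)
            self.key = nn.Linear(bias_dim, attn_dim)
            self.value = nn.Linear(bias_dim, attn_dim)

        def forward(self, dec_state, bias_embs):
            # dec_state: (batch, state_dim); bias_embs: (batch, n_bias, bias_dim)
            q = self.query(dec_state).unsqueeze(1)           # (batch, 1, attn_dim)
            k, v = self.key(bias_embs), self.value(bias_embs)
            scores = torch.bmm(q, k.transpose(1, 2)) / k.size(-1) ** 0.5
            weights = torch.softmax(scores, dim=-1)          # over bias phrases
            return torch.bmm(weights, v).squeeze(1)          # (batch, attn_dim)

    # Toy usage: 8 bias phrases embedded into 64-dim vectors.
    attn = BiasAttention(state_dim=256, bias_dim=64, attn_dim=128)
    ctx = attn(torch.randn(2, 256), torch.randn(2, 8, 64))
    print(ctx.shape)  # torch.Size([2, 128])

Deep Context, for example, also adds a learnable no-bias entry to the list so the mechanism can be ignored when no phrase is relevant.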
Speaker Adaptation
Challenges
The biggest challenge of speaker adaptation is that the amount of adaptation data available from the target speaker is usually very small
Approaches
Finetuning approach: Finetune on the target speaker's data with additional regularization techniques such as weight decay, KL-divergence regularization, etc. (see the sketch after this list)
TTS-based approach: Use additional synthesized personalized speech to relieve the data-sparsity issue in rapid adaptation and to convert error-prone unsupervised adaptation into pseudo-supervised adaptation (detailed in the second paper below)
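As referenced above, a minimal sketch of KLD-regularized finetuning on frame-level output distributions; kld_regularized_loss and rho are illustrative names, and this loss-level interpolation is one common formulation (equivalent, up to a constant, to interpolating the target distributions).

    import torch
    import torch.nn.functional as F

    def kld_regularized_loss(adapted_log_probs, si_probs, targets, rho=0.5):
        """CE on the target speaker's labels, interpolated with a KL term
        that keeps the adapted model close to the frozen speaker-independent
        (SI) model's outputs.

        adapted_log_probs: (batch, time, vocab) log-probs being adapted
        si_probs:          (batch, time, vocab) probs of the frozen SI model
        targets:           (batch, time) label indices
        """
        ce = F.nll_loss(adapted_log_probs.flatten(0, 1), targets.flatten())
        kl = F.kl_div(adapted_log_probs.flatten(0, 1),
                      si_probs.flatten(0, 1),
                      reduction='batchmean')  # KL(p_SI || p_adapted)
        return (1.0 - rho) * ce + rho * kl

    # Toy usage with random tensors.
    logp = torch.log_softmax(torch.randn(2, 10, 30), dim=-1)
    si = torch.softmax(torch.randn(2, 10, 30), dim=-1)
    tgt = torch.randint(0, 30, (2, 10))
    print(kld_regularized_loss(logp, si, tgt).item())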
Papers
Speaker Adaptation for end-to-end CTC Models (SLT 2018; Microsoft AI and Research, JHU)
Propose two approaches for speaker adaptation
Kullback-Leibler divergence (KLD) regularization
Multi-task learning (MTL)
Using personalized speech synthesis and neural language generator for rapid speaker adaptation (ICASSP 2020; Microsoft Corporation)
Relieves the general data-sparsity issue in rapid adaptation by making use of additional synthesized personalized speech
Circumvents explicit labeling errors in unsupervised adaptation by converting it to pseudo-supervised adaptation