An Attention-Based Joint Acoustic and Text on-Device End-To-End Model
Is part of
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, p.7039-7043
Place / Publisher
IEEE
Year of publication
2020
Source
IEEE Electronic Library Online
Descriptions/Notes
Recently, we introduced a two-pass on-device end-to-end (E2E) speech recognition model, which runs RNN-T in the first pass and then rescores/redecodes the result using a noncausal Listen, Attend and Spell (LAS) decoder. This on-device model obtained similar performance to a state-of-the-art conventional model. However, like many E2E models, it is trained only on supervised audio-text pairs and thus performs poorly on rare words compared to a conventional model, which incorporates a language model trained on a much larger text corpus. In this work, we introduce a joint acoustic and text decoder (JATD) into the LAS decoder, which makes it possible to incorporate a much larger text corpus into training. We find that the JATD model obtains a 3-10% relative improvement in word error rate (WER) compared to a LAS decoder trained only on supervised audio-text pairs across a variety of proper noun test sets.