
Details

Author(s) / Contributors
Title
End-to-end acoustic modeling using convolutional neural networks for HMM-based automatic speech recognition
Is part of
  • Speech communication, 2019-04, Vol.108, p.15-32
Place / Publisher
Amsterdam: Elsevier B.V.
Year of publication
2019
Source
Elsevier ScienceDirect Journals Complete
Descriptions/Notes
  • Highlights:
      • A novel CNN-based end-to-end acoustic modeling approach is proposed.
      • Relevant features are automatically learned from the signal by discriminating phones.
      • The learned features are more discriminative than cepstral-based features.
      • The learned features are somewhat invariant to languages and domains.
      • The proposed approach leads to better ASR systems.
  • Abstract: In hidden Markov model (HMM) based automatic speech recognition (ASR) systems, modeling the statistical relationship between the acoustic speech signal and the HMM states that represent linguistically motivated subword units such as phonemes is a crucial step. This is typically achieved by first extracting acoustic features from the speech signal based on prior knowledge, such as speech perception and/or speech production knowledge, and then training a classifier, such as an artificial neural network (ANN) or a Gaussian mixture model, that estimates the emission probabilities of the HMM states. This paper investigates an end-to-end acoustic modeling approach using convolutional neural networks (CNNs), where the CNN takes the raw speech signal as input and estimates the HMM state class conditional probabilities at the output. In other words, as opposed to a divide-and-conquer strategy (i.e., separating the feature extraction and statistical modeling steps), in the proposed acoustic modeling approach the relevant features and the classifier are jointly learned from the raw speech signal. Through ASR studies and analyses on multiple languages and multiple tasks, we show that: (a) the proposed approach consistently yields a better system with fewer parameters than the conventional approach of cepstral feature extraction followed by ANN training; (b) unlike conventional speech processing, the proposed approach learns the relevant feature representations by first processing the input raw speech at the sub-segmental level (≈ 2 ms); specifically, through an analysis we show that the filters in the first convolution layer automatically learn “in-parts” formant-like information present in the sub-segmental speech; and (c) the intermediate feature representations obtained by subsequent filtering of the first convolution layer's output are more discriminative than standard cepstral features and can be transferred across languages and domains.
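The abstract describes a hybrid HMM/CNN architecture in which a CNN maps chunks of raw waveform directly to HMM-state posteriors, with a first convolution layer that spans only a few milliseconds of signal. The following is a minimal sketch of such a model, not the authors' implementation: it assumes PyTorch and 16 kHz audio, and the class name RawSpeechCNN, all layer sizes, kernel widths, and the number of HMM states are illustrative choices rather than values taken from the paper.

```python
# A minimal sketch (assumptions: PyTorch, 16 kHz raw input; all sizes
# are illustrative, not taken from the paper) of a raw-waveform CNN
# acoustic model for hybrid HMM/ANN ASR.
import torch
import torch.nn as nn

class RawSpeechCNN(nn.Module):
    def __init__(self, num_hmm_states: int = 3000):  # hypothetical state count
        super().__init__()
        # The first convolution operates at the sub-segmental level:
        # a 30-sample kernel at 16 kHz covers roughly 2 ms of speech.
        self.features = nn.Sequential(
            nn.Conv1d(1, 80, kernel_size=30, stride=10),
            nn.ReLU(),
            nn.MaxPool1d(3),
            nn.Conv1d(80, 100, kernel_size=7),
            nn.ReLU(),
            nn.MaxPool1d(3),
            nn.Conv1d(100, 100, kernel_size=7),
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(1024),                # sized on first forward pass
            nn.ReLU(),
            nn.Linear(1024, num_hmm_states),    # one logit per HMM state
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples), a raw-speech chunk centered on the
        # frame whose HMM-state label is to be predicted.
        return self.classifier(self.features(waveform.unsqueeze(1)))

# Usage: per-frame posteriors over HMM states from raw samples.
model = RawSpeechCNN()
chunk = torch.randn(4, 4000)  # 4 chunks of 250 ms at 16 kHz (illustrative)
posteriors = torch.softmax(model(chunk), dim=-1)
```

The network would be trained with a per-frame cross-entropy loss against HMM-state labels; in hybrid decoding, the resulting posteriors are typically divided by the state priors to obtain scaled likelihoods for the HMM.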
