IEEE journal of selected topics in signal processing, 2020-03, Vol.14 (3), p.530-541

2020

Author(s) / Contributors

Title

Multi-Modal Multi-Channel Target Speech Separation

Is part of

IEEE journal of selected topics in signal processing, 2020-03, Vol.14 (3), p.530-541

Place / Publisher

New York: IEEE

Year of publication

2020

Source

IEEE Xplore

Descriptions/Notes

Target speech separation refers to extracting a target speaker's voice from overlapped audio of simultaneous talkers. Previous work has shown that the visual modality has great potential for target speech separation. This work proposes a general multi-modal framework for target speech separation that uses all the available information about the target speaker, including his/her spatial location, voice characteristics and lip movements. Under this framework, we also investigate fusion methods for multi-modal joint modeling. A factorized attention-based fusion method is proposed to aggregate the high-level semantic information of multiple modalities at the embedding level. This method first factorizes the mixture audio into a set of acoustic subspaces, then leverages the target's information from other modalities to enhance these subspace acoustic embeddings with a learnable attention scheme. To validate the robustness of the proposed multi-modal separation model in practical scenarios, the system was evaluated under the condition that one of the modalities is temporarily missing, invalid or corrupted. Experiments are conducted on a large-scale audio-visual dataset collected from YouTube (to be released) that is spatialized by simulated room impulse responses (RIRs). Experimental results show that the proposed multi-modal framework significantly outperforms single-modal and bi-modal speech separation approaches while still supporting real-time processing.
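
The factorize-then-attend idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the linear subspace projections, the softmax attention over subspaces, the weighted-sum fusion, and every name and dimension below (`factorized_attention_fusion`, `subspace_projs`, `attn_proj`, `AUDIO_DIM` and so on) are assumptions chosen for illustration, and the random matrices stand in for parameters that would be learned in a real model.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Random stand-in for a learned projection matrix (assumption)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def factorized_attention_fusion(audio_emb, target_emb, subspace_projs, attn_proj):
    """Fuse a mixture-audio embedding with a target-speaker embedding.

    Hypothetical scheme: project the audio embedding into several
    subspaces, derive one attention weight per subspace from the target
    embedding, and return the attention-weighted sum of the subspaces.
    """
    # Step 1: factorize the mixture embedding into acoustic subspaces.
    subspaces = [matvec(proj, audio_emb) for proj in subspace_projs]
    # Step 2: map the target information to one weight per subspace.
    weights = softmax(matvec(attn_proj, target_emb))
    # Step 3: combine the subspace embeddings using those weights.
    dim = len(subspaces[0])
    fused = [sum(w * sub[i] for w, sub in zip(weights, subspaces))
             for i in range(dim)]
    return fused, weights


# Illustrative sizes only; none of these come from the paper.
AUDIO_DIM, TARGET_DIM, NUM_SUBSPACES, SUBSPACE_DIM = 8, 6, 4, 5

rng = random.Random(0)
subspace_projs = [random_matrix(SUBSPACE_DIM, AUDIO_DIM, rng)
                  for _ in range(NUM_SUBSPACES)]
attn_proj = random_matrix(NUM_SUBSPACES, TARGET_DIM, rng)
audio_emb = [rng.gauss(0.0, 1.0) for _ in range(AUDIO_DIM)]
target_emb = [rng.gauss(0.0, 1.0) for _ in range(TARGET_DIM)]

fused, weights = factorized_attention_fusion(
    audio_emb, target_emb, subspace_projs, attn_proj)
print("attention weights:", [round(w, 3) for w in weights])
print("fused embedding size:", len(fused))
```

In this sketch the target embedding only decides how much each acoustic subspace contributes; an all-zero target embedding yields uniform weights, which is one simple way to picture a modality that is missing or invalid.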

Language: English
Identifiers: ISSN: 1932-4553
eISSN: 1941-0484
DOI: 10.1109/JSTSP.2020.2980956
Title ID: cdi_crossref_primary_10_1109_JSTSP_2020_2980956

Format: –
Keywords: Acoustics, Audio data, Audio equipment, Computer simulation, deep learning, Feature extraction, Lips, multi-modality fusion, Robustness, Separation, Spectrogram, speech enhancement, Speech processing, Speech recognition, Subspaces, Target speech separation, Visualization
