Video captioning has been widely researched. Most related work takes into account only visual content in generating descriptions. However, auditory content such as human speech or environmental sounds contains rich information for describing scenes, yet it has not been widely explored for video captioning. Here, we experiment with different ways to use this auditory content in videos, and demonstrate improved caption generation as measured by popular evaluation metrics such as BLEU, CIDEr, and METEOR. We also measure the semantic similarity between generated captions and human-provided ground truth using sentence embeddings, and find that making good use of multi-modal content helps the model generate captions that are more semantically related to the ground truth. When analyzing the generated sentences, we find ambiguous situations in which visual-only models yield incorrect results but which are resolved by approaches that take auditory cues into account.
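The abstract describes scoring generated captions against human references with sentence embeddings, but does not name the embedding model used. The sketch below is a minimal illustration of that evaluation idea, assuming the sentence-transformers library and cosine similarity as stand-ins; the model name and example captions are not from the paper.

```python
# Illustrative sketch only: the embedding model ("all-MiniLM-L6-v2"),
# the captions, and the use of cosine similarity are assumptions standing
# in for whatever sentence-embedding setup the paper actually uses.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

generated = [
    "a man is playing a guitar",
    "a crowd is cheering at a concert",
]
ground_truth = [
    "someone plays an acoustic guitar",
    "people applaud during a live show",
]

# Embed both sets of captions and compare each generated caption
# with its corresponding human-provided reference.
gen_emb = model.encode(generated, convert_to_tensor=True)
ref_emb = model.encode(ground_truth, convert_to_tensor=True)

# The diagonal of the pairwise cosine-similarity matrix holds the
# similarity of each generated caption to its own reference.
for caption, score in zip(generated, util.cos_sim(gen_emb, ref_emb).diagonal()):
    print(f"{caption!r}: semantic similarity = {score.item():.3f}")
```

A higher cosine similarity indicates that the generated caption is semantically closer to the reference, which complements n-gram-overlap metrics such as BLEU, CIDEr, and METEOR that can penalize correct paraphrases.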