Details

Author(s) / Contributors
Title
Investigating different representations for modeling and controlling multiple emotions in DNN-based speech synthesis
Is part of
  • Speech Communication, 2018-05, Vol. 99, p. 135-143
Place / Publisher
Amsterdam: Elsevier B.V.
Year of publication
2018
Source
Alma/SFX Local Collection
Descriptions/Notes
  • We study the impact of incorporating large-scale listeners' perceptual annotations into the emotional speech modeling process.
  • We consider a number of different emotional representations that allow us to exploit this perceptual information. These representations also offer ways of manipulating the modeled emotion at synthesis time.
  • Two large-scale perceptual evaluations were carried out: one to evaluate modeling accuracy and another to evaluate control capabilities at synthesis time.
  • We show that adding perceptual information based on listeners' annotations significantly improves emotional speech modeling accuracy.
  • We also show how the proposed representations provide notable emotional control capabilities.
  • They allow us to control both emotion recognition rates and perceived emotional strength without degrading the quality of the produced speech.

In this paper, we investigate the simultaneous modeling of multiple emotions in DNN-based expressive speech synthesis, and how to represent the emotional labels, such as emotional class and strength, for this task. Our goal is to answer two questions: First, what is the best way to annotate speech data with multiple emotions – should we use the labels that the speaker intended to express, or labels based on listener perception of the resulting speech signals? Second, how should the emotional information be represented as labels for supervised DNN training, e.g., should emotional class and emotional strength be factorized into separate inputs or not? We evaluate on a large-scale corpus of emotional speech from a professional voice actress, additionally annotated with perceived emotional labels from crowdsourced listeners. By comparing DNN-based speech synthesizers that utilize different emotional representations, we assess the impact of these representations and design decisions on human emotion recognition rates, perceived emotional strength, and subjective speech quality. Simultaneously, we also study which representations are most appropriate for controlling the emotional strength of synthetic speech.
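As a rough illustration of the second question raised in the abstract, the sketch below contrasts two ways of encoding an emotional label as DNN input features: factorizing class and strength into separate inputs versus fusing them into a single joint code. The four-emotion inventory, function names, and exact encodings are illustrative assumptions, not the authors' implementation.

# A minimal sketch (not the paper's code) of two label representations
# for conditioning a DNN speech synthesizer on emotion.
import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "angry"]  # assumed inventory

def factorized_code(emotion: str, strength: float) -> np.ndarray:
    """Emotional class and strength as separate inputs:
    a one-hot class vector concatenated with a scalar strength."""
    one_hot = np.zeros(len(EMOTIONS))
    one_hot[EMOTIONS.index(emotion)] = 1.0
    return np.concatenate([one_hot, [strength]])  # shape: (5,)

def joint_code(emotion: str, strength: float) -> np.ndarray:
    """Class and strength fused into one code:
    the strength value scales the one-hot entry directly."""
    code = np.zeros(len(EMOTIONS))
    code[EMOTIONS.index(emotion)] = strength
    return code  # shape: (4,)

if __name__ == "__main__":
    # The factorized form keeps strength as its own input, so it can be
    # varied at synthesis time independently of the class; the joint form
    # entangles the two in a single vector.
    print(factorized_code("happy", 0.7))  # -> [0. 1. 0. 0. 0.7]
    print(joint_code("happy", 0.7))       # -> [0. 0.7 0. 0.]

Under this reading, comparing synthesizers trained on such alternative encodings is what allows the paper to assess both modeling accuracy and strength control at synthesis time.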
