Mapping voice gender and emotion to acoustic properties of natural speech

Oh, Eunmi and Lee, Jaeeun and Lee, Dayoung

AES E-Library

This study concerns listeners’ natural ability to identify an anonymous speaker’s gender and emotion from voice alone. We attempt to map psychological characteristics of the speaker, such as gender image and emotion, to acoustical properties. The acoustical parameters measured for each voice sample were pitch (mean, maximum, and minimum), pitch variation over time, jitter, shimmer, and Harmonics-to-Noise Ratio (HNR). Participants listened to 2-second voice clips and rated each voice’s gender image and emotion on a 7-point scale. Emotional responses were obtained for 7 opposite pairs of affective attributes (Gobl and Ní Chasaide, 2003): relaxed/stressed, content/angry, friendly/hostile, sad/happy, bored/interested, intimate/formal, and timid/confident. Experimental results show that listeners were able to identify voice gender and assess emotional state from short utterances. Statistical analyses revealed that these acoustic parameters were related to listeners’ perception of a voice’s gender image and its affective attributes. For voice gender perception, there were significant correlations with jitter, shimmer, and HNR in addition to the pitch parameters. For perception of affective attributes, acoustic parameters were analyzed with respect to the valence-arousal dimensions. Voices perceived as positive tended to have higher variance in pitch and higher maximum pitch than those perceived as negative. Voices perceived as strongly active tended to have a higher number of voice breaks, higher jitter and shimmer, and lower HNR than those perceived as passive. We expect that our experimental results on mapping acoustical parameters to voice gender and emotion perception could be applied in the field of Artificial Intelligence (AI) when assigning a specific tone or quality to voice agents.
Moreover, such psycho-acoustical mapping can improve the naturalness of synthesized speech, especially in neural TTS (Text-To-Speech), because it can assist in selecting the appropriate speech database for voice interaction and for situations where a particular voice gender or affective expression is needed.
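
As a rough illustration of two of the parameters named above, the sketch below computes jitter and shimmer from per-cycle measurements. The abstract does not say which tool or which definitions the authors used; this assumes the common "local" definitions (mean absolute difference between consecutive cycles, divided by the mean value). The function name `local_perturbation` and all numbers are hypothetical and are not data from the paper.

```python
from statistics import mean


def local_perturbation(values):
    """Mean absolute difference between consecutive values, divided by the mean value.

    Applied to glottal cycle durations this gives local jitter;
    applied to per-cycle peak amplitudes it gives local shimmer.
    (Assumed definitions; the paper's exact method is not stated.)
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


# Hypothetical per-cycle measurements (made-up numbers for illustration only).
periods_s = [0.0100, 0.0102, 0.0099, 0.0101]  # cycle durations in seconds
peak_amps = [0.50, 0.52, 0.49, 0.51]          # per-cycle peak amplitudes

jitter_local = local_perturbation(periods_s)
shimmer_local = local_perturbation(peak_amps)

# Mean pitch estimated here as the mean of per-cycle fundamental frequencies.
mean_pitch_hz = mean(1.0 / p for p in periods_s)

print(f"jitter (local):  {jitter_local:.2%}")
print(f"shimmer (local): {shimmer_local:.2%}")
print(f"mean pitch:      {mean_pitch_hz:.1f} Hz")
```

A perfectly periodic voice with constant amplitude would give zero for both measures; larger values indicate more cycle-to-cycle irregularity, which the abstract associates with voices perceived as strongly active.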

Author(s): Oh, Eunmi; Lee, Jaeeun; Lee, Dayoung
Affiliation: Yonsei University, Seoul, Korea (See document for exact affiliation information.)
AES Convention: 150; Paper Number: 10461
Publication Date: 2021-05-06
Session subject: Psychology


Type: Convention Paper
