AES E-Library

AES E-Library

Long-Term Fundamental Frequency Modeling Based on Wavelet Packet Transform for Voice Conversion

Document Thumbnail

Prosody conversion is an important part in voice conversion, where fundamental frequency (F0), which carries important speaker individuality information (e.g., tone, intonation, etc.), is regarded as one of the key prosodic features in the excitation model for speech synthesis. In a conventional approach based on continuous wavelet transform for modeling F0, analysis is carried out on a frame level and is prone to losing high-frequency information in the process of decomposition and reconstruction. In order to address this problem, the paper shows a representation of long-term fundamental frequency based on Wavelet Packet Transform (WPT). Specifically, the long-term F0 is decomposed usingWPT, and a joint vector is formed by combining the resulted average power spectrum. Furthermore, the method is applied in a voice conversion system. Voice conversion experiments are conducted on Chinese and English speech data to evaluate the performance of the proposed method. The results show that the proposed method is obviously better than the method based on wavelet transform in all conversion scenarios but performs a little worse than the method based on mean and variance in same-gender conversion scenario.

Authors:
Affiliations:
JAES Volume 72 Issue 3 pp. 161-169; March 2024
Publication Date:
Permalink: https://www.aes.org/e-lib/browse.cfm?elib=22387

Click to purchase paper as a non-member or login as an AES member. If your company or school subscribes to the E-Library then switch to the institutional version. If you are not an AES member and would like to subscribe to the E-Library then Join the AES!

This paper costs $33 for non-members and is free for AES members and E-Library subscribers.

Learn more about the AES E-Library

E-Library Location:

DOI:

Start a discussion about this paper!


AES - Audio Engineering Society