AES Budapest 2012
Paper Session P13
P13 - Analysis and Synthesis: Part 2; Content Management
Friday, April 27, 14:00 — 16:30 (Room: Liszt)
Chair: Michael Kelly
P13-1 Overview of Feature Selection for Automatic Speech Recognition—Branislav Gerazov, Zoran Ivanovski, Faculty of Electrical Engineering and Information Technologies - Skopje, Macedonia
The selection of features for Automatic Speech Recognition (ASR) is critical to the overall performance of an ASR system. Throughout the history of ASR development, a variety of features have been proposed and used, with greater or lesser success. Still, the search for new features, as well as for modifications to the traditional ones, continues. Both newly proposed features and optimizations of traditional features focus on adding robustness to ASR systems, which is of great importance for applications in noisy environments. The paper gives a general overview of the various features that have been used in ASR systems, in as much detail as the available space permits.
Convention Paper 8634
P13-2 Evaluating the Influence of Source Separation Methods in Robust Automatic Speech Recognition with a Specific Cocktail-Party Training—Amparo Marti, Universitat Politècnica de València - València, Spain; Maximo Cobos, University of Valencia - Burjassot (Valencia), Spain; Jose J. Lopez, Universitat Politècnica de València - València, Spain
Automatic Speech Recognition (ASR) allows a computer to identify the words that a person speaks into a microphone and convert them to written text. One of the most challenging situations for ASR is the cocktail-party environment. Although source separation methods have already been investigated as a way to deal with this problem, the separation process is not perfect, and the resulting artifacts pose an additional problem for ASR performance when separation methods based on time-frequency masks are used. Recently, the authors proposed a specific training method to deal with simultaneous-speech situations in practical ASR systems. In this paper we study how speech recognition performance is affected by selecting different combinations of separation algorithms at both the training and test stages of the ASR system under different acoustic conditions. The results show that, while different separation methods produce different types of artifacts, the overall performance always improves when any cocktail-party training is used.
Convention Paper 8635
P13-3 Automatic Regular Voice, Raised Voice, and Scream Recognition Employing Fuzzy Logic—Kuba Lopatka, Andrzej Czyzewski, Gdansk University of Technology - Gdansk, Poland
A method for automatic recognition of regular voice, raised voice, and scream, used in an audio surveillance system, is presented. The algorithm for detecting voice activity in a noisy environment is discussed. Signal features used for sound classification, based on energy, spectral shape, and tonality, are introduced. Sound feature vectors are processed by a fuzzy classifier. The method is employed in an audio surveillance system working in real time in both indoor and outdoor environments. Results of classifying real signals are presented and discussed.
Convention Paper 8636
P13-4 Enhanced Chroma Feature Extraction from HE-AAC Encoder—Marco Fink, University of Erlangen-Nuremberg - Erlangen, Germany; Arijit Biswas, Dolby Germany GmbH - Nuremberg, Germany; Walter Kellermann, University of Erlangen-Nuremberg - Erlangen, Germany
A perceptually enhanced chroma feature extraction during the HE-AAC audio encoding process is proposed. Extraction of chroma features from the MDCT-domain spectra of the encoder and their further enhancement using the perceptual model of the encoder are investigated. The main advantage of such a scheme is reduced computational complexity when both chroma feature extraction and encoding are desired. Specifically, the system is designed to produce reliable chroma features irrespective of the block switching decision of the encoder. Three methods are discussed to circumvent the poor frequency resolution during short blocks. All proposed enhancements are evaluated systematically within a well-known state-of-the-art chord recognition framework.
Convention Paper 8637
P13-5 Hum Removal Filters: Overview and Analysis—Matthias Brandt, Jörg Bitzer, Jade University of Applied Sciences - Oldenburg, Germany
In this contribution we analyze different methods for removing sinusoidal disturbances from audio recordings. To protect the desired signal, the filters used must have high frequency selectivity. However, due to the time-bandwidth uncertainty principle, high frequency selectivity brings about long impulse responses. This can result in audibly resonating filters, causing artifacts in the output signal. Thus, the choice of the optimal algorithm is a compromise between frequency selectivity and acceptable time-domain behavior. In this context, different filter structures and algorithms have different characteristics. To investigate their influence on the hum disturbance and the desired signal, we have evaluated three methods using objective error measures to illustrate the advantages and drawbacks of the individual approaches.
Convention Paper 8638
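Editorial illustration for P13-5: the trade-off between frequency selectivity and impulse response length described in the abstract can be seen in a minimal sketch of a second-order IIR notch filter. This is not one of the three methods evaluated in the paper, which the abstract does not name. The function names, the pole radius `r`, the 50 Hz hum frequency, and the 8 kHz sample rate are all assumptions chosen for illustration. Moving the poles closer to the unit circle (larger `r`) narrows the notch but lengthens the ringing.

```python
import math

def notch_coefficients(f0_hz, fs_hz, r):
    """Second-order notch: zeros on the unit circle at f0, poles at radius r (assumed design)."""
    w0 = 2.0 * math.pi * f0_hz / fs_hz
    b = (1.0, -2.0 * math.cos(w0), 1.0)
    a = (1.0, -2.0 * r * math.cos(w0), r * r)
    return b, a

def filter_signal(b, a, x):
    """Direct-form I difference equation for a biquad."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

def ring_time_samples(r, decay_db=60.0):
    """Samples until the impulse response envelope r**n has dropped by decay_db."""
    return decay_db / (-20.0 * math.log10(r))

def approx_bandwidth_hz(r, fs_hz):
    """Rough notch width for r close to 1 (a common approximation, not exact)."""
    return (1.0 - r) * fs_hz / math.pi

if __name__ == "__main__":
    fs = 8000.0  # assumed sample rate
    for r in (0.99, 0.999):
        print(r, approx_bandwidth_hz(r, fs), ring_time_samples(r))
```

In this sketch, raising `r` from 0.99 to 0.999 makes the notch roughly ten times narrower and the ringing roughly ten times longer, which is the compromise the abstract describes.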