AES Rome 2013
Poster Session P5
P5 - Speech Processing
Saturday, May 4, 15:00 — 16:30 (Foyer)
P5-1 A Speech-Based System for In-Home Emergency Detection and Remote Assistance—Emanuele Principi, Universitá Politecnica delle Marche - Ancona, Italy; Danilo Fuselli, FBT Elettronica Spa - Recanti (MC), Italy; Stefano Squartini, Università Politecnica delle Marche - Ancona, Italy; Maurizio Bonifazi, FBT Elettronica Spa - Recanti (MC), Italy; Francesco Piazza, Universitá Politecnica della Marche - Ancona (AN), Italy
This paper describes a system for detecting emergency states and providing remote assistance to people in their own homes. Emergencies are detected by recognizing distress calls with a speech recognition engine. When an emergency is detected, a phone call to a relative or friend is established automatically by means of a VoIP stack and an Acoustic Echo Canceller. Several low-power embedded units are distributed throughout the house to monitor the acoustic environment, and one central unit coordinates the system operation. This unit also integrates multimedia content delivery services and home automation functionalities. Because the project is ongoing, this paper describes the entire system and then focuses on the algorithms implemented for the acoustic monitoring and the hands-free communication services. Preliminary experiments were conducted to assess the performance of the recognition module in noisy and reverberant environments and its out-of-grammar rejection capabilities. Results showed that the implemented Power Normalized Cepstral Coefficients extraction pipeline improves word recognition accuracy in noisy and reverberant conditions, and that introducing a "garbage phone" in the acoustic model makes it possible to effectively reject out-of-grammar words and sentences.
Convention Paper 8828
P5-2 Assessment of Speech Quality in the Digital Audio Broadcasting (DAB+) System—Stefan Brachmanski, Wroclaw University of Technology - Wroclaw, Poland; Maurycy Kin, Wroclaw University of Technology - Wroclaw, Poland
Methods for assessing speech quality fall into two classes: subjective and objective. This paper includes an overview of selected subjective listening test methods (ACR, DCR) recommended by ITU-T. The influence of the bit rate on sound quality was investigated, as was the influence of the Spectral Band Replication (SBR) process on speech quality. The tested samples were taken from the Digital Audio Broadcasting experimental emission in Poland as well as from an internet network. The subjective assessment of DAB speech signals was performed with both the ACR and DCR methods. It turned out that the SBR process significantly influences speech quality at the lower bit rates, making it as good as at higher bit rates. It was also found that for higher bit rates (96 kbit/s or higher), the two methods give different results.
Convention Paper 8829
P5-3 Investigation on Objective Quality Evaluation for Heavily Distorted Speech—Mitsunori Mizumachi, Kyushu Institute of Technology - Kitakyushu, Fukuoka, Japan
Demand for evaluating speech quality is on the increase. It is advisable to employ a common objective measure across the wide variety of adverse speech signals. Unfortunately, current speech quality measures are not suited to heavily distorted speech signals. In this paper both the applicability and the limits of the perceptual evaluation of speech quality (PESQ) are investigated by comparison with the subjective mean opinion score (MOS) for noise-added and noise-reduced speech signals. It is found that the PESQ scores are compatible with the MOSs for the noise-reduced speech signals in the non-stationary noise conditions.
Convention Paper 8830
P5-4 Novel 5.1 Downmix Algorithm with Improved Dialogue Intelligibility—Kuba Lopatka, Gdansk University of Technology - Gdansk, Poland; Bartosz Kunka, Gdansk University of Technology - Gdansk, Poland; Andrzej Czyzewski, Gdansk University of Technology - Gdansk, Poland
A new algorithm for 5.1 to stereo downmix is introduced that addresses the problem of dialogue intelligibility. The algorithm applies the proposed signal processing techniques to enhance the intelligibility of movie dialogue, especially in difficult listening conditions or with a compromised speaker setup. To account for the latter, a playback configuration using a portable device, namely an ultrabook, is examined. Experiments are presented that confirm the effectiveness of the introduced method. Both objective measurements and subjective listening tests were conducted. The new downmix algorithm is compared to the output of a standard downmix matrix method. The results of the subjective tests show that improved dialogue intelligibility is achieved.
Convention Paper 8831
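For orientation on P5-4, the "standard downmix matrix method" used as the baseline can be sketched as below. This is an illustrative sketch only: the abstract does not state which coefficients the authors used, so the -3 dB gains for the center and surround channels (a common ITU-R BS.775 style choice) and the function name `downmix_51_to_stereo` are assumptions, and the code does not reproduce the paper's intelligibility-enhancing algorithm.

```python
import math

# Assumed gains: -3 dB (1/sqrt(2)) for center and surrounds, LFE discarded.
# The abstract does not give the baseline coefficients; these are a common choice.
MINUS_3_DB = math.sqrt(0.5)


def downmix_51_to_stereo(l, r, c, lfe, ls, rs,
                         center_gain=MINUS_3_DB, surround_gain=MINUS_3_DB):
    """Fold one 5.1 sample frame (L, R, C, LFE, Ls, Rs) into a stereo pair.

    The LFE sample is accepted for completeness but not mixed in.
    """
    left = l + center_gain * c + surround_gain * ls
    right = r + center_gain * c + surround_gain * rs
    return left, right


if __name__ == "__main__":
    # A frame that contains only dialogue in the center channel.
    print(downmix_51_to_stereo(0.0, 0.0, 1.0, 0.0, 0.0, 0.0))
```

In such a matrix the center channel, which usually carries the dialogue, is attenuated and summed with the other channels, which is one reason dialogue can be masked after a plain downmix.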
P5-5 Monaural Speech Source Separation by Estimating the Power Spectrum Using Multi-Frequency Harmonic Product Spectrum—David Ayllon, University of Alcala - Alcalá de Henares, Spain; Roberto Gil-Pita, University of Alcalá - Alcalá de Henares, Spain; Manuel Rosa-Zurera, University of Alcala - Alcalá de Henares, Spain
This paper proposes an algorithm to perform monaural speech source separation by means of time-frequency masking. The algorithm is based on estimating the power spectrum of each original speech signal as the product of a carrier signal and an envelope. A Multi-Frequency Harmonic Product Spectrum (MF-HPS) algorithm is used to estimate the fundamental frequency of the signals in the mixture. These frequencies are used to estimate both the carrier and the envelope from the mixture. Binary masks are generated by comparing the estimated spectra of the signals. Results show a substantial improvement in separation compared to the original algorithm, which uses only the information from the HPS.
Convention Paper 8832
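As background for P5-5, the plain Harmonic Product Spectrum (HPS) that the paper's MF-HPS builds on, and the binary-mask step, can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' MF-HPS: it handles a single source, uses a naive DFT so that it needs only the standard library, and all function names (`magnitude_spectrum`, `hps_f0`, `binary_mask`, `demo_tone`) are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(x):
    """Naive DFT magnitudes for bins 0..N/2 (O(N^2), fine for short frames)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]


def hps_f0(x, fs, harmonics=3):
    """Estimate f0 as the bin maximizing the product of |X| at k, 2k, ..., Hk."""
    mag = magnitude_spectrum(x)
    n = len(x)
    max_bin = (len(mag) - 1) // harmonics  # keep k * harmonics inside the spectrum
    best_bin, best_val = 1, -1.0
    for k in range(1, max_bin + 1):
        prod = 1.0
        for h in range(1, harmonics + 1):
            prod *= mag[k * h]
        if prod > best_val:
            best_bin, best_val = k, prod
    return best_bin * fs / n


def binary_mask(power_a, power_b):
    """1 where source A's estimated power exceeds source B's, else 0."""
    return [1 if a > b else 0 for a, b in zip(power_a, power_b)]


def demo_tone(f0, fs, n, n_harm=4):
    """Hypothetical test signal: harmonics of f0 with 1/h amplitudes."""
    return [sum(math.sin(2 * math.pi * f0 * h * t / fs) / h
                for h in range(1, n_harm + 1))
            for t in range(n)]


if __name__ == "__main__":
    fs, n = 8000, 400  # 20 Hz bin spacing, so 200 Hz falls exactly on a bin
    print(hps_f0(demo_tone(200.0, fs, n), fs))
```

The frequency resolution of this sketch is limited to the bin spacing fs/N, and a mixture of two talkers would need one estimate per source, which is the case the paper's multi-frequency variant addresses.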
P5-6 The Effectiveness of Speech Transmission Index (STI) in Accounting for the Effects of Multiple Arrivals—Timothy J. Ryan, Webster University - St. Louis, MO, USA; Richard King, McGill University - Montreal, Quebec, Canada; The Centre for Interdisciplinary Research in Music Media and Technology - Montreal, Quebec, Canada; Jonas Braasch, Rensselaer Polytechnic Institute - Troy, NY, USA; William L. Martens, University of Sydney - Sydney, NSW, Australia
The authors conducted concurrent experiments employing subjective evaluation methods to examine the effects of the manipulation of several sound system design and optimization parameters on the intelligibility of reinforced speech. During the course of these experiments, objective testing methods were also employed to measure the Speech Transmission Index (STI) associated with each of the variable treatments used. Included in this paper is a comparison of the results of these two testing methods. The results indicate that, while STI is capable of detecting many effects of multiple arrivals, it appears to overestimate the degradation to intelligibility caused by multiple arrivals with short delay times.
Convention Paper 8833
P5-7 Introducing Synchronization of Speech Mixtures in Blind Sparse Separation Problems—Cosme Llerena, University of Alcalá - Alcala de Henares (Madrid), Spain; Lorena Álvarez, University of Alcalá - Alcalá de Henares, Spain; Roberto Gil-Pita, University of Alcalá - Alcalá de Henares, Spain; Manuel Rosa-Zurera, University of Alcala - Alcalá de Henares, Spain
This paper explores the feasibility of synchronizing speech mixtures prior to applying blind sparse source separation methods in order to improve their results. Broadly, methods that assume sparse sources use level and phase differences between mixtures as features and separate the signals on that basis. If a mixture is considerably delayed with respect to the others, the information extracted from these differences can be wrong. With this idea in mind, this paper focuses on using Time Delay Estimation algorithms to synchronize the mixtures and on observing the resulting improvement in a Blind Sparse Source Separation algorithm. The results obtained show the feasibility of synchronizing the speech mixtures.
Convention Paper 8834
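The synchronization idea in P5-7 can be illustrated with the simplest form of time delay estimation: pick the lag that maximizes the cross-correlation between two mixtures, then shift one of them by that lag. The abstract does not say which Time Delay Estimation algorithms were evaluated, so plain cross-correlation is an assumption here, and the names `estimate_delay`, `synchronize`, and `make_delayed_pair` are hypothetical.

```python
import random


def estimate_delay(ref, sig, max_lag):
    """Return the lag d (in samples) maximizing sum_t ref[t] * sig[t + d].

    A positive result means sig lags behind ref by d samples.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for t in range(len(ref)):
            u = t + lag
            if 0 <= u < len(sig):
                score += ref[t] * sig[u]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def synchronize(sig, lag):
    """Shift sig by lag samples (zero-padded) so that it lines up with ref."""
    n = len(sig)
    if lag >= 0:
        return sig[lag:] + [0.0] * lag
    return [0.0] * (-lag) + sig[:n + lag]


def make_delayed_pair(n, delay, seed=0):
    """Hypothetical test data: a noise signal and a copy delayed by `delay`."""
    rng = random.Random(seed)
    ref = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    sig = [0.0] * delay + ref[:n - delay]
    return ref, sig


if __name__ == "__main__":
    ref, sig = make_delayed_pair(200, 5)
    lag = estimate_delay(ref, sig, max_lag=20)
    aligned = synchronize(sig, lag)
    print(lag, aligned[:195] == ref[:195])
```

Once the mixtures are aligned in this way, the remaining level and phase differences reflect the mixing itself instead of the bulk delay between channels.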
P5-8 An Embedded-Processor Driven Test Bench for Acoustic Feedback Cancellation in Real Environments—Francesco Faccenda, Universitá Politecnica delle Marche - Ancona, Italy; Stefano Squartini, Università Politecnica delle Marche - Ancona, Italy; Emanuele Principi, Universitá Politecnica delle Marche - Ancona, Italy; Leonardo Gabrielli, Universitá Politecnica delle Marche - Ancona, Italy; Francesco Piazza, Universitá Politecnica della Marche - Ancona (AN), Italy
Speech reinforcement systems equipped with microphones and loudspeakers are employed to facilitate communication among speakers. Due to the acoustic coupling between microphones and loudspeakers, speech intelligibility may be degraded and, moreover, high channel gains could drive the system to instability. Acoustic Feedback Cancellation (AFC) methods need to be applied to keep the system stable. In this paper a new test bench for testing AFC algorithms in real environments is proposed. It is based on the TMS320C6748 processor running the Suppressor-PEM algorithm, a recent technique based on the PEM-AFROW paradigm. The partitioned block frequency domain adaptive filter (PB-FDAF) paradigm has been adopted to keep the computational complexity low. A professional sound card and a PC, on which an automatic gain controller has been implemented to prevent signal clipping, complete the framework. Several experimental tests confirmed the framework's suitability for operation under diverse acoustic conditions.
Convention Paper 8835