Friday, September 30, 3:15 pm — 4:45 pm (Rm 403B)
P16-1 Analysis of Binaural Features for Supervised Localization in Reverberant Environments—Jiance Ding, Chinese Academy of Sciences - Beijing, China; University of Chinese Academy of Sciences - Beijing, China; Jie Wang, Guangzhou University - Guangzhou, China; Chengshi Zheng, Chinese Academy of Sciences - Beijing, China; Chinese Academy of Sciences - Shanghai, China; Renhua Peng, Chinese Academy of Sciences - Beijing, China; Xiaodong Li, Chinese Academy of Sciences - Beijing, China; Chinese Academy of Sciences - Shanghai, China
Recent research on supervised binaural sound source localization shows promising performance even in reverberant environments, provided that the training and testing environments match perfectly. However, these supervised methods may still suffer some performance degradation when the intensity of the reverberation increases markedly. This paper studies the impact of reverberation on binaural features theoretically and reveals that reverberation is a major factor in reducing the accuracy of supervised binaural localization. Accordingly, we use a binaural dereverberation algorithm to reduce the effect of reverberation and thus improve the performance of existing supervised binaural localization methods. Experimental results demonstrate that dereverberation can improve the localization accuracy of these methods in reverberant environments.
Convention Paper 9642 (Purchase now)
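The supervised localizers discussed above are typically trained on binaural cues such as the interaural time difference (ITD) and interaural level difference (ILD). A minimal sketch of extracting these two features from one frame of a binaural recording, assuming equal-length left/right frames; the function name and the 1 ms ITD limit are illustrative, not the authors' implementation:

```python
import numpy as np

def binaural_features(left, right, fs, max_itd_s=1e-3):
    """Estimate ITD (seconds) and ILD (dB) for one frame of binaural audio."""
    n = len(left)
    max_lag = int(max_itd_s * fs)              # physiological ITD limit (~1 ms)
    lags = np.arange(-max_lag, max_lag + 1)
    # full cross-correlation; index n-1 corresponds to zero lag
    xcorr = np.correlate(left, right, mode="full")[n - 1 - max_lag:n + max_lag]
    itd = lags[np.argmax(xcorr)] / fs          # lag of the correlation peak
    eps = 1e-12
    ild = 10.0 * np.log10((np.sum(left**2) + eps) / (np.sum(right**2) + eps))
    return itd, ild
```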
P16-2 Acoustic Echo Cancellation for Asynchronous Systems Based on Resampling Adaptive Filter Coefficients—Yang Cui, Chinese Academy of Sciences - Beijing, China; University of Chinese Academy of Sciences - Beijing, China; Jie Wang, Guangzhou University - Guangzhou, China; Chengshi Zheng, Chinese Academy of Sciences - Beijing, China; Chinese Academy of Sciences - Shanghai, China; Xiaodong Li, Chinese Academy of Sciences - Beijing, China; Chinese Academy of Sciences - Shanghai, China
In asynchronous systems, most traditional acoustic echo cancellation (AEC) algorithms cannot track the echo path correctly because the D/A and A/D converters are not synchronized, which can degrade performance dramatically. Based on multirate digital signal processing theory, this paper proposes to solve this problem by resampling the adaptive filter coefficients (RAFC), where the coefficients are updated by the normalized least mean square (NLMS) algorithm with a variable step-size control method. Simulation results indicate that the proposed method can estimate the clock offset quite accurately. Objective test results also show that the proposed RAFC-NLMS is much better than the previous adaptive sampling-rate correction algorithm in terms of convergence rate and clock-offset tracking performance.
Convention Paper 9643 (Purchase now)
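The coefficient update named in the abstract is the standard NLMS recursion. A minimal sketch of an NLMS echo canceller follows; the paper's actual contributions, resampling the coefficients to compensate the clock offset and the variable step-size control, are not reproduced here, and `taps` and `mu` are illustrative defaults:

```python
import numpy as np

def nlms_aec(far_end, mic, taps=256, mu=0.5, eps=1e-6):
    """Cancel the far-end echo in `mic`; returns the error (echo-reduced) signal."""
    w = np.zeros(taps)                           # adaptive estimate of the echo path
    err = np.zeros(len(mic))
    for n in range(taps - 1, len(mic)):
        x = far_end[n - taps + 1:n + 1][::-1]    # newest-first far-end reference vector
        y = w @ x                                # echo estimate
        e = mic[n] - y                           # residual after cancellation
        w += (mu / (x @ x + eps)) * e * x        # NLMS update, normalized by signal power
        err[n] = e
    return err
```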
P16-3 Single-Channel Speech Enhancement Based on Reassigned Spectrogram—Jie Wang, Guangzhou University - Guangzhou, China; Chengcheng Yang, Guangzhou University - Guangzhou, China; Chunliang Zhang, Guangzhou University - Guangzhou, China; Renhua Peng, Chinese Academy of Sciences - Beijing, China
Most traditional a priori SNR estimators, such as the decision-directed approach and its improved versions, only consider the correlation of adjacent frames. However, it is well known that voiced speech is a typical harmonic signal, which results in strong correlation across harmonics. We can therefore expect the a priori SNR estimator to improve if the correlations of adjacent frames and of harmonics are used simultaneously. With this motivation, we propose to use the reassigned spectrogram (RS) to control the forgetting factor of the decision-directed approach. Experimental results indicate that the proposed RS-based SNR estimator performs much better than the traditional decision-directed approach.
Convention Paper 9644 (Purchase now)
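For reference, the classical decision-directed estimator that the paper modifies blends the previous frame's clean-speech estimate with the instantaneous SNR through a forgetting factor; the proposed method instead steers that factor per time-frequency bin using the reassigned spectrogram. A minimal sketch of the unmodified DD update (variable names are illustrative):

```python
import numpy as np

def decision_directed(post_snr, post_snr_prev, gain_prev, alpha=0.98):
    """One frame of the DD a priori SNR update, per frequency bin.

    post_snr:      a posteriori SNR of the current frame, |Y|^2 / noise PSD
    post_snr_prev: a posteriori SNR of the previous frame
    gain_prev:     spectral gain applied in the previous frame
    alpha:         forgetting factor (made time-frequency adaptive in the paper)
    """
    snr_inst = np.maximum(post_snr - 1.0, 0.0)   # instantaneous ML estimate
    return alpha * gain_prev**2 * post_snr_prev + (1.0 - alpha) * snr_inst
```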
P16-4 The a Priori SNR Estimator Based on Cepstral Processing—Jie Wang, Guangzhou University - Guangzhou, China; Guangquan Yang, Guangzhou University - Guangzhou, China; JingJing Liu, Guangzhou University - Guangzhou, China; Renhua Peng, Chinese Academy of Sciences - Beijing, China
For single-channel speech enhancement systems, the a priori SNR is a key parameter of Wiener-type algorithms. A priori SNR estimators can reduce the noise efficiently when the noise power spectral density (NPSD) can be estimated accurately. However, when the NPSD is overestimated or underestimated, the a priori SNR may lead to speech distortion and residual noise. To solve this problem, this paper proposes to estimate the a priori SNR based on cepstral processing, which can not only suppress harmonic speech components in noisy speech segments but also reduce strong noise components in noise-only segments. Simulation results show that the proposed algorithm outperforms the traditional decision-directed (DD) approach and Plapous's two-step algorithm.
Convention Paper 9645 (Purchase now)
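A generic illustration of the kind of cepstral processing the abstract alludes to: liftering in the quefrency domain separates the smooth spectral envelope from the harmonic fine structure, which is one way to suppress speech harmonics before estimating the SNR. This sketch shows the general technique under that assumption, not the authors' exact estimator:

```python
import numpy as np

def cepstral_smooth(power_spec, keep_quefrency=20):
    """Low-quefrency liftering of a (half-spectrum) power spectrum."""
    log_spec = np.log(np.maximum(power_spec, 1e-12))
    ceps = np.fft.irfft(log_spec)              # real cepstrum of the frame
    lifter = np.zeros_like(ceps)
    lifter[:keep_quefrency] = 1.0              # keep the envelope (low quefrencies);
    lifter[-(keep_quefrency - 1):] = 1.0       # mirror half, since the cepstrum is symmetric
    return np.exp(np.fft.rfft(ceps * lifter).real)
```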
P16-5 Quantitative Analysis of Masking in Multitrack Mixes Using Loudness Loss—Gordon Wichern, iZotope, Inc. - Cambridge, MA, USA; Hannah Robertson, iZotope - Cambridge, MA, USA; Aaron Wishnick, iZotope - Cambridge, MA, USA
The reduction of auditory masking is a crucial objective when mixing multitrack audio and is typically achieved by manipulating gain, equalization, and/or panning for each stem in a mix. However, some amount of masking is unavoidable, acceptable, or even desirable in certain situations. Current automatic mixing approaches often target the reduction of masking in general, rather than particularly problematic masking. As a first step toward focusing automatic masking-reduction algorithms on problematic rather than known and accepted masking, we use psychoacoustic masking models to analyze multitrack mixes produced by experienced audio engineers. We measure masking in terms of loudness loss and identify problematic masking as outliers (values above the 95th percentile) in instrument- and frequency-dependent distributions.
Convention Paper 9646 (Purchase now)
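The outlier criterion in the abstract is straightforward to state in code. A minimal sketch, assuming `loudness_loss` has already been computed by a psychoacoustic masking model (e.g., a stem's loudness in isolation minus its partial loudness in the mix), which is not implemented here:

```python
import numpy as np

def problematic_masking(loudness_loss, percentile=95.0):
    """Flag outlier loudness-loss values, per frequency band.

    loudness_loss: (frames, bands) array of loudness-loss values for one
    instrument. Returns a boolean array marking cells above the per-band
    95th-percentile threshold.
    """
    thresholds = np.percentile(loudness_loss, percentile, axis=0)
    return loudness_loss > thresholds
```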
P16-6 Log Complex Color for Visual Pattern Recognition of Total Sound—Stephen Wedekind, University of Missouri - St. Louis - St. Louis, MO, USA; P. Fraundorf, University of Missouri - St. Louis - St. Louis, MO, USA
Traditional audio visualization methods depict amplitude intensity versus time, as in a time-frequency spectrogram, and while some use complex phase information to augment the amplitude representation, as in a reassigned spectrogram, the phase data are not generally represented in their own right. By plotting amplitude intensity as brightness/saturation and phase cycles as hue variations, our complex spectrogram method displays both amplitude and phase information simultaneously, making such images canonical visual representations of the source wave. As a result, the original sound may be reconstructed (down to the original phases) from an image simply by reversing our process. This allows humans to apply our highly developed visual pattern-recognition skills to complete audio data in a new way.
Convention Paper 9647 (Purchase now)
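A minimal sketch of the display mapping the abstract describes: each STFT bin's phase becomes hue and its log-compressed magnitude becomes brightness, so one image carries both quantities. Parameter choices here are illustrative, not the authors':

```python
import numpy as np
from matplotlib.colors import hsv_to_rgb
from scipy.signal import stft

def log_complex_color(x, fs, nperseg=512):
    """Render a complex spectrogram as an RGB image: phase -> hue, log |Z| -> value."""
    _, _, Z = stft(x, fs=fs, nperseg=nperseg)        # complex STFT, shape (freq, time)
    hue = (np.angle(Z) + np.pi) / (2.0 * np.pi)      # map one phase cycle onto [0, 1)
    mag = np.log1p(np.abs(Z))                        # log compression of magnitude
    val = mag / max(mag.max(), 1e-12)                # brightness from magnitude
    sat = np.ones_like(val)                          # full saturation
    return hsv_to_rgb(np.dstack([hue, sat, val]))    # (freq, time, 3) RGB array
```

Because both magnitude and phase survive this mapping, the complex STFT, and hence the waveform, can in principle be recovered by inverting the color map and applying an inverse STFT, which is the invertibility the abstract claims.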
P16-7 Material for Automatic Phonetic Transcription of Speech Recorded in Various Conditions—Bozena Kostek, Gdansk University of Technology - Gdansk, Poland; Audio Acoustics Lab.; Magdalena Plewa, Gdansk University of Technology - Gdansk, Poland; Andrzej Czyzewski, Gdansk University of Technology - Gdansk, Poland
Automatic speech recognition (ASR) is under constant development, especially for speech that is casually produced, acquired in varied environmental conditions, or recorded in the presence of background noise. Phonetic transcription is an important step in the full speech recognition process and is the main focus of the presented work. ASR is widely implemented in mobile devices, but it is also needed in applications such as automatic recognition of speech in movies for non-native speakers and impaired users, and as a support for multimedia systems. This work analyzes speech recorded in various conditions. First, audio and video recordings of a specially constructed list of English words were prepared in order to perform dedicated audio and video analyses in future stages of the research, which aims at the development of audio-visual speech recognition (AVSR) systems. A dataset of audio-video recordings was prepared, and examples of analyses are described in the paper.
Convention Paper 9648 (Purchase now)