AES Warsaw 2015
Poster Session P7
P7 - (Poster) Audio Signal Processing
Friday, May 8, 11:00 — 13:00 (Foyer)
P7-1 Feature Learning for Classifying Drum Components from Nonnegative Matrix Factorization—Matthias Leimeister, Native Instruments GmbH - Berlin, Germany
This paper explores automatic feature learning methods to classify percussive components in nonnegative matrix factorization (NMF). To circumvent the necessity of designing appropriate spectral and temporal features for component clustering, as usually used in NMF-based transcription systems, multilayer perceptrons and deep belief networks are trained directly on the factorization of a large number of isolated samples of kick and snare drums. The learned features are then used to assign components resulting from the analysis of polyphonic music to the different drum classes and retrieve the temporal activation curves. The evaluation on a set of 145 excerpts of polyphonic music shows that the algorithms can efficiently classify drum components and compare favorably to a classic “bag-of-features” approach using support vector machines and spectral mid-level features.
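The factorization step the abstract refers to can be illustrated with standard Euclidean multiplicative-update NMF. This is a generic sketch, not the authors' implementation: in the paper, each learned spectral template (a column of W) would be fed to the trained multilayer perceptron or deep belief network rather than used directly, and all names and parameter values below are illustrative.

```python
import numpy as np

def nmf(V, n_components, n_iter=200, eps=1e-9, seed=0):
    """Factor a nonnegative magnitude spectrogram V (freq x time) into
    spectral templates W and activations H with multiplicative updates."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, n_components)) + eps
    H = rng.random((n_components, T)) + eps
    for _ in range(n_iter):
        # standard Euclidean-distance update rules
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ (H @ H.T) + eps)
    return W, H
```

In a transcription setting, the columns of W would then be classified (kick vs. snare) and the matching rows of H read off as activation curves.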
Convention Paper 9261
P7-2 Blind Bandwidth Extension System Utilizing Advanced Spectral Envelope Predictor—Kihyun Choo, Samsung Electronics Co., Ltd. - Suwon, Korea; Anton Porov, Samsung R&D Institute Russia - Moscow, Russia; ITMO University - Saint-Petersburg, Russia; Eunmi Oh, Samsung Electronics Co., Ltd. - Suwon, Korea
We propose a blind bandwidth extension (BWE) technique that improves the quality of a narrow-band speech signal using time-domain extension and spectral envelope prediction in the frequency domain. In the time domain we use a spectral double-shifting method, and in the frequency domain a new spectral envelope predictor is introduced. We observe less distortion when the spectral attribute is transferred from the low band to the high band than when the original high band is mirrored directly. The proposed blind BWE system is applied to the decoded output of an adaptive multi-rate (AMR) codec at 12.2 kbps to generate a high-frequency spectrum from 4 to 8 kHz. The blind BWE was objectively evaluated against the AMR and AMR wideband codecs and subjectively evaluated by comparison with the AMR.
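The abstract does not specify the double-shifting method or the envelope predictor, so the sketch below illustrates only the general idea of spectral-translation BWE: copy the upper narrow-band magnitude bins into the empty 4-8 kHz region and shape them with an assumed decaying envelope. Function name, gain, and envelope shape are all hypothetical.

```python
import numpy as np

def extend_band(X_nb, gain=0.3):
    """Toy per-frame extension of a narrow-band magnitude spectrum.
    X_nb holds N bins covering 0-4 kHz; the result holds 2N bins (0-8 kHz)."""
    N = len(X_nb)
    src = np.tile(X_nb[N // 2:], 2)         # reuse the 2-4 kHz content twice
    env = gain * np.linspace(1.0, 0.2, N)   # assumed decaying high-band envelope
    return np.concatenate([X_nb, src * env])
```

A real system would predict the envelope from low-band features per frame instead of using a fixed ramp.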
Convention Paper 9262
P7-3 Time Domain Extrapolative Packet Loss Concealment for MDCT Based Voice Codec—Shen Huang, Dolby Laboratories - Beijing, China; Xuejing Sun, Dolby Laboratories - Beijing, China
A novel low-latency packet loss concealment technique for transform-based codecs is proposed. The algorithm combines the signal from the inverse modulated discrete cosine transform (IMDCT) domain with the previously reconstructed time-domain signal, phase-aligned, on which a pitch-synchronized concealment is performed. This minimizes the aliasing artifacts that occur in MDCT-domain concealment of voiced speech signals. For unvoiced speech, speech-shaped comfort noise is inserted. When there is a burst loss, a position-dependent concealment process is applied at the different stages of the loss. Subjective listening tests with both naïve and expert listeners suggest that the proposed algorithm generates fewer artifacts and performs significantly better than legacy packet-repetition-based approaches.
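The pitch-synchronized part of such concealment can be illustrated in its simplest form: fill the lost segment by repeating the last pitch period of the previously reconstructed signal. This is a generic sketch, not the proposed algorithm, which additionally blends IMDCT-domain output with the phase-aligned time-domain signal.

```python
import numpy as np

def conceal(history, period, n_lost):
    """Fill n_lost samples by cyclically repeating the last pitch period
    (period samples) of the previously reconstructed signal."""
    cycle = history[-period:]
    return np.tile(cycle, n_lost // period + 1)[:n_lost]
```

A production concealer would also cross-fade at the splice points and attenuate over long bursts.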
Convention Paper 9263
P7-4 Scalable Parametric Audio Coder Using Sparse Approximation with Frame-to-Frame Perceptually Optimized Wavelet Packet Based Dictionary—Alexey Petrovsky, Belarusian State University of Informatics and Radioelectronics - Minsk, Belarus; Vadzim Herasimovich, Belarusian State University of Informatics and Radioelectronics - Minsk, Belarus; Alexander Petrovsky, Belarusian State University of Informatics and Radioelectronics - Minsk, Belarus
This paper is devoted to the development of a scalable parametric audio coder based on a matching pursuit algorithm with a frame-based, psychoacoustically optimized wavelet packet dictionary. The main idea is to parameterize the audio signal with a minimum number of dictionary elements, which can be done by applying a sparse approximation such as the matching pursuit algorithm. In contrast with current sparse-approximation approaches to audio coding, we introduce a model that forms the dictionary dynamically for each frame of the input audio signal, based on wavelet packet decomposition and dynamic wavelet packet tree transformation driven by a psychoacoustic model. Experimental results for the developed encoder and a comparison with popular modern audio encoders are provided.
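The sparse approximation the coder relies on can be sketched as plain matching pursuit over a dictionary with unit-norm atoms. In the paper the dictionary is a per-frame, psychoacoustically optimized wavelet packet basis; here D is just a generic matrix, so this is an illustration of the algorithm, not of the coder.

```python
import numpy as np

def matching_pursuit(x, D, n_atoms=10, tol=1e-6):
    """Greedy matching pursuit: repeatedly pick the dictionary atom
    (unit-norm column of D) most correlated with the residual."""
    residual = x.astype(float).copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_atoms):
        corr = D.T @ residual
        k = np.argmax(np.abs(corr))
        if abs(corr[k]) < tol:          # nothing left to explain
            break
        coeffs[k] += corr[k]
        residual -= corr[k] * D[:, k]
    return coeffs, residual
```

Scalability then comes naturally: truncating the atom list at any point yields a coarser but valid reconstruction.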
Convention Paper 9264
P7-5 General-Purpose Listening Enhancement Based on Subband Non-Linear Amplification with Psychoacoustic Criterion—Elias Azarov, Belarusian State University of Informatics and Radioelectronics - Minsk, Belarus; Maxim Vashkevich, Belarusian State University of Informatics and Radioelectronics - Minsk, Belarus; Vadzim Herasimovich, Belarusian State University of Informatics and Radioelectronics - Minsk, Belarus; Alexander Petrovsky, Belarusian State University of Informatics and Radioelectronics - Minsk, Belarus
Near-end listening enhancement is an effective approach to improving speech intelligibility in noisy conditions and is applied mainly in telecommunications. The potential application field of the concept is much wider, however, and can be extended to listening to any audio content (including music and other sounds) in quiet and noisy conditions. This paper proposes a near-end listening enhancement algorithm designed to process both speech and music that, according to subjective listening tests, significantly improves the listening experience. The algorithm is based on subband non-linear amplification of the audio signal in accordance with the spectral characteristics of the noise and the personal hearing thresholds of the listener. The algorithm has been implemented experimentally as a smartphone application.
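A minimal sketch of such a subband gain rule, assuming per-band levels in dB: raise each band above the larger of the noise level and the listener's hearing threshold, plus a margin, and clip the gain. The actual non-linear amplification curve and psychoacoustic criterion are not given in the abstract, so the target rule and constants below are illustrative.

```python
import numpy as np

def subband_gains(signal_db, noise_db, hearing_threshold_db,
                  margin_db=3.0, max_gain_db=20.0):
    """Per-band gain (dB): lift each band to max(noise, hearing threshold)
    + margin, never attenuating and never exceeding max_gain_db."""
    target = np.maximum(noise_db, hearing_threshold_db) + margin_db
    return np.clip(target - signal_db, 0.0, max_gain_db)
```

The gain cap stands in for the loudness/distortion constraint a real implementation would enforce.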
Convention Paper 9265
P7-6 Poster Moved to Session P16—N/A
P7-7 Speech Analysis Based on Sinusoidal Model with Time-Varying Parameters—Elias Azarov, Belarusian State University of Informatics and Radioelectronics - Minsk, Belarus; Maxim Vashkevich, Belarusian State University of Informatics and Radioelectronics - Minsk, Belarus; Alexander Petrovsky, Belarusian State University of Informatics and Radioelectronics - Minsk, Belarus
Extracting speech-specific characteristics from a signal, such as the spectral envelope and pitch, is essential for parametric speech processing. These characteristics are used in many speech applications, including coding, parametric text-to-speech synthesis, voice morphing, and others. This paper presents original estimation techniques that extract these characteristics using a sinusoidal model of speech with instantaneous parameters. The analysis scheme consists of two steps: first the parameters of the sinusoidal model are extracted from the signal, and then these parameters are transformed into the required characteristics. Evaluations of the presented techniques are carried out on synthetic and natural speech signals to show the potential of the presented approach.
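As a point of reference, the crudest form of sinusoidal analysis (constant parameters per frame, unlike the instantaneous-parameter model in the paper) is to pick the strongest spectral peaks of a windowed frame as (frequency, amplitude) pairs. The sketch below is that baseline, with illustrative names and defaults.

```python
import numpy as np

def sinusoid_peaks(frame, fs, n_peaks=5):
    """Return the n_peaks strongest local spectral maxima of a Hann-windowed
    frame as (frequency_hz, magnitude) tuples, sorted by magnitude."""
    win = np.hanning(len(frame))
    X = np.abs(np.fft.rfft(frame * win))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    idx = [i for i in range(1, len(X) - 1) if X[i] > X[i - 1] and X[i] >= X[i + 1]]
    idx.sort(key=lambda i: -X[i])
    return [(freqs[i], X[i]) for i in idx[:n_peaks]]
```

Step two of the paper's scheme would then convert such parameters into the spectral envelope and pitch track.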
Convention Paper 9267
P7-8 A Low-Delay Algorithm for Instantaneous Pitch Estimation—Elias Azarov, Belarusian State University of Informatics and Radioelectronics - Minsk, Belarus; Maxim Vashkevich, Belarusian State University of Informatics and Radioelectronics - Minsk, Belarus; D. Likhachov, Belarusian State University of Informatics and Radioelectronics - Minsk, Belarus; Alexander Petrovsky, Belarusian State University of Informatics and Radioelectronics - Minsk, Belarus
Estimation of instantaneous pitch provides high accuracy for frequency-modulated pitch contours and can be beneficial compared to conventional pitch extraction techniques for unsteady voiced sounds. However, applying an instantaneous pitch estimator in a practical real-time speech processing application is a hard problem because of its high computational cost and high inherent delay. This paper presents an algorithm for instantaneous pitch estimation specifically designed for real-time applications. The analysis scheme is based on the instantaneous robust algorithm for pitch tracking (IRAPT), featuring an efficient processing scheme and low inherent delay. The paper presents evaluation results on synthesized and natural speech signals that illustrate the actual performance of the algorithm.
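For contrast with IRAPT, the conventional baseline it improves on is a frame-wise autocorrelation estimator, which returns one pitch value per frame rather than an instantaneous contour. A minimal sketch (not the paper's algorithm):

```python
import numpy as np

def estimate_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Basic autocorrelation pitch estimate (Hz) for one frame: pick the
    lag with maximal autocorrelation inside the [fmin, fmax] search range."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag
```

The frame-length requirement of this approach (at least two periods of the lowest pitch) is one source of the delay that instantaneous methods aim to avoid.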
Convention Paper 9268
P7-9 Content-Based Music Structure Analysis Using Vector Quantization—Nikolaos Tsipas, Aristotle University of Thessaloniki - Thessaloniki, Greece; Lazaros Vrysis, Aristotle University of Thessaloniki - Thessaloniki, Greece; Playcompass Entertainment; Charalampos A. Dimoulas, Aristotle University of Thessaloniki - Thessaloniki, Greece; George Papanikolaou, Aristotle University of Thessaloniki - Thessaloniki, Greece
Music structure analysis has been one of the challenging problems in music information retrieval during the last decade. Advances in the field over the past years have contributed toward the establishment and standardization of a framework covering repetition-, homogeneity-, and novelty-based approaches. This paper proposes an optimized fusion algorithm for detecting transition points in musical pieces, as an extension of existing state-of-the-art techniques. Vector quantization is introduced as an adaptive filtering mechanism for time-lag matrices, while a structure-feature-based self-similarity matrix is proposed for novelty detection. The method is evaluated on 124 pop songs from the INRIA Eurovision dataset, and performance results are presented in comparison with existing state-of-the-art implementations for music structure analysis.
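Novelty detection on a self-similarity matrix (SSM) is commonly done by sliding a checkerboard kernel along the main diagonal; the sketch below follows that standard scheme and does not reproduce the paper's structure-feature SSM or vector-quantization filtering. Kernel size L is illustrative.

```python
import numpy as np

def checkerboard_kernel(L):
    """2L x 2L kernel: +1 on the diagonal quadrants, -1 off-diagonal."""
    return np.kron(np.array([[1.0, -1.0], [-1.0, 1.0]]), np.ones((L, L)))

def novelty_curve(S, L=4):
    """Correlate the checkerboard kernel along the main diagonal of the
    self-similarity matrix S; peaks mark transitions between segments."""
    K = checkerboard_kernel(L)
    N = S.shape[0]
    nov = np.zeros(N)
    for i in range(L, N - L):
        nov[i] = np.sum(K * S[i - L:i + L, i - L:i + L])
    return nov
```

Transition-point candidates are then the peaks of the novelty curve.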
Convention Paper 9269
P7-10 Clock Skew Compensation by Adaptive Resampling for Audio Networking—Leonardo Gabrielli, Università Politecnica delle Marche - Ancona, Italy; Michele Bussolotto, Università Politecnica delle Marche - Ancona, Italy; Stefano Squartini, Università Politecnica delle Marche - Ancona, Italy; Fons Adriaensen, Huawei European Research Center - Munich, Germany
Wired audio networking has been an established practice for years, based on both proprietary solutions and open hardware and protocols. One of the most cost-effective solutions is to use a general-purpose IEEE 802.3 infrastructure and personal computers running IP-based protocols. One obvious shortcoming of such setups is the lack of synchronization at the audio level and the presence of a network delay affected by jitter. Two approaches to sustaining a continuous audio flow are described, implemented by the authors in open-source projects and based on relative-time and absolute-time adaptive resampling, respectively. A description of the mechanisms is provided along with simulated and measured results, which show the validity of both approaches.
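The core of adaptive-resampling skew compensation can be sketched as a control loop that steers the resampling ratio from the receive (jitter) buffer fill level: if the sender's clock runs fast, the buffer fills, and the ratio is nudged up until consumption matches production. The proportional controller and constants below are illustrative, not taken from the authors' projects.

```python
def resample_ratio(fill, target_fill, nominal=1.0, k=0.05):
    """Proportional control: buffer-fill deviation from target steers the
    resampling ratio around the nominal rate (illustrative gain k)."""
    return nominal + k * (fill - target_fill)

# toy simulation: sender's clock runs 0.01% fast relative to ours
fill, target = 50.0, 50.0        # jitter-buffer fill, in blocks
ratio = 1.0
for _ in range(500):
    ratio = resample_ratio(fill, target)
    fill += 1.0001 - ratio       # blocks arriving minus blocks consumed
```

After the loop settles, the ratio tracks the sender's rate (about 1.0001) and the buffer fill stabilizes near its target; real implementations filter the fill estimate and use much smaller gains to avoid audible pitch modulation.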
Convention Paper 9270
P7-11 Analysis of Onset Detection with a Maximum Filter in Recordings of Bowed Instruments—Bartlomiej Stasiak, Lodz University of Technology - Lodz, Poland; Jedrzej Monko, Lodz University of Technology - Lodz, Poland
This work presents a new approach to assessing the quality of onset detection functions, using recordings of bowed instruments as an example. With this method we test a vibrato suppression technique based on a maximum filter. The results, obtained with the aid of a specially constructed database of audio recordings, reveal problems connected both with certain qualities of the sound produced by a bowed instrument and with the effectiveness of the onset detection process.
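The vibrato-suppression idea can be sketched as spectral flux computed against a reference frame that has been maximum-filtered along frequency (cf. the SuperFlux method), so that small frequency drifts between frames do not register as onsets. This is a generic illustration, not the authors' exact detection function; the filter width is illustrative.

```python
import numpy as np

def spectral_flux_maxfilter(S, width=3):
    """Positive spectral flux of spectrogram S (freq x time), where the
    previous frame is max-filtered over +/- width bins along frequency so
    vibrato-like drifts of spectral peaks produce no flux."""
    F, T = S.shape
    flux = np.zeros(T)
    for t in range(1, T):
        ref = S[:, t - 1]
        maxref = np.array([ref[max(0, f - width):f + width + 1].max()
                           for f in range(F)])
        flux[t] = np.sum(np.maximum(S[:, t] - maxref, 0.0))
    return flux
```

Onsets are then picked as peaks of the flux curve; the maximum filter is exactly what keeps a bowed instrument's vibrato from flooding that curve with false peaks.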
Convention Paper 9271
P7-12 An FPGA-Based Virtual Reality Audio System—Wolfgang Fohl, Hamburg University of Applied Sciences - Hamburg, Germany; David Hemmer, Hamburg University of Applied Sciences - Hamburg, Germany
A distributed system for mobile virtual reality audio is presented. The system consists of an audio server running on a PC or Mac, a remote control app for an iOS 6 device, and the mobile renderer running on a system-on-chip (SoC) with a CPU core and signal processing hardware. The server communicates with the renderer via WLAN: it sends audio streams over a custom lightweight protocol and exchanges status and control data as OSC (Open Sound Control) messages. On the mobile renderer, HRTF filters are applied to each audio signal according to the relative positions of the source and the listener’s head. The complete audio signal processing chain has been designed in Simulink, and the VHDL code for the SoC’s FPGA hardware has been generated automatically by Xilinx’s System Generator. The system is capable of rendering up to eight independent virtual sources.
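The per-source rendering step described above amounts to convolving each source with the head-related impulse response (HRIR) pair for its direction and summing the two ear channels. A minimal time-domain sketch (the actual system runs this filtering on the FPGA; all names here are hypothetical):

```python
import numpy as np

def render_binaural(sources):
    """sources: list of (mono_signal, hrir_left, hrir_right) arrays.
    Convolve each source with its HRIR pair and mix into two channels."""
    pairs = [(np.convolve(s, hl), np.convolve(s, hr)) for s, hl, hr in sources]
    n = max(max(len(l), len(r)) for l, r in pairs)
    left, right = np.zeros(n), np.zeros(n)
    for l, r in pairs:
        left[:len(l)] += l
        right[:len(r)] += r
    return left, right
```

With eight sources and typical HRIR lengths, this is eight stereo FIR convolutions per block, which is the workload the SoC's signal processing hardware absorbs.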
Convention Paper 9328