AES Paris 2016
Paper Session P2

P2 - Audio Signal Processing—Part 1: Coding, Encoding, and Perception


Saturday, June 4, 09:00 — 12:00 (Room 352B)

Chair:
Dejan Todorovic, Dirigent Acoustics - Belgrade, Serbia

P2-1 Low Complexity, Software Based, High Rate DSD Modulator Using Vector Quantification
Thierry Heeb, ISIN-SUPSI - Manno, Switzerland; Digimath - Sainte-Croix, Switzerland; Tiziano Leidi, ISIN-SUPSI - Manno, Switzerland; Diego Frei, ISIN-SUPSI - Manno, Switzerland; Alexandre Lavanchy, Engineered SA - Yverdon-les-Bains, Switzerland
High rate Direct Stream Digital (DSD) is emerging as a format of choice for the distribution of high-definition audio content. However, real-time encoding of such streams requires considerable computing resources due to their high sampling rates, constraining implementations to hardware-based platforms. In this paper we present a new modulator topology that reduces the computational load, making real-time high rate DSD encoding suitable for software-based implementation on off-the-shelf Digital Signal Processors (DSPs). We first present the architecture of the proposed modulator and then show results from a practical real-time implementation.
Convention Paper 9489 (Purchase now)
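
For context, a minimal sketch in Python of the 1-bit sigma-delta modulation that underlies DSD. This is a textbook first-order modulator, not the vector-quantization topology the paper proposes (which is not reproduced here); practical DSD encoders use higher-order noise-shaping loops, and the rate and signal values below are illustrative assumptions.

    import numpy as np

    def sigma_delta_1bit(x):
        # First-order noise shaping: integrate the error between the
        # input and the previous 1-bit output, then re-quantize to +/-1.
        y = np.empty(len(x), dtype=np.int8)
        integrator = 0.0
        feedback = 0.0
        for n in range(len(x)):
            integrator += x[n] - feedback
            y[n] = 1 if integrator >= 0.0 else -1
            feedback = float(y[n])
        return y

    # Illustrative input: a 1 kHz tone at the DSD64 rate (64 x 44.1 kHz).
    fs = 64 * 44100
    t = np.arange(fs // 100) / fs
    bits = sigma_delta_1bit(0.5 * np.sin(2 * np.pi * 1000 * t))

Even this sketch shows the burden the paper targets: the feedback loop is inherently sequential and runs at the full DSD rate, which is why low-complexity reformulations matter for software implementations.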

P2-2 Phase Derivative Correction of Bandwidth-Extended Signals for Perceptual Audio Codecs
Mikko-Ville Laitinen, Aalto University - Espoo, Finland; Sascha Disch, Fraunhofer IIS, Erlangen - Erlangen, Germany; Christopher Oates, Fraunhofer IIS - Erlangen, Germany; Ville Pulkki, Aalto University - Espoo, Finland
Bandwidth extension methods, such as spectral band replication (SBR), are often used in low-bit-rate codecs. They allow transmitting only a relatively narrow low-frequency region alongside parametric information about the higher bands. The signal for the higher bands is obtained by copying it from the transmitted low-frequency region. The copied-up signal is processed by multiplying its magnitude spectrum with suitable gains, based on the transmitted parameters, to approximate the magnitude spectrum of the original signal. However, the phase spectrum of the copied-up signal is typically not processed but used directly. In this paper we describe the perceptual consequences of using the copied-up phase spectrum directly. Based on the observed effects, two metrics for detecting the perceptually most significant artifacts are proposed. Building on these, we propose methods for correcting the phase spectrum, as well as strategies for minimizing the number of additional parameter values transmitted to perform the correction. Finally, the results of formal listening tests are presented.
Convention Paper 9490 (Purchase now)
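
To make the copy-up mechanism concrete, here is a toy single-frame sketch in Python. It is a deliberate simplification: real SBR operates on a QMF filter bank rather than a plain FFT, and the crossover bin and gain values are assumptions, but it shows the starting point the paper describes, where the magnitude envelope is corrected while the phase is reused verbatim.

    import numpy as np

    def copy_up(frame_spectrum, xover_bin, gains):
        # Replicate the low band into the high band: scale the copied
        # magnitudes toward the target envelope, but reuse the copied
        # phase unmodified, as plain bandwidth extension does.
        spec = frame_spectrum.copy()
        n_high = len(spec) - xover_bin
        src = spec[xover_bin - n_high:xover_bin]   # low-band source bins
        mag = np.abs(src) * gains                  # gains stand in for transmitted parameters
        phase = np.angle(src)                      # phase copied, not corrected
        spec[xover_bin:] = mag * np.exp(1j * phase)
        return spec

    x = np.random.randn(512)                   # stand-in audio frame
    spec = np.fft.rfft(x)                      # 257 bins
    extended = copy_up(spec, 129, np.ones(128))

The paper's contribution concerns exactly what this sketch leaves wrong: the copied phase generally has an incorrect phase derivative in the high band, and detecting and correcting that cheaply is the subject of the proposed metrics and methods.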

P2-3 AC-4 – The Next Generation Audio Codec
Kristofer Kjörling, Dolby Sweden AB - Stockholm, Sweden; Jonas Rödén, Dolby Sweden AB - Stockholm, Sweden; Martin Wolters, Dolby Germany GmbH - Nuremberg, Germany; Jeff Riedmiller, Dolby Laboratories - San Francisco, CA, USA; Arijit Biswas, Dolby Germany GmbH - Nuremberg, Germany; Per Ekstrand, Dolby Sweden AB - Stockholm, Sweden; Alexander Gröschel, Dolby Germany GmbH - Nuremberg, Germany; Per Hedelin, Dolby Sweden AB - Stockholm, Sweden; Toni Hirvonen, Dolby Laboratories - Stockholm, Sweden; Holger Hörich, Dolby Germany GmbH - Nuremberg, Germany; Janusz Klejsa, Dolby Sweden AB - Stockholm, Sweden; Jeroen Koppens, Dolby Sweden AB - Stockholm, Sweden; K. Krauss, Dolby Germany GmbH - Nuremberg, Germany; Heidi-Maria Lehtonen, Dolby Sweden AB - Stockholm, Sweden; Karsten Linzmeier, Dolby Germany GmbH - Nuremberg, Germany; Hannes Muesch, Dolby Laboratories, Inc. - San Francisco, CA, USA; Harald Mundt, Dolby Germany GmbH - Nuremberg, Germany; Scott Norcross, Dolby Laboratories - San Francisco, CA, USA; J. Popp, Dolby Germany GmbH - Nuremberg, Germany; Heiko Purnhagen, Dolby Sweden AB - Stockholm, Sweden; Jonas Samuelsson, Dolby Sweden AB - Stockholm, Sweden; Michael Schug, Dolby Germany GmbH - Nuremberg, Germany; L. Sehlström, Dolby Sweden AB - Stockholm, Sweden; R. Thesing, Dolby Germany GmbH - Nuremberg, Germany; Lars Villemoes, Dolby Sweden - Stockholm, Sweden; Mark Vinton, Dolby - San Francisco, CA, USA
AC-4 is a state-of-the-art audio codec standardized by ETSI (TS 103 190 and TS 103 190-2); TS 103 190 is also part of the DVB toolbox (TS 101 154). AC-4 is designed to address the current and future needs of video and audio entertainment services, including broadcast and Internet streaming. As such, it incorporates a number of features beyond traditional audio coding algorithms, such as support for immersive and personalized audio, advanced loudness management, video-frame-synchronous coding, and dialogue enhancement. This paper outlines the thinking behind the design of the AC-4 codec, explains the different coding tools used and the systemic features included, and gives an overview of performance and applications. [Also a poster—see session P5-6]
Convention Paper 9491 (Purchase now)

P2-4 Using Phase Information to Improve the Reconstruction Accuracy in Sinusoidal Modeling
Clara Hollomey, Glasgow Caledonian University - Glasgow, Scotland, UK; David Moore, Glasgow Caledonian University - Glasgow, Lanarkshire, UK; Don Knox, Glasgow Caledonian University - Glasgow, Scotland, UK; W. Owen Brimijoin, MRC/CSO Institute of Hearing Research - Glasgow, Scotland, UK; William Whitmer, MRC/CSO Institute of Hearing Research - Glasgow, Scotland, UK
Sinusoidal modeling is one of the most common techniques for general-purpose audio synthesis and analysis. Owing to the ever-increasing amount of available computational resources, practically all types of sounds can nowadays be constructed up to a certain degree of perceptual accuracy. However, the method is computationally expensive and can, in some cases, particularly for transient signals, still exceed the available computational resources. In this work methods derived from the realm of machine learning are exploited to provide a simple and efficient means of estimating the achievable reconstruction quality. The peculiarities of common classes of musical instruments are discussed, and finally the existing metrics are extended with information on the signal's phase propagation to allow for more accurate estimates. [Also a poster—see session P5-8]
Convention Paper 9492 (Purchase now)
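
For readers unfamiliar with the underlying model, a minimal additive-resynthesis sketch in Python follows. The track shapes and values are assumptions, and the paper's machine-learning quality estimator is not reproduced; the sketch only illustrates the role of phase propagation, namely that each partial's phase is the running integral of its instantaneous frequency.

    import numpy as np

    def synth_partials(freqs, amps, fs, phases0=None):
        # freqs, amps: (n_partials, n_samples) sample-rate tracks.
        # Phase is propagated by integrating instantaneous frequency,
        # keeping each partial coherent from sample to sample.
        n_partials, _ = freqs.shape
        if phases0 is None:
            phases0 = np.zeros(n_partials)
        phase = phases0[:, None] + 2 * np.pi * np.cumsum(freqs / fs, axis=1)
        return (amps * np.cos(phase)).sum(axis=0)

    fs = 16000
    n = fs // 10                               # 100 ms of output
    tracks_f = np.vstack([np.full(n, 440.0), np.full(n, 880.0)])
    tracks_a = np.vstack([np.full(n, 0.6), np.full(n, 0.3)])
    y = synth_partials(tracks_f, tracks_a, fs)

Transient signals need many short-lived partials with rapidly changing phase; tracking them accurately is what can exceed a real-time budget, which motivates estimating the achievable reconstruction quality in advance.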

P2-5 Equalization of Spectral Dips Using Detection Thresholds
Sunil G. Bharitkar, HP Labs., Inc. - San Francisco, CA, USA; Charles Q. Robinson, Dolby Laboratories - San Francisco, CA, USA; Andrew Poulain, Dolby - San Jose, CA, USA
Frequency response equalization is often performed to improve audio reproduction. Deviations from the target system response due to playback equipment or room acoustics can result in perceptible timbre distortion. In the first part of this paper we describe experiments conducted to determine the audibility of artificially introduced spectral dips. In particular, we measured the notch depth detection threshold (dependent variable) as a function of notch center frequency and Q-factor (independent variables). Listening tests were administered to 10 listeners in a small listening room and in a screening room (a small cinema with approximately 100 seats). Pink noise was used as the stimulus because it is perceptually flat (its power spectrum falls at roughly 3 dB/octave) and is known to be a reliable and discriminating signal for timbre judgments. The listeners gave consistent notch depth results with low variability around the mean. The notch audibility data were then used to develop multiple candidate algorithms that generate equalization curves designed to perceptually match a desired target response while minimizing the applied equalization gain. Informal subjective results validated the performance of the final algorithm.
Convention Paper 9493 (Purchase now)
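
A spectral dip of the kind used in these experiments can be realized with a standard RBJ peaking-EQ biquad driven with negative gain; depth, center frequency, and Q map directly onto the abstract's variables. A minimal Python sketch, with illustrative (not the paper's) parameter values:

    import numpy as np
    from scipy.signal import lfilter

    def peaking_dip(f0, q, depth_db, fs):
        # RBJ audio-EQ-cookbook peaking filter with negative gain,
        # i.e., a dip of depth_db centered at f0, bandwidth set by q.
        a_lin = 10.0 ** (-depth_db / 40.0)
        w0 = 2 * np.pi * f0 / fs
        alpha = np.sin(w0) / (2 * q)
        b = np.array([1 + alpha * a_lin, -2 * np.cos(w0), 1 - alpha * a_lin])
        a = np.array([1 + alpha / a_lin, -2 * np.cos(w0), 1 - alpha / a_lin])
        return b / a[0], a / a[0]

    # 12 dB dip at 1 kHz, Q = 2, applied to white noise standing in
    # for the pink-noise stimulus (values are hypothetical).
    fs = 48000
    b, a = peaking_dip(1000.0, 2.0, 12.0, fs)
    dipped = lfilter(b, a, np.random.randn(fs))

Measuring the depth at which such a dip first becomes audible, per center frequency and Q, yields the threshold surface the equalization algorithms are then designed against.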

P2-6 Single-Channel Audio Source Separation Using Deep Neural Network Ensembles
Emad M. Grais, University of Surrey - Guildford, Surrey, UK; Gerard Roma, University of Surrey - Guildford, Surrey, UK; Andrew J. R. Simpson, University of Surrey - Guildford, Surrey, UK; Mark D. Plumbley, University of Surrey - Guildford, Surrey, UK
Deep neural networks (DNNs) are often used to tackle the single-channel source separation (SCSS) problem by predicting time-frequency masks. The predicted masks are then used to separate the sources from the mixed signal. Different types of masks produce separated sources with different levels of distortion and interference: some yield low distortion, while others yield low interference between the separated sources. In this paper a combination of different DNNs' predictions (masks) is used for SCSS to achieve better quality of the separated sources than any individual DNN achieves alone. We train four different DNNs by minimizing four different cost functions to predict four different masks. The first and second DNNs are trained to approximate reference binary and soft masks. The third DNN is trained to predict a mask from the reference sources directly. The last DNN is trained similarly to the third but with an additional discriminative constraint that maximizes the differences between the estimated sources. Our experimental results show that combining the predictions of different DNNs yields separated sources of better quality than those of each DNN individually. [Also a poster—see session P5-7]
Convention Paper 9494 (Purchase now)
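
The combination step can be pictured with a short Python sketch. Weighted averaging is used here as the simplest fusion rule and is an assumption, not necessarily the paper's scheme; the four masks simply stand in for the four DNN outputs described above.

    import numpy as np

    def ensemble_separate(mix_stft, masks, weights=None):
        # Fuse several time-frequency masks (binary, soft, direct,
        # discriminative) by weighted averaging, then mask the mixture.
        masks = np.asarray(masks)                  # (n_models, freq, time)
        if weights is None:
            weights = np.full(len(masks), 1.0 / len(masks))
        fused = np.tensordot(weights, masks, axes=1)
        fused = np.clip(fused, 0.0, 1.0)           # keep the mask in [0, 1]
        return fused * mix_stft                    # separated-source estimate

    # Toy usage: four model outputs for a 257-bin, 100-frame mixture.
    mix = np.random.randn(257, 100) + 1j * np.random.randn(257, 100)
    masks = [np.random.rand(257, 100) for _ in range(4)]
    source_stft = ensemble_separate(mix, masks)

Averaging trades the complementary error profiles of the individual masks (low distortion from some, low interference from others) against each other, which is the intuition behind the reported quality gains.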

