AES New York 2017
Poster Session P08
P08 - Signal Processing
Thursday, October 19, 11:00 am — 12:30 pm (Poster Area)
P08-1 A Simplified 2-Layer Text-Dependent Speaker Authentication System—Giacomo Valenti, NXP Software - Mougins, France; EURECOM - Biot, France; Adrien Daniel, NXP Software - Mougins, France; Nicholas Evans, EURECOM - Sophia Antipolis, France
This paper describes a variation of the well-known HiLAM approach to speaker authentication that enables reliable text-dependent speaker recognition with short-duration enrollment. The modifications introduced in this system eliminate the need for an intermediate text-independent speaker model. While the simplified system is admittedly a modest modification to the original work, it delivers comparable levels of automatic speaker verification performance while requiring 97% less speaker enrollment data. Such a significant reduction in enrollment data improves usability and supports speaker authentication for smart device and Internet of Things applications.
Convention Paper 9844 (Purchase now)
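The abstract mentions that the simplified system drops HiLAM's intermediate text-independent speaker model, which in the original architecture is obtained by relevance-MAP adaptation of a universal background model (GMM-UBM). As a rough, hypothetical illustration of that adaptation step (toy data and dimensions, not from the paper; hard assignment stands in for the usual soft posterior counts):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy UBM: means of a 4-component GMM in 2-D, placed far apart for clarity.
ubm_means = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0], [0.0, 10.0]])
# Hypothetical enrollment frames, clustered near component 1.
enrol = ubm_means[1] + rng.normal(0.0, 0.1, (50, 2))

def map_adapt_means(ubm_means, frames, r=16.0):
    """Relevance-MAP adaptation of GMM means (classic GMM-UBM step).

    Each frame is hard-assigned to its nearest component for brevity;
    real systems weight by component posteriors instead.
    """
    adapted = ubm_means.copy()
    assign = np.argmin(((frames[:, None, :] - ubm_means) ** 2).sum(-1), axis=1)
    for k in range(len(ubm_means)):
        n_k = np.sum(assign == k)
        if n_k:
            alpha = n_k / (n_k + r)  # data-count-dependent interpolation weight
            adapted[k] = alpha * frames[assign == k].mean(axis=0) + (1 - alpha) * ubm_means[k]
    return adapted

speaker_means = map_adapt_means(ubm_means, enrol)
```

Components with no enrollment data stay at the UBM prior, which is why this step normally needs the long enrollment the simplified system avoids.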
P08-2 Binaural Sound Source Separation Based on Directional Power Spectral Densities—Joel Augusto Luft, Instituto Federal de Educação, Ciência e Tecnologia do Rio Grande do Sul - Canoas, RS, Brazil; Universidade Federal do Rio Grande do Sul - Porto Alegre, RS, Brazil; Fabio I. Pereira, Federal University of Rio Grande do Sul - Porto Alegre, Brazil; Altamiro Susin, Federal University of Rio Grande do Sul - Porto Alegre, Brazil
Microphone arrays are a common choice for spatial sound source separation. In this paper a new method for binaural source separation is presented. The separation is performed using the spatial positions of the sound sources, the Head-Related Transfer Function, and the Power Spectral Densities of fixed beamformers. A non-negative constrained least-squares minimization approach is used to solve the Head-Related Transfer Function based directivity gain formulation, and the Power Spectral Density is used as a magnitude estimate of the sound sources. Simulation examples are presented to demonstrate the performance of the proposed algorithm.
Convention Paper 9845 (Purchase now)
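The non-negative constrained least-squares step the abstract describes can be sketched as follows, with a random matrix standing in for the HRTF-based directivity gains and noise-free synthetic beamformer PSDs (all values hypothetical, not from the paper):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Hypothetical directivity-gain matrix: 8 fixed beamformers x 3 candidate directions.
A = rng.uniform(0.0, 1.0, size=(8, 3))
true_psd = np.array([1.5, 0.0, 0.7])  # ground-truth source PSDs (one direction inactive)
b = A @ true_psd                      # observed beamformer PSDs (noise-free toy case)

# Non-negative constrained least squares: minimize ||A x - b|| subject to x >= 0.
psd_est, residual = nnls(A, b)
```

PSDs are physically non-negative, which is what motivates the constraint: plain least squares could return negative power estimates in noisy, ill-conditioned cases.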
P08-3 Improving Neural Net Auto Encoders for Music Synthesis—Joseph Colonel, The Cooper Union for the Advancement of Science and Art - New York, NY, USA; Christopher Curro, The Cooper Union for the Advancement of Science and Art - New York, NY, USA; Sam Keene, The Cooper Union for the Advancement of Science and Art - New York, NY, USA
We present a novel architecture for a synthesizer based on an autoencoder that compresses and reconstructs magnitude short-time Fourier transform frames. This architecture outperforms previous topologies by using improved regularization, employing several activation functions, creating a focused training corpus, and implementing the Adam learning method. By applying gains to the hidden layer, users can alter the autoencoder’s output, which opens up a palette of sounds unavailable to additive/subtractive synthesizers. Furthermore, our architecture can be quickly re-trained on any sound domain, making it flexible for music synthesis applications. Samples of the autoencoder’s outputs can be found at http://soundcloud.com/ann_synth, and the code used to generate and train the autoencoder is open source, hosted at http://github.com/JTColonel/ann_synth.
Convention Paper 9846 (Purchase now)
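The hidden-layer gain control described in the abstract can be illustrated with a toy forward pass (random untrained weights and made-up layer sizes, purely a sketch of the idea, not the authors' trained model):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dimensions standing in for an STFT-frame autoencoder (hypothetical sizes).
n_bins, n_hidden = 513, 8
W_enc = rng.standard_normal((n_hidden, n_bins)) * 0.01
W_dec = rng.standard_normal((n_bins, n_hidden)) * 0.01

def decode(frame, gains):
    """Encode a magnitude-STFT frame, scale hidden units by user gains, decode."""
    hidden = np.maximum(0.0, W_enc @ frame)   # ReLU encoder (one possible activation)
    return W_dec @ (gains * hidden)           # user gains reshape the latent code

frame = np.abs(rng.standard_normal(n_bins))
neutral = decode(frame, np.ones(n_hidden))        # unit gains: plain reconstruction
boosted = decode(frame, 2.0 * np.ones(n_hidden))  # uniform boost scales the output
```

Because the gains act on the learned latent code rather than on individual partials, non-uniform gain vectors produce timbres a fixed bank of oscillators or filters cannot, which is the "palette of sounds" the abstract refers to.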
P08-4 Comparative Study of Self-Organizing Maps vs Subjective Evaluation of Quality of Allophone Pronunciation for Non-native English Speakers—Bozena Kostek, Gdansk University of Technology - Gdansk, Poland; Audio Acoustics Lab.; Magdalena Piotrowska, Gdansk University of Technology - Gdansk, Poland; Tomasz Ciszewski, University of Gdansk - Gdansk, Poland; Andrzej Czyzewski, Gdansk University of Technology - Gdansk, Poland
The purpose of this study was to apply Self-Organizing Maps to differentiate between correct and incorrect allophone pronunciations and to compare the results with subjective evaluation. Recordings of a list of target words, containing selected allophones of English plosive consonants, the velar nasal, and the lateral consonant, were made twice. First, the target words were read from the list by nine non-native speakers and then repeated after a phonology expert’s recorded sample. Afterwards, the two recorded signal sets were segmented into allophones and parameterized. For that purpose, a set of descriptors commonly employed in music information retrieval was utilized, to determine whether such descriptors are effective in allophone analysis. The phonology expert’s task was to evaluate the pronunciation accuracy of each uttered allophone. Extracted feature vectors, along with the assigned ratings, were applied to SOMs.
Convention Paper 9847 (Purchase now)
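A minimal Self-Organizing Map of the kind the abstract applies to allophone feature vectors can be sketched as follows (synthetic two-cluster features stand in for the MIR-style descriptors; map size, rates, and schedules are illustrative choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical allophone feature vectors: two well-separated quality classes.
features = np.vstack([rng.normal(0.0, 0.1, (20, 4)),
                      rng.normal(1.0, 0.1, (20, 4))])

# A tiny 1-D SOM: 6 nodes on a line, each holding a 4-D codebook vector.
nodes = rng.uniform(0.0, 1.0, (6, 4))
positions = np.arange(6)

for epoch in range(50):
    lr = 0.5 * (1 - epoch / 50)                 # decaying learning rate
    sigma = 2.0 * (1 - epoch / 50) + 0.5        # shrinking neighborhood width
    for x in rng.permutation(features):
        bmu = np.argmin(np.linalg.norm(nodes - x, axis=1))         # best-matching unit
        h = np.exp(-((positions - bmu) ** 2) / (2 * sigma ** 2))   # neighborhood kernel
        nodes += lr * h[:, None] * (x - nodes)                     # pull nodes toward x
```

After training, distinct regions of the map respond to the two clusters, which is what lets SOM activations be compared against the expert's pronunciation ratings.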
P08-5 Automatic Masking Reduction in Balance Mixes Using Evolutionary Computing—Nicholas Jillings, Birmingham City University - Birmingham, UK; Ryan Stables, Birmingham City University - Birmingham, UK
Music production is a highly subjective task, which can be difficult to automate. Simple session structures can quickly expose complex mathematical tasks which are difficult to optimize. This paper presents a method for the reduction of masking in an unknown mix using genetic programming. The model uses results from a series of listening tests to guide its cost function. The program then returns a vector that best minimizes this cost. The paper explains the limitations of using such a method for audio and validates the results.
Convention Paper 9813 (Purchase now)
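The evolutionary search over gain vectors described in the abstract can be sketched with a crude stand-in cost (spectral overlap as a masking proxy; the paper's actual cost is derived from listening tests, and all data here is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy magnitude spectra for two "tracks" in the mix (hypothetical stand-in data).
spectra = rng.uniform(0.0, 1.0, (2, 16))

def masking_cost(gains):
    """Crude masking proxy: shared energy between the gain-scaled tracks,
    plus a penalty for drifting away from the raw mix level."""
    mix = gains[:, None] * spectra
    overlap = np.minimum(mix[0], mix[1]).sum()
    balance = abs(mix.sum() - spectra.sum())
    return overlap + balance

# Simple elitist evolutionary loop over candidate gain vectors.
pop = rng.uniform(0.2, 2.0, (30, 2))
baseline = min(masking_cost(g) for g in pop)      # best cost at generation 0
for gen in range(100):
    costs = np.array([masking_cost(g) for g in pop])
    parents = pop[np.argsort(costs)[:10]]          # keep the 10 fittest (elitism)
    children = parents.repeat(2, axis=0) + rng.normal(0.0, 0.05, (20, 2))
    pop = np.vstack([parents, np.clip(children, 0.0, None)])

best = pop[np.argmin([masking_cost(g) for g in pop])]
best_cost = masking_cost(best)
```

Because the fittest candidates survive every generation, the best cost never worsens, though, as the abstract notes for such black-box methods, there is no guarantee of reaching a global optimum.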