Session L: LOW BIT RATE AUDIO CODING - PART 2
Sunday, May 12, 14:00-18:00 h

MDCT-based perceptual audio coders shape the quantization noise according to simple psychoacoustic rules and general behavioral aspects of the audio signal, such as stationarity and tonality. As a consequence, the resulting compressed audio representation has little semantic value, which makes MPEG-7 oriented operations such as feature extraction and audio modification difficult to perform directly in the compressed domain. First results in this direction are reported using an enhanced version of an MDCT-based perceptual coder that implements sinusoidal modeling and subtraction directly in the MDCT frequency domain, as well as spectral envelope modeling and normalization. The implications for coding efficiency are also addressed.

The application of the matching pursuit algorithm to extracting sinusoidal components and transients from audio signals is proposed. The resulting residue is perceptually modeled as a noise-like signal. This multi-part model (Sines + Transients + Noise) is used for audio coding purposes. First, an accurate detection of transients in the audio signal is required. When a transient is detected, energy-adapted matching pursuits are carried out using a wavelet-packet based dictionary and a dictionary of sinusoidal functions. Otherwise, the matching pursuit algorithm is applied with the harmonic dictionary only. In both cases, the resulting residue is then modeled as a noise-like signal using the Equivalent Rectangular Bandwidth (ERB) model. The parameters of this multi-part model are efficiently quantized, taking psychoacoustic information into account, so as to ensure high perceptual quality at low bit rates. The combination of all these ideas results in nearly transparent audio coding at bit rates below 32 kbps for most of the CD-quality single-channel audio signals considered for testing.

Total Least Squares (TLS) algorithms automatically decompose (audio) frames into a number of exponentially damped sinusoids.
This can provide more efficient modeling than plain sinusoidal modeling, especially in the case of transitional frames. Straightforward implementations of TLS optimize an SNR criterion. In our implementation we apply TLS in a sub-band scheme in which the number of damped sinusoids is both frame and sub-band dependent. This is made possible through the use of perceptual information provided by the MPEG-1 psychoacoustic model 1. Experiments on different audio tracks provide proof of concept for our perceptual ESM, and illustrate the significant reduction in modeling components compared to a non-perceptual ESM.

High-order linear predictive coding (LPC) analysis, as a pre-processing stage in an audio codec designed for wideband arbitrary audio signals, is found to be particularly beneficial for audio samples of an instrumental nature compared to those of a vocal nature. With increasing LPC orders, it is imperative to keep the bits consumed by the pre-processing stage constant as a proportion of the total bit rate. To achieve this, the properties of the Line Spectrum Pairs (LSP) parameters are exploited in a proposed multistage vector quantization scheme for high-order LPC. Notably, incorporating LSP differences in the design of the quantizer was the most efficient approach, with no perceptible differences at an average of 1.645 bits/sample compared to scalar quantization, which is used as a benchmark at 2 bits/sample. In particular, using LSP differences as a bit allocation mechanism proves especially effective in dealing with clips of a percussive nature.

A low-delay variable bit rate audio codec, implemented for wideband arbitrary audio signals, combines inter-frame and intra-frame bit allocation in an adaptive scheme. An outer loop uses a moving average noise-to-mask ratio (NMR) indicator and a bit reserve to adaptively allocate bits from frames of lesser perceptual significance to frames of greater perceptual significance.
An inner loop allocates the available bits to each line of the spectrum via an adaptive algorithm based on a weighting function derived from the masking thresholds. In informal listening tests, the proposed bit allocation method improved audio quality for most samples compared to a method using a single adaptive intra-frame loop. In particular, these improvements were more perceptible at the lower bit rates of about 36 kbps than at the higher bit rates of about 64 kbps. Numerical results also indicate a saving of 8-10% of the total bit rate.

Binaural Cue Coding (BCC) is an efficient representation for spatial audio that can be applied to stereo and multi-channel audio compression. Conventional mono audio coders are enhanced with BCC for coding of stereo and multi-channel audio signals. Encoding stereo and multi-channel audio signals adds only a relatively small bit rate overhead to the bit rate of the mono audio coder alone. The presented implementations have low complexity and are suitable for real-time applications. Results from subjective tests suggest that the proposed scheme provides better audio quality for encoding of stereo audio signals than conventional perceptual transform audio coders over a wide range of bit rates.

Intensity Stereo Coding (ISC) is a joint-channel audio coding tool that is part of the ISO/MPEG standards. ISC can introduce severe distortions if applied to the full bandwidth or to audio signals with a dynamic or wide spatial image. In contrast, Binaural Cue Coding (BCC) is a systematic approach for representing auditory spatial cues which includes ISC as a subset. BCC is independent of the time/frequency resolution used by the coder, so it can be optimized for spatial image reproduction. Subjective listening tests confirm that ISC is significantly compromised by an inappropriate time/frequency resolution and that BCC has superior quality and robustness.

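The general idea of sending a mono downmix plus a small set of spatial cues can be sketched as follows. This is only an illustration under strong simplifying assumptions, not the scheme from the abstracts: it uses a single full-band level-difference cue per time frame and no time-difference cue, whereas a practical system would work per frequency band. The function names `bcc_encode` and `bcc_decode` and the frame length `FRAME` are hypothetical.

```python
import math

FRAME = 256  # hypothetical frame length in samples


def bcc_encode(left, right, frame=FRAME):
    """Return a mono downmix and one level-difference cue (in dB) per frame.

    Assumes left and right have the same length.
    """
    mono, cues_db = [], []
    tiny = 1e-12  # avoids log(0) on silent frames
    for start in range(0, len(left), frame):
        l = left[start:start + frame]
        r = right[start:start + frame]
        # Mono downmix: average of the two channels.
        mono.extend(0.5 * (a + b) for a, b in zip(l, r))
        # Level-difference cue: energy ratio right/left in dB.
        energy_l = sum(a * a for a in l) + tiny
        energy_r = sum(b * b for b in r) + tiny
        cues_db.append(10.0 * math.log10(energy_r / energy_l))
    return mono, cues_db


def bcc_decode(mono, cues_db, frame=FRAME):
    """Rebuild a stereo pair by scaling the mono signal frame by frame."""
    left, right = [], []
    for i, cue in enumerate(cues_db):
        ratio = 10.0 ** (cue / 20.0)  # right/left amplitude ratio
        # Gains are chosen so that averaging the outputs gives back the mono signal.
        gain_l = 2.0 / (1.0 + ratio)
        gain_r = 2.0 * ratio / (1.0 + ratio)
        block = mono[i * frame:(i + 1) * frame]
        left.extend(gain_l * m for m in block)
        right.extend(gain_r * m for m in block)
    return left, right
```

For a single source that is amplitude-panned between the channels, this toy round trip reproduces both channels; for general stereo material it preserves only the per-frame level relationship, which is why a real system needs finer time/frequency resolution and additional cues.
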
Perceptual audio coding of high-quality audio signals is now widely used. To reproduce the audio data, the decoding algorithm expands the bitstream into an uncompressed audio format. As shown previously, it is feasible to recover the encoder's compression parameters from the decoded audio signal and even to translate a decoded audio signal back into its original bitstream representation. This technique is referred to as inverse decoding and has several interesting applications, including tandem-resistant re-encoding of audio signals. The paper presents practical results obtained with the first working implementation of an inverse decoder based on the popular MP3 coder. The performance of the algorithm is evaluated in terms of reconstruction precision and computational complexity. Finally, algorithmic issues are discussed.
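As a toy illustration of why encoder parameters can be recoverable from decoded data, the sketch below quantizes a few values with a uniform step and then finds that step again from the quantized values alone. It is not the inverse decoder described in the abstract: MP3 uses a non-uniform quantizer behind a filterbank, and a real system must also recover framing and other parameters. The helper names `quantize` and `estimate_step` and the candidate step list are assumptions made for this example.

```python
def quantize(values, step):
    """Uniform quantizer: snap each value to the nearest multiple of step."""
    return [step * round(v / step) for v in values]


def estimate_step(decoded, candidates, tol=1e-9):
    """Return the largest candidate step whose grid contains every decoded value.

    Returns None if no candidate fits. Every submultiple of the true step also
    fits the data, so the largest fitting candidate is taken as the estimate.
    """
    for step in sorted(candidates, reverse=True):
        if all(abs(v / step - round(v / step)) < tol for v in decoded):
            return step
    return None


original = [0.13, -0.52, 0.91, 0.37, -0.08]
decoded = quantize(original, 0.25)
recovered = estimate_step(decoded, [1.0, 0.5, 0.25, 0.125, 0.0625])
```

The estimate is only as good as the data: if all decoded values happened to be multiples of a coarser step, that coarser step would be returned instead.
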