Representing signals using only timing information and feature extraction for automatic speech recognition

Yadong Wang, University of Rhode Island


Acoustic signals are separated into frequency components by filtering in the inner ear and are represented by streams of spike trains in the auditory nerve. Typically, the spikes are all-or-none stereotyped waveforms, so all the information they carry is encoded in their timing. This raises the question: how can bandpass signals be represented by timing information only?

It is well known that only a special class of bandpass signals, called Real-Zero (RZ) signals, can be uniquely represented by their zero-crossings. However, it is possible to invertibly map an arbitrary bandpass signal into an RZ signal and thereby implicitly represent the bandpass signal by the zero-crossings of the mapped RZ signal. This mapping is known as Real-Zero Conversion (RZC). In this dissertation, a class of novel signal-adaptive RZC algorithms is proposed. Specifically, algorithms are presented that convert an arbitrary bandpass signal into other signals whose zero-crossings contain sufficient information to represent the phase and envelope of the bandpass signal. Since the proposed zero-crossings are not those of the original signal, but are only indirectly related to it, they are called hidden or Covert Zero Crossings (CoZeCs). Rational signal functions of the complex-time variable ξ are reviewed. The proposed algorithms are used to represent speech signals processed through an analysis filter bank, and it is shown that these signals can be reconstructed from the CoZeCs.

Based on this model, a novel approach to extracting noise-robust features for speech recognition is developed in the second part of this dissertation. Speech is first processed by a bank of bandpass filters. At the output of each bandpass filter, the signal is subjected to a log-derivative operation, which naturally decomposes the bandpass signal into analytic and anti-analytic components.
The average instantaneous frequency (AIF) and average log-envelope (ALE) are then extracted as coarse features at the output of each filter. Further refined features may also be extracted from the analytic and anti-analytic components through a sequential hierarchy of modulation analysis. Speech recognition experiments were performed on the Aurora 2 task. For clean training, the AIF/ALE front end achieves an overall improvement of 7.97% in accuracy rates compared to the mel-cepstrum front end.
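
To make the two coarse features concrete, the following is a minimal sketch of one plausible way to compute an average instantaneous frequency and an average log-envelope for a single filter output. It is an illustration under stated assumptions, not the dissertation's method: it does not use the log-derivative decomposition described above, and instead swaps in the standard DFT-based analytic signal. The function names `dft`, `analytic_signal`, and `aif_ale`, the frame length, the sampling rate, and the use of unweighted averages are all hypothetical choices made for this example.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform; adequate for short frames."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def analytic_signal(x):
    """Analytic signal of a real frame: keep DC/Nyquist, double positive
    frequencies, and zero negative frequencies (assumed stand-in for the
    analytic component discussed in the abstract)."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            continue                # DC and Nyquist bins are left unchanged
        if k < (n + 1) // 2:
            spectrum[k] *= 2.0      # positive-frequency bins
        else:
            spectrum[k] = 0j        # negative-frequency bins
    return dft(spectrum, inverse=True)


def aif_ale(x, fs):
    """Return (average instantaneous frequency in Hz, average log-envelope)
    for one frame of one band. Unweighted averaging is an assumption."""
    z = analytic_signal(x)
    # Instantaneous frequency from the phase increment between samples.
    inst_freq = [
        cmath.phase(z[t + 1] * z[t].conjugate()) * fs / (2.0 * math.pi)
        for t in range(len(z) - 1)
    ]
    aif = sum(inst_freq) / len(inst_freq)
    # Small floor guards against log(0) on silent samples.
    ale = sum(math.log(max(abs(v), 1e-12)) for v in z) / len(z)
    return aif, ale


# Usage: a 1 kHz tone of amplitude 0.5 sampled at 8 kHz (illustrative values).
fs = 8000.0
tone = [0.5 * math.cos(2.0 * math.pi * 1000.0 * t / fs) for t in range(256)]
aif, ale = aif_ale(tone, fs)
print(aif, ale)
```

For this pure tone, the two values should come out close to the tone frequency and to the logarithm of its amplitude, which is the sense in which the pair summarizes the carrier and the envelope of a band. A practical front end would apply such a computation per frame and per filter, and would use an FFT rather than the naive transform shown here.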

Subject Area

Engineering, Electronics and Electrical

Recommended Citation

Yadong Wang, "Representing signals using only timing information and feature extraction for automatic speech recognition" (2003). Dissertations and Master's Theses (Campus Access). Paper AAI3115640.