Over time, reliable speech communication and recognition systems will become increasingly bimodal: both audio and visual information will be captured, transmitted or stored, and processed. The interaction between research areas using audio and visual information has opened the door to many bimodal applications.

Different ways of exploiting this interaction are explored in the work presented in this thesis, focusing on the extraction of MPEG-4 compliant Facial Animation Parameters (FAPs), the use of such parameters for robust audio-visual speech recognition, speech-driven facial animation, audio-visual person recognition, and automatic facial expression recognition. MPEG-4 is expected to become a dominant standard in a number of applications, and working within its framework therefore adds to the usefulness and applicability of this work.

A novel, automatic, and robust visual feature extraction approach is developed in this work that combines active contour and deformable template algorithms and requires no prior knowledge about the data, extensive computational training, or hand labeling.

The audio-visual continuous speech recognition system developed here significantly improves speech recognition performance over a wide range of acoustic noise levels and for different dimensionalities of visual features. The speech recognition experiments were performed on a relatively large-vocabulary audio-visual database, and the improvement in automatic speech recognition (ASR) performance obtainable by exploiting the visual speech information contained in outer- and inner-lip FAPs was determined.

The HMM-based speech-to-video synthesis system developed integrates acoustic HMMs (AHMMs) and visual HMMs (VHMMs), allowing independent modeling of the acoustic and visual signals. The acoustic state sequence is mapped into a visual state sequence using a correlation HMM (CHMM); a schematic sketch of this mapping is given below. The resulting visual state sequence is used to produce a sequence of visual observations (FAPs). The performance of the system was evaluated through several objective experiments, which showed that the proposed speech-to-video synthesis system significantly reduces time-alignment errors compared to the conventional temporal scaling method. The objective FAP comparison results confirmed the strong similarity between the synthesized FAPs and the original FAPs.

In addition, audio-visual person verification and automatic facial expression recognition systems are developed and described in this thesis.
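The AHMM-to-VHMM mapping summarized above can be illustrated schematically. The following is a minimal sketch of the general idea, not the system developed in the thesis: all dimensions, probability tables, and the per-frame mapping rule are illustrative assumptions, and a full CHMM would decode the visual state sequence jointly under visual transition constraints rather than mapping frame by frame.

```python
import numpy as np

# Hypothetical toy dimensions: 3 acoustic states, 3 visual states,
# 4-dimensional FAP vectors (a real system uses far more of each).
N_A, N_V, FAP_DIM = 3, 3, 4
EPS = 1e-12
rng = np.random.default_rng(0)

# --- Acoustic HMM (AHMM): decode an acoustic state sequence ---
log_trans_a = np.log(np.array([[0.8, 0.2, 0.0],
                               [0.0, 0.8, 0.2],
                               [0.1, 0.0, 0.9]]) + EPS)
log_init_a = np.log(np.array([1.0, 0.0, 0.0]) + EPS)

def viterbi(log_emit, log_init, log_trans):
    """Most likely state path given per-frame log emission scores."""
    T, N = log_emit.shape
    delta = np.full((T, N), -np.inf)
    psi = np.zeros((T, N), dtype=int)
    delta[0] = log_init + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path

# Stand-in acoustic emission scores for a 10-frame utterance.
log_emit_a = rng.normal(size=(10, N_A))
acoustic_states = viterbi(log_emit_a, log_init_a, log_trans_a)

# --- Correlation model: P(visual state | acoustic state) ---
# Assumed to be estimated from time-aligned audio-visual training
# data; the values here are placeholders.
p_v_given_a = np.array([[0.9, 0.1, 0.0],
                        [0.1, 0.8, 0.1],
                        [0.0, 0.2, 0.8]])
# Simplification: per-frame MAP visual state instead of joint decoding.
visual_states = p_v_given_a[acoustic_states].argmax(axis=1)

# --- Visual HMM (VHMM): emit one FAP vector per visual state ---
# Here each visual state simply emits its Gaussian mean FAP vector.
fap_means = rng.normal(size=(N_V, FAP_DIM))
fap_sequence = fap_means[visual_states]
print(fap_sequence.shape)  # (10, 4): one FAP vector per frame
```

The key design point this sketch reflects is the one stated in the abstract: the acoustic and visual signals are modeled independently (separate AHMM and VHMM), with the correlation model serving only as the bridge between the two state spaces.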