
Audio-visual interactions in multimodal communications using facial animation parameters.



Abstract

Over time, reliable speech communication and recognition systems will become increasingly bimodal: both audio and visual information will be captured, transmitted or stored, and processed. The interaction between research areas that use audio and visual information has opened the door to many bimodal applications.

The work presented in this thesis explores different ways of exploiting this interaction, focusing on the extraction of MPEG-4 compliant Facial Animation Parameters (FAPs) and their use for robust audio-visual speech recognition, speech-driven facial animation, audio-visual person recognition, and automatic facial expression recognition. MPEG-4 is expected to become a dominant standard in a number of applications; working within its framework therefore adds to the usefulness and applicability of this work.

A novel, automatic, and robust visual feature extraction approach is developed that combines active contour and deformable template algorithms and requires no prior knowledge about the data, extensive computational training, or hand labeling.

The audio-visual continuous speech recognition system developed significantly improves recognition performance over a wide range of acoustic noise levels and for different dimensionalities of the visual features. The speech recognition experiments were performed on a relatively large-vocabulary audio-visual database, and the improvement in ASR performance that can be obtained by exploiting the visual speech information contained in outer- and inner-lip FAPs was determined.

The HMM-based speech-to-video synthesis system developed integrates acoustic HMMs (AHMMs) and visual HMMs (VHMMs), which allows the acoustic and visual signals to be modeled independently. The acoustic state sequence is mapped into a visual state sequence using a correlation HMM (CHMM), and the resulting visual state sequence is used to produce a sequence of visual observations (FAPs). The performance of the system was evaluated through several objective experiments, which showed that the proposed speech-to-video synthesis system significantly reduces time-alignment errors compared with the conventional temporal scaling method. The objective FAP comparison confirmed the strong similarity between the synthesized FAPs and the original FAPs.

In addition, audio-visual person verification and automatic facial expression recognition systems are developed and described in this thesis.
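The abstract only summarizes the AHMM/VHMM/CHMM pipeline. As a rough, hypothetical illustration of that kind of mapping (the function names, the log_corr affinity matrix, and the Viterbi-style decoding below are assumptions for the sketch, not the thesis's actual formulation), a minimal Python sketch might look like this:

```python
import numpy as np

def viterbi(obs_loglik, log_trans, log_init):
    """Most-likely state path given per-frame log-likelihoods obs_loglik[t, s],
    a log transition matrix log_trans[i, j] (i -> j), and log initial probs."""
    T, S = obs_loglik.shape
    delta = np.empty((T, S))
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_init + obs_loglik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # score of each (prev, next) pair
        back[t] = np.argmax(scores, axis=0)          # best previous state per next state
        delta[t] = scores[back[t], np.arange(S)] + obs_loglik[t]
    path = np.empty(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

def acoustic_to_visual_states(acoustic_states, log_corr, log_vtrans, log_vinit):
    """Map a decoded acoustic state sequence to a visual state sequence.
    log_corr[a, v] is treated as the log affinity between acoustic state a and
    visual state v; decoding over visual states is again done with Viterbi."""
    per_frame = log_corr[acoustic_states]            # shape (T, n_visual_states)
    return viterbi(per_frame, log_vtrans, log_vinit)

def synthesize_faps(visual_states, fap_means):
    """One FAP vector per frame: here simply the mean FAP vector associated with
    each decoded visual state (a real system would smooth these trajectories)."""
    return fap_means[visual_states]                  # shape (T, n_faps)
```

In this sketch each visual state emits only its mean FAP vector via a lookup table; an actual VHMM would model per-state FAP distributions and produce smoother trajectories.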

Record details

  • Author

    Aleksic, Petar S.

  • Author affiliation

    Northwestern University.

  • Degree grantor: Northwestern University
  • Subject: Engineering, Electronics and Electrical
  • Degree: Ph.D.
  • Year: 2004
  • Pages: 224 p.
  • Total pages: 224
  • Format: PDF
  • Language: English (eng)
  • CLC classification: Radio electronics and telecommunications
  • Keywords

