International Conference on Affective Computing and Intelligent Interaction (ACII 2007); 12-14 September 2007; Lisbon, Portugal

Frame vs. Turn-Level: Emotion Recognition from Speech Considering Static and Dynamic Processing

Abstract

In contrast to the predominant approach of turn-wise statistics over acoustic Low-Level Descriptors followed by static classification, we re-investigate dynamic modeling directly at the frame level for speech-based emotion recognition. This seems beneficial, as it is well known that important information exists on temporal layers below the turn. Most promisingly, we integrate this frame-level information into a state-of-the-art large-feature-space emotion recognition engine. To investigate frame-level processing we employ a typical speaker-recognition setup tailored to emotion classification: a GMM for classification, with MFCCs plus speed and acceleration coefficients as features. We also consider the use of multiple states, i.e. an HMM. To fuse this information with turn-based modeling, the output scores are appended to a super-vector together with static acoustic features. A variety of Low-Level Descriptors and functionals covering prosodic, speech-quality, and articulatory aspects are considered. Starting from 1.4k features, we select optimal configurations both including and excluding the GMM information. The final decision is made by an SVM. Extensive test runs are carried out on two popular public databases, EMO-DB and SUSAS, to investigate acted and spontaneous data. As we face the current challenge of speaker-independent analysis, we also discuss the benefits of speaker normalization. The results clearly emphasize the superior power of integrating diverse time levels.
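The speaker-normalization step mentioned for speaker-independent analysis is not detailed in the abstract; one common realization is per-speaker z-normalization of the turn-level features, sketched below under that assumption (the function name and the epsilon guard are illustrative).

```python
import numpy as np

def speaker_znorm(features, speaker_ids):
    """Z-normalize every feature dimension within each speaker's turns."""
    features = np.asarray(features, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(features)
    for s in np.unique(speaker_ids):
        idx = speaker_ids == s
        mu = features[idx].mean(axis=0)
        sd = features[idx].std(axis=0) + 1e-9  # guard against zero variance
        out[idx] = (features[idx] - mu) / sd
    return out
```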
