Computer Vision and Image Understanding

Pushing the boundaries of audiovisual word recognition using Residual Networks and LSTMs


Abstract

Visual and audiovisual speech recognition are witnessing a renaissance which is largely due to the advent of deep learning methods. In this paper, we present a deep learning architecture for lipreading and audiovisual word recognition, which combines Residual Networks equipped with spatiotemporal input layers and Bidirectional LSTMs. The lipreading architecture attains 11.92% misclassification rate on the challenging Lipreading-In-The-Wild database, which is composed of excerpts from BBC-TV, each containing one of the 500 target words. Audiovisual experiments are performed using both intermediate and late integration, as well as several types and levels of environmental noise, and notable improvements over the audio-only network are reported, even in the case of clean speech. A further analysis on the utility of target word boundaries is provided, as well as on the capacity of the network in modeling the linguistic context of the target word. Finally, we examine difficult word pairs and discuss how visual information helps towards attaining higher recognition accuracy.
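The abstract describes a pipeline of a spatiotemporal (3D-convolutional) input layer, a Residual Network trunk, and a Bidirectional LSTM classifying one of 500 target words. A minimal PyTorch sketch of that kind of architecture is below; the layer sizes, the simplified two-layer stand-in for the ResNet trunk, and the temporal averaging before the classifier are illustrative assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class LipreadingNet(nn.Module):
    """Hypothetical sketch: 3D-conv front-end + per-frame 2D trunk + BiLSTM."""

    def __init__(self, num_words=500, feat_dim=256, lstm_hidden=256):
        super().__init__()
        # Spatiotemporal input layer: 3D convolution over (time, height, width)
        self.front3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2),
                         padding=(0, 1, 1)),
        )
        # Stand-in for the 2D ResNet trunk, applied frame by frame
        self.trunk = nn.Sequential(
            nn.Conv2d(64, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(feat_dim),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        # Bidirectional LSTM over the per-frame feature sequence
        self.bilstm = nn.LSTM(feat_dim, lstm_hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_words)

    def forward(self, x):                # x: (B, 1, T, H, W) grayscale mouth crops
        x = self.front3d(x)              # (B, 64, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)
        x = self.trunk(x).reshape(b, t, -1)    # (B, T, feat_dim)
        x, _ = self.bilstm(x)            # (B, T, 2 * lstm_hidden)
        return self.classifier(x.mean(dim=1))  # pool over time -> (B, num_words)

model = LipreadingNet()
logits = model(torch.randn(2, 1, 29, 112, 112))  # 29 frames of 112x112 crops
print(tuple(logits.shape))  # (2, 500)
```

For the audiovisual case described in the abstract, intermediate integration would concatenate the video features with audio features before the BiLSTM, while late integration would combine the per-stream word posteriors.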
