Computer Vision and Image Understanding

Pushing the boundaries of audiovisual word recognition using Residual Networks and LSTMs


Abstract

Visual and audiovisual speech recognition are witnessing a renaissance which is largely due to the advent of deep learning methods. In this paper, we present a deep learning architecture for lipreading and audiovisual word recognition, which combines Residual Networks equipped with spatiotemporal input layers and Bidirectional LSTMs. The lipreading architecture attains 11.92% misclassification rate on the challenging Lipreading-In-The-Wild database, which is composed of excerpts from BBC-TV, each containing one of the 500 target words. Audiovisual experiments are performed using both intermediate and late integration, as well as several types and levels of environmental noise, and notable improvements over the audio-only network are reported, even in the case of clean speech. A further analysis on the utility of target word boundaries is provided, as well as on the capacity of the network in modeling the linguistic context of the target word. Finally, we examine difficult word pairs and discuss how visual information helps towards attaining higher recognition accuracy.
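The abstract describes a pipeline of a spatiotemporal (3D-convolutional) input layer, a Residual Network trunk, and a Bidirectional LSTM classifying one of 500 target words. A minimal PyTorch sketch of that kind of architecture is below; the layer sizes, the simplified two-layer stand-in for the ResNet trunk, and the temporal averaging before the classifier are illustrative assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class LipreadingNet(nn.Module):
    """Hypothetical sketch: 3D-conv front-end + per-frame 2D trunk + BiLSTM."""

    def __init__(self, num_words=500, feat_dim=256, lstm_hidden=256):
        super().__init__()
        # Spatiotemporal input layer: 3D convolution over (time, height, width)
        self.front3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2),
                         padding=(0, 1, 1)),
        )
        # Stand-in for the 2D ResNet trunk, applied frame by frame
        self.trunk = nn.Sequential(
            nn.Conv2d(64, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(feat_dim),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        # Bidirectional LSTM over the per-frame feature sequence
        self.bilstm = nn.LSTM(feat_dim, lstm_hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_words)

    def forward(self, x):                # x: (B, 1, T, H, W) grayscale mouth crops
        x = self.front3d(x)              # (B, 64, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)
        x = self.trunk(x).reshape(b, t, -1)    # (B, T, feat_dim)
        x, _ = self.bilstm(x)            # (B, T, 2 * lstm_hidden)
        return self.classifier(x.mean(dim=1))  # pool over time -> (B, num_words)

model = LipreadingNet()
logits = model(torch.randn(2, 1, 29, 112, 112))  # 29 frames of 112x112 crops
print(tuple(logits.shape))  # (2, 500)
```

For the audiovisual case described in the abstract, intermediate integration would concatenate the video features with audio features before the BiLSTM, while late integration would combine the per-stream word posteriors.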
