Translating Videos to Natural Language Using Deep Recurrent Neural Networks

Abstract

Solving the visual symbol grounding problem has long been a goal of artificial intelligence. The field appears to be advancing closer to this goal with recent breakthroughs in deep learning for natural language grounding in static images. In this paper, we propose to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure. Described video datasets are scarce, and most existing methods have been applied to toy domains with a small vocabulary of possible words. By transferring knowledge from 1.2M+ images with category labels and 100,000+ images with captions, our method is able to create sentence descriptions of open-domain videos with large vocabularies. We compare our approach with recent work using language generation metrics, subject, verb, and object prediction accuracy, and a human evaluation.
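The pipeline the abstract describes — per-frame convolutional features collapsed into a single video representation, which conditions a recurrent language model that emits a sentence word by word — can be sketched as below. This is a minimal illustration with made-up dimensions and random, untrained weights (the published model uses CNN fc7 features and a far larger LSTM and vocabulary); all names such as `TinyLSTMDecoder` and `mean_pool` are this sketch's own, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only; the real model uses
# 4096-dim CNN features and a vocabulary of thousands of words.
FEAT_DIM, HID_DIM, VOCAB = 32, 16, 10
BOS, EOS = 0, 1  # begin/end-of-sentence token ids

def mean_pool(frame_feats):
    """Collapse per-frame CNN features into one video-level vector."""
    return frame_feats.mean(axis=0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyLSTMDecoder:
    """Single-layer LSTM that decodes a word sequence conditioned on a
    video feature vector. Weights are random, so this shows the data
    flow of the architecture, not a trained captioner."""

    def __init__(self):
        in_dim = FEAT_DIM + VOCAB  # input = [video feature; one-hot word]
        self.W = rng.normal(0, 0.1, (4 * HID_DIM, in_dim + HID_DIM))
        self.b = np.zeros(4 * HID_DIM)
        self.W_out = rng.normal(0, 0.1, (VOCAB, HID_DIM))

    def step(self, x, h, c):
        # Standard LSTM cell: input, forget, output gates and candidate.
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

    def decode(self, video_feat, max_len=8):
        h = c = np.zeros(HID_DIM)
        word, out = BOS, []
        for _ in range(max_len):
            one_hot = np.eye(VOCAB)[word]
            h, c = self.step(np.concatenate([video_feat, one_hot]), h, c)
            word = int(np.argmax(self.W_out @ h))  # greedy decoding
            if word == EOS:
                break
            out.append(word)
        return out

frames = rng.normal(size=(30, FEAT_DIM))  # 30 frames of CNN features
caption_ids = TinyLSTMDecoder().decode(mean_pool(frames))
print(caption_ids)  # a list of word ids; a vocabulary would map them to words
```

The transfer-learning idea in the abstract corresponds, in this framing, to initializing the convolutional feature extractor from image classification (1.2M+ labeled images) and the decoder from image captioning (100,000+ captioned images) before fine-tuning on the scarce described-video data.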
