Conference: International Conference of the Italian Association for Artificial Intelligence

Attentive Models in Vision: Computing Saliency Maps in the Deep Learning Era



Abstract

Estimating the focus of attention of a person looking at an image or a video is a crucial step that can enhance many vision-based inference mechanisms; image segmentation and annotation, video captioning, and autonomous driving are some examples. The early stages of attentive behavior are typically bottom-up; reproducing the same mechanism means finding the saliency embodied in the images, i.e. which parts of an image pop out of a visual scene. This process has been studied for decades, both in neuroscience and through computational models that reproduce the human cortical process. In the last few years, early models have been replaced by deep learning architectures that outperform every earlier approach on public datasets. In this paper, we discuss why convolutional neural networks (CNNs) are so accurate in saliency prediction. We present our deep learning (DL) architectures, which combine bottom-up cues with higher-level semantics and incorporate the concept of time in the attentional process through LSTM recurrent architectures. Finally, we present a video-specific architecture based on the C3D network, which can extract spatio-temporal features by means of 3D convolutions to model task-driven attentive behaviors. The merit of this work is to show that these deep networks are not mere brute-force methods tuned on massive amounts of data, but well-defined architectures that closely recall the early saliency models, improved with the semantics learned from human ground truth.
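To make the idea of spatio-temporal features from 3D convolutions more concrete, here is a minimal pure-Python sketch. It is not the C3D network or the authors' architecture: the kernel is a hand-written temporal difference rather than a learned filter, and the names `conv3d_valid`, `make_clip`, `TEMPORAL_DIFF`, and `motion_energy` are hypothetical, chosen only for this illustration. It shows that a kernel spanning the time axis responds to motion in a clip while ignoring static content.

```python
def conv3d_valid(clip, kernel):
    """Valid-mode 3D cross-correlation of clip[t][y][x] with kernel[dt][dy][dx]."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # Sum over the kernel's temporal and spatial extent.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def make_clip(moving):
    """Toy clip: 4 frames of 4x4 pixels with one bright pixel on row 1.

    If moving is True the pixel shifts right by one column per frame;
    otherwise it stays in column 1.
    """
    clip = []
    for t in range(4):
        frame = [[0.0] * 4 for _ in range(4)]
        x = t if moving else 1
        frame[1][x] = 1.0
        clip.append(frame)
    return clip


# Hand-written kernel (an assumption for illustration, not a learned filter):
# frame t+1 minus frame t over a 1x1 spatial window.
TEMPORAL_DIFF = [[[-1.0]], [[1.0]]]


def motion_energy(clip):
    """Sum of absolute responses of the temporal-difference kernel."""
    response = conv3d_valid(clip, TEMPORAL_DIFF)
    return sum(abs(v) for plane in response for row in plane for v in row)


if __name__ == "__main__":
    print("static clip:", motion_energy(make_clip(False)))
    print("moving clip:", motion_energy(make_clip(True)))
```

A purely 2D convolution applied frame by frame would give the same per-frame response for both clips up to a spatial shift; only a kernel with temporal extent separates the moving case from the static one. In a trained network such kernels are learned from data rather than written by hand.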
