Home > Foreign Conference Papers > European Conference on Computer Vision > Watch Hours in Minutes: Summarizing Videos with User Intent

Watch Hours in Minutes: Summarizing Videos with User Intent



Abstract

With the ever-increasing growth of video content, automatic video summarization has become an important task that has attracted considerable interest in the research community. One of the challenges that makes it a hard problem is the presence of multiple 'correct answers'. Because of the highly subjective nature of the task, there can be different "ideal" summaries of a video. Modelling user intent in the form of queries has been proposed in the literature as a way to alleviate this problem. A query-focused summary is expected to contain shots that are relevant to the query, in conjunction with other important shots. For practical deployments in which very long videos need to be summarized, the need to capture the user's intent becomes all the more pronounced. In this work, we propose a simple two-stage method that takes a user query and a video as input and generates a query-focused summary. Specifically, in the first stage, we employ attention within each segment and across all segments, combined with the query, to learn a feature representation of each shot. In the second stage, the learned features are again fused with the query to predict a score for each shot by regressing through fully connected layers. We then assemble the summary by arranging the top-scoring shots in chronological order. Extensive experiments on a benchmark query-focused video summarization dataset for long videos yield better results than the current state of the art, demonstrating the effectiveness of our method even without employing computationally expensive architectures such as LSTMs, variational autoencoders, GANs, or reinforcement learning, as most past works do.
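The two-stage pipeline described in the abstract can be sketched in a few lines. This is a minimal illustrative mock-up, not the authors' implementation: the attention is reduced to plain query-keyed softmax weighting, the fully connected regressor is a single random-weight projection, and all shapes, segment boundaries, and the `query_focused_summary` helper are hypothetical placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def query_focused_summary(shots, query, segments, k, rng):
    """Two-stage sketch: (1) query-conditioned attention within each
    segment and across segments refines shot features; (2) a tiny
    regressor scores each fused shot feature, and the top-k shots are
    returned in chronological order. Weights are random placeholders."""
    n, d = shots.shape
    # Stage 1: attention within each segment, keyed by the query.
    refined = np.zeros_like(shots)
    seg_reprs = []
    for seg in segments:
        feats = shots[seg]                     # (m, d) shots of one segment
        attn = softmax(feats @ query)          # query relevance within segment
        context = attn @ feats                 # segment context vector
        refined[seg] = feats + context         # residual fusion
        seg_reprs.append(context)
    # Attention across segments: weight each segment context by the query.
    seg_reprs = np.stack(seg_reprs)            # (num_segments, d)
    cross = softmax(seg_reprs @ query) @ seg_reprs
    refined = refined + cross                  # broadcast global context
    # Stage 2: fuse refined features with the query, regress a score.
    fused = np.concatenate([refined, np.tile(query, (n, 1))], axis=1)
    w = rng.standard_normal(2 * d)             # stand-in for FC-layer weights
    scores = fused @ w
    top = np.argsort(scores)[-k:]              # k highest-scoring shots
    return np.sort(top)                        # chronological order

rng = np.random.default_rng(0)
shots = rng.standard_normal((12, 8))           # 12 shots, 8-dim features
query = rng.standard_normal(8)                 # embedded user query
segments = [range(0, 4), range(4, 8), range(8, 12)]
summary = query_focused_summary(shots, query, segments, k=4, rng=rng)
print(summary)  # 4 shot indices in ascending (chronological) order
```

In a real system the shot features would come from a video encoder, the query from a text embedding, and the scoring weights would be trained against ground-truth summaries; the chronological reordering of the top-k shots is the only part taken literally from the abstract.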

