
Multistage temporal convolution transformer for action segmentation

Abstract

This paper addresses fully supervised action segmentation. Transformers have been shown to have large model capacity and powerful sequence modeling abilities, and hence seem well suited to capturing action grammar in videos. However, their performance in video understanding still lags behind that of temporal convolutional networks, or ConvNets for short. We hypothesize that this is because: (i) ConvNets tend to generalize better than Transformers, and (ii) the Transformer's large model capacity requires significantly larger training datasets than existing action segmentation benchmarks provide. We specify a new hybrid model, TCTr, that combines the strengths of both frameworks. TCTr seamlessly unifies depth-wise convolution and self-attention in a principled manner. TCTr also addresses the Transformer's quadratic computational and memory complexity in the sequence length by learning to adaptively estimate attention from local temporal neighborhoods instead of all frames. Our experiments show that TCTr significantly outperforms the state of the art on the Breakfast, GTEA, and 50Salads datasets. (c) 2022 Elsevier B.V. All rights reserved.
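The abstract names two mechanisms: a block that fuses depth-wise temporal convolution with self-attention, and attention restricted to local temporal neighborhoods to avoid quadratic cost in the number of frames. The paper's exact formulation is not reproduced here, so the following is a minimal PyTorch sketch of those two ideas under assumed design choices; the class name LocalConvAttention, the fixed window size, and all dimensions are illustrative, not from the paper (TCTr learns its neighborhoods adaptively, whereas this sketch uses a fixed window).

```python
import torch
import torch.nn as nn

class LocalConvAttention(nn.Module):
    """Illustrative block: depth-wise temporal convolution followed by
    self-attention masked to a local temporal window. A hypothetical
    sketch, not the paper's implementation."""

    def __init__(self, dim, window=16, kernel_size=3):
        super().__init__()
        # Depth-wise 1-D convolution over time: one filter per channel.
        self.dwconv = nn.Conv1d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.window = window
        self.scale = dim ** -0.5

    def forward(self, x):  # x: (batch, frames, dim)
        # Inject local temporal context via the depth-wise convolution.
        x = x + self.dwconv(x.transpose(1, 2)).transpose(1, 2)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = (q @ k.transpose(-2, -1)) * self.scale  # (batch, T, T)
        # Mask out attention beyond +/- window frames. The dense mask is
        # used only for readability; an efficient version would compute
        # the banded attention directly and avoid the O(T^2) matrix.
        t = x.size(1)
        idx = torch.arange(t, device=x.device)
        far = (idx[None, :] - idx[:, None]).abs() > self.window
        scores = scores.masked_fill(far, float('-inf'))
        return scores.softmax(dim=-1) @ v

# Example: 2 videos, 100 frames each, 64-dim per-frame features.
block = LocalConvAttention(dim=64)
out = block(torch.randn(2, 100, 64))  # -> (2, 100, 64)
```

With a banded computation, cost scales with the window size rather than the squared sequence length, which is the efficiency property the abstract claims for TCTr's local-neighborhood attention.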
