
Multistage temporal convolution transformer for action segmentation

Abstract

This paper addresses fully supervised action segmentation. Transformers have been shown to have large model capacity and powerful sequence modeling abilities, and hence seem well suited to capturing action grammar in videos. However, their performance in video understanding still lags behind that of temporal convolutional networks, or ConvNets for short. We hypothesize that this is because: (i) ConvNets tend to generalize better than Transformers, and (ii) the Transformer's large model capacity requires significantly larger training datasets than existing action segmentation benchmarks provide. We specify a new hybrid model, TCTr, that combines the strengths of both frameworks. TCTr seamlessly unifies depth-wise convolution and self-attention in a principled manner. TCTr also addresses the Transformer's quadratic computational and memory complexity in the sequence length by learning to adaptively estimate attention from local temporal neighborhoods instead of all frames. Our experiments show that TCTr significantly outperforms the state of the art on the Breakfast, GTEA, and 50Salads datasets. (c) 2022 Elsevier B.V. All rights reserved.
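The abstract names two mechanisms: a block that fuses depth-wise temporal convolution with self-attention, and attention restricted to local temporal neighborhoods to avoid quadratic cost in the number of frames. The paper's exact formulation is not reproduced here, so the following is a minimal PyTorch sketch of those two ideas under assumed design choices; the class name LocalConvAttention, the fixed window size, and all dimensions are illustrative, not from the paper (TCTr learns its neighborhoods adaptively, whereas this sketch uses a fixed window).

```python
import torch
import torch.nn as nn

class LocalConvAttention(nn.Module):
    """Illustrative block: depth-wise temporal convolution followed by
    self-attention masked to a local temporal window. A hypothetical
    sketch, not the paper's implementation."""

    def __init__(self, dim, window=16, kernel_size=3):
        super().__init__()
        # Depth-wise 1-D convolution over time: one filter per channel.
        self.dwconv = nn.Conv1d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.window = window
        self.scale = dim ** -0.5

    def forward(self, x):  # x: (batch, frames, dim)
        # Inject local temporal context via the depth-wise convolution.
        x = x + self.dwconv(x.transpose(1, 2)).transpose(1, 2)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = (q @ k.transpose(-2, -1)) * self.scale  # (batch, T, T)
        # Mask out attention beyond +/- window frames. The dense mask is
        # used only for readability; an efficient version would compute
        # the banded attention directly and avoid the O(T^2) matrix.
        t = x.size(1)
        idx = torch.arange(t, device=x.device)
        far = (idx[None, :] - idx[:, None]).abs() > self.window
        scores = scores.masked_fill(far, float('-inf'))
        return scores.softmax(dim=-1) @ v

# Example: 2 videos, 100 frames each, 64-dim per-frame features.
block = LocalConvAttention(dim=64)
out = block(torch.randn(2, 100, 64))  # -> (2, 100, 64)
```

With a banded computation, cost scales with the window size rather than the squared sequence length, which is the efficiency property the abstract claims for TCTr's local-neighborhood attention.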
