Recently, substantial research effort has focused on how to apply CNNs or RNNs to better extract temporal patterns from videos, so as to improve the accuracy of video classification. In this paper, however, we show that temporal information, especially longer-term patterns, may not be necessary to achieve competitive results on common video classification datasets. We investigate the potential of purely attention-based local feature integration. Accounting for the characteristics of such features in video classification, we propose a local feature integration framework based on attention clusters, and introduce a shifting operation to capture more diverse signals. We carefully analyze and compare the effect of different attention mechanisms, cluster sizes, and the use of the shifting operation, and also investigate the combination of attention clusters for multimodal integration. We demonstrate the effectiveness of our framework on three real-world video classification datasets. Our model achieves competitive results across all of these. In particular, on the large-scale Kinetics dataset, our framework obtains an excellent single-model accuracy of 79.4% top-1 and 94.0% top-5 on the validation set. Attention clusters are the backbone of our winning solution at the ActivityNet Kinetics Challenge 2017. Code and models will be released soon.
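As a rough illustration of the ideas named above, the following is a minimal NumPy sketch of one possible attention cluster with a shifting operation. The linear scoring vector, the scalar shift parameters `alpha` and `beta`, and the `1/sqrt(N)` cluster scaling are illustrative assumptions based on a common reading of such designs, not the paper's exact formulation.

```python
import numpy as np

def attention_unit(X, w, alpha, beta):
    """One attention unit over a set of local features.

    X: (L, D) array of L local features of dimension D.
    w: (D,) illustrative linear scoring vector (an assumption).
    alpha, beta: learnable scalars of the shifting operation (assumed form).
    """
    scores = X @ w                          # (L,) attention logits
    a = np.exp(scores - scores.max())
    a /= a.sum()                            # softmax over local features
    v = a @ X                               # (D,) attention-weighted sum
    s = alpha * v + beta                    # shifting: scale and bias
    return s / (np.linalg.norm(s) + 1e-8)   # L2-normalize to encourage diversity

def attention_cluster(X, params):
    """Concatenate N attention-unit outputs, scaled by 1/sqrt(N)."""
    N = len(params)
    outs = [attention_unit(X, w, a, b) for (w, a, b) in params]
    return np.concatenate(outs) / np.sqrt(N)

# Toy usage with random features and parameters.
rng = np.random.default_rng(0)
L, D, N = 10, 16, 4
X = rng.standard_normal((L, D))
params = [(rng.standard_normal(D), 1.0, 0.1) for _ in range(N)]
g = attention_cluster(X, params)            # (N * D,) global representation
```

Because each unit output is unit-normalized and the concatenation is scaled by `1/sqrt(N)`, the cluster output has unit L2 norm regardless of the number of units, which keeps representations comparable across cluster sizes.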