Person re-identification (Re-ID), an essential task in video surveillance, suffers from variations across different cameras. In this paper, we propose an effective transformer-based Re-ID framework for learning identity-discriminative and camera-invariant feature representations. In contrast to the recent direction of using generative models to augment training data and enhance invariance to input variations, we show that explicitly designing a novel adversarial loss from the perspective of feature representation learning effectively penalizes the distribution discrepancy across multiple camera domains. Pure transformer models have recently gained much attention for their strong representation capabilities. We employ a pure transformer encoder to extract a global feature vector from the patch tokens of each person image. Notably, a novel cross-patch encoder is introduced to capture structural information between image patches. Extensive experiments on three challenging datasets demonstrate the effectiveness and superiority of the proposed learning framework. (C) 2022 Published by Elsevier B.V.