Home > Foreign Journals > Image and Vision Computing > ArCo: Attention-reinforced transformer with contrastive learning for image captioning

ArCo: Attention-reinforced transformer with contrastive learning for image captioning



Abstract

Image captioning, in which a textual description of an image's content is generated, is a significant step toward automatic interaction between humans and computers. Recently, the transformer-based encoder-decoder paradigm has achieved great success in image captioning. Such models are usually trained with a cross-entropy loss; however, different captions of the same image that convey the same meaning can incur different losses. As a result, the generated descriptions tend to be uniform, which limits the diversity of image captioning. In this paper, we present ArCo, an attention-reinforced transformer architecture for image captioning. It improves the image encoding stage by integrating a feature attention block (FAB) that exploits the relationships between image regions. During the training phase, the model is optimized with a combination of cross-entropy loss and contrastive loss. We experimentally compared the performance of ArCo with other fully attentive models, and we also validated the transformer baseline for image captioning with different pre-trained models. Our proposed approach achieves new state-of-the-art performance on the offline 'Karpathy' test split and the online test server. (c) 2022 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
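The abstract does not give the exact loss formulation, so the sketch below shows one common way such a combined objective is set up: token-level cross-entropy for caption generation plus a batch-wise contrastive (InfoNCE-style) term that pulls matched image/caption embeddings together and pushes mismatched pairs apart. The function names, the `temperature`, and the weighting factor `lam` are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy_loss(logits, targets):
    """Mean negative log-likelihood of the target indices.

    logits: (N, C) scores; targets: (N,) integer class/word indices.
    """
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-12))

def contrastive_loss(img_emb, cap_emb, temperature=0.07):
    """InfoNCE over a batch: the i-th image and i-th caption are a
    positive pair; every other pairing in the batch is a negative."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    cap = cap_emb / np.linalg.norm(cap_emb, axis=1, keepdims=True)
    sim = img @ cap.T / temperature          # (B, B) cosine similarities
    labels = np.arange(sim.shape[0])         # diagonal = positive pairs
    return cross_entropy_loss(sim, labels)

def combined_loss(logits, targets, img_emb, cap_emb, lam=0.5):
    """Hypothetical training objective: cross-entropy on caption tokens
    plus a weighted contrastive term on image/caption embeddings."""
    return cross_entropy_loss(logits, targets) + lam * contrastive_loss(img_emb, cap_emb)
```

In practice such a combination is computed on autograd tensors (e.g. in PyTorch) rather than NumPy; the NumPy version here only illustrates the arithmetic of the two terms and how a single scalar loss is formed from them.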


