Visual-Word Adaptive Image Semantic Captioning Based on Transformer Network

Author:

Affiliation:

Lanzhou University of Technology

CLC number:

TP391

Fund project:

National Natural Science Foundation of China (No. 62061042); Development Fund Project of the Gansu Provincial Key Laboratory of Advanced Control for Industrial Processes (No. 2022KX10)

Abstract:

Aiming at problems in image semantic captioning such as insufficient extraction of visual feature information and deficiencies in distinguishing visual words from non-visual words, a visual-word adaptive image semantic captioning method based on the Transformer is proposed. The method improves image captioning with a channel self-attention module and an adaptive attention module that incorporates the DeBERTa pre-trained language model, strengthening the expression of both visual and language signals. First, the ResNeXt-152 network is used as the backbone for image feature extraction; at this stage, the ECANet channel attention mechanism is combined with the self-attention of the Transformer encoder to form a channel self-attention module, which improves the accuracy of feature extraction and emphasizes relevant regions. Second, an adaptive attention module incorporating the DeBERTa pre-trained language model is added on top of the decoder to handle visual and non-visual words effectively, weighing the contributions of the visual signal and the language context when each caption word is generated. Finally, the AdaMod optimizer is used to obtain the optimal network parameters, and the model is trained with a cross-entropy loss. Comparative experiments on the MS COCO benchmark dataset, with both quantitative evaluation and qualitative analysis, verify the effectiveness of the model: the BLEU-1, BLEU-4, METEOR, ROUGE-L, and CIDEr scores all improve significantly, with the CIDEr metric rising by about 2%. Compared with a standard Transformer model, the proposed method produces more accurate and detailed image captions.
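As a concrete illustration of the first contribution, the following is a minimal PyTorch sketch of a channel self-attention block that applies ECA-style channel gating to the backbone's grid features before Transformer self-attention. It is only a sketch of the idea, not the paper's implementation; the module names, kernel size, and dimensions (ECA, ChannelSelfAttentionBlock, d_model=512, and so on) are assumptions.

# Channel self-attention block: ECA-style channel gating + Transformer self-attention.
# Minimal sketch for illustration; names and sizes are assumptions, not the paper's code.
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: a 1D convolution over channel-wise global descriptors."""
    def __init__(self, k_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)

    def forward(self, x):                        # x: (B, N, C) grid features from the backbone
        y = x.mean(dim=1, keepdim=True)          # (B, 1, C) average-pool over the N regions
        y = torch.sigmoid(self.conv(y))          # (B, 1, C) per-channel gate
        return x * y                             # re-weight the channels of every region

class ChannelSelfAttentionBlock(nn.Module):
    """One encoder layer: channel attention first, then multi-head self-attention."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.eca = ECA()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                        # x: (B, N, d_model)
        x = self.eca(x)                          # emphasize informative feature channels
        attended, _ = self.self_attn(x, x, x)    # relate image regions to one another
        return self.norm(x + attended)           # residual connection + layer norm

# Example: a batch of 2 images, 49 grid features (7x7) of dimension 512.
feats = torch.randn(2, 49, 512)
print(ChannelSelfAttentionBlock()(feats).shape)  # torch.Size([2, 49, 512])

In this reading, the channel gate decides which feature channels matter before self-attention decides which regions relate to each other.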

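For the second contribution, the sketch below shows one way an adaptive attention gate could weigh the attended visual signal against a language-context signal (for example, hidden states from a frozen DeBERTa encoder) when each caption word is generated, in the spirit of visual-sentinel style adaptive attention. The class name AdaptiveAttention, the gating design, and all dimensions are illustrative assumptions rather than the authors' exact module.

# Adaptive attention gate over visual and language signals; illustrative sketch only.
import torch
import torch.nn as nn

class AdaptiveAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, lm_dim: int = 768):
        super().__init__()
        self.lm_proj = nn.Linear(lm_dim, d_model)        # project language-model features
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, 1)            # how much to trust the visual signal

    def forward(self, dec_states, vis_feats, lm_feats):
        # dec_states: (B, T, d_model) decoder states for T generated tokens
        # vis_feats:  (B, N, d_model) encoded image regions
        # lm_feats:   (B, T, lm_dim)  per-token features from the language model
        vis_ctx, _ = self.cross_attn(dec_states, vis_feats, vis_feats)  # attended visual context
        lang_ctx = self.lm_proj(lm_feats)                               # language context
        beta = torch.sigmoid(self.gate(torch.cat([vis_ctx, lang_ctx], dim=-1)))  # (B, T, 1)
        return beta * vis_ctx + (1.0 - beta) * lang_ctx  # per-token mixture of the two signals

# Example with random tensors standing in for decoder, encoder, and DeBERTa outputs.
out = AdaptiveAttention()(torch.randn(2, 12, 512), torch.randn(2, 49, 512), torch.randn(2, 12, 768))
print(out.shape)   # torch.Size([2, 12, 512])

The intended behaviour is that visual words ("dog", "red") push the gate towards the visual context, while non-visual words ("a", "of") rely mostly on the language context.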
History
  • Received: 2023-10-07
  • Revised: 2024-04-10
  • Accepted: 2024-04-16