Visual Word Adaptive Image Semantic Captioning Based on Transformer Network
DOI:
Author:
Affiliation:

Lanzhou University of Technology

Author Biography:

Corresponding Author:

CLC Number:

TP391

Fund Project:

National Natural Science Foundation of China (No.62061042); Gansu Provincial Key Laboratory for Advanced Control of Industrial Processes Development Fund Project (No.2022KX10)

Abstract:

To address the problems of insufficient extraction of visual feature information and unreliable distinction between visual and non-visual words in image semantic captioning, a visual word adaptive image semantic captioning method based on the Transformer is proposed. The method improves image captioning with a channel self-attention module and an adaptive attention module that integrates the DeBERTa pre-trained language model, strengthening the expression of both visual and language signals. First, a ResNeXt-152 network is adopted as the backbone for image feature extraction; in this stage, the ECANet channel attention mechanism is combined with the self-attention of the Transformer encoder to form a channel self-attention module, improving the accuracy of feature extraction and emphasizing relevant regions. Second, an adaptive attention module incorporating the DeBERTa pre-trained language model is added on top of the decoder to handle visual and non-visual words effectively and to weigh the contributions of the visual signal and the language context when generating each caption word. Finally, the AdaMod optimizer is used to obtain the optimal network parameters, and the model is trained with a cross-entropy loss. Comparative experiments on the MS COCO benchmark dataset verify the effectiveness of the model through quantitative evaluation and qualitative analysis. The results show significant improvements in BLEU-1, BLEU-4, METEOR, ROUGE-L and CIDEr scores, with CIDEr increasing by about 2%. Compared with a standard Transformer model, the proposed method produces more accurate and detailed descriptions of images.
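The abstract describes combining an ECANet-style channel attention with the self-attention of the Transformer encoder into a channel self-attention module. The following PyTorch sketch shows one plausible arrangement under stated assumptions; it is not the authors' released code, and the module names, the fixed kernel size `k_size`, and the ordering (channel reweighting before multi-head self-attention) are assumptions.

```python
import torch
import torch.nn as nn

class ECALayer(nn.Module):
    """ECANet-style channel attention: a 1-D convolution over globally
    pooled channel descriptors produces per-channel weights."""
    def __init__(self, k_size: int = 3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool1d(1)           # pool over regions
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size,
                              padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                  # x: (B, N, C)
        y = self.avg_pool(x.transpose(1, 2))               # (B, C, 1)
        y = self.conv(y.transpose(1, 2))                   # (B, 1, C) cross-channel interaction
        w = self.sigmoid(y)                                # per-channel weights
        return x * w                                       # reweight channels

class ChannelSelfAttentionBlock(nn.Module):
    """Channel attention followed by standard multi-head self-attention,
    one reading of the 'channel self-attention module'."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.eca = ECALayer()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, feats):                              # feats: (B, N, d_model)
        feats = self.eca(feats)                            # emphasize informative channels
        attn_out, _ = self.self_attn(feats, feats, feats)  # relate image regions
        return self.norm(feats + attn_out)                 # residual + layer norm

# Example: 49 grid features of dimension 512 from the ResNeXt backbone
block = ChannelSelfAttentionBlock()
out = block(torch.randn(2, 49, 512))                       # -> (2, 49, 512)
```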
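For the adaptive attention module that weighs the visual signal against the language context when deciding whether the next token is a visual word, a sentinel-style gate is one common realization. The sketch below is an assumed simplification: `lang_state` stands in for a projected hidden state obtained from the DeBERTa pre-trained language model (e.g., via Hugging Face transformers), and the gate variable `beta` is a hypothetical name, not the paper's notation.

```python
import torch
import torch.nn as nn

class AdaptiveAttention(nn.Module):
    """Gate between attended visual features and a language-model context:
    beta near 1 -> rely on the language signal (non-visual word),
    beta near 0 -> rely on the visual signal (visual word)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, 1)

    def forward(self, dec_state, visual_feats, lang_state):
        # dec_state: (B, 1, d) decoder query; visual_feats: (B, N, d)
        # lang_state: (B, 1, d) projected DeBERTa hidden state for the prefix
        v_ctx, _ = self.visual_attn(dec_state, visual_feats, visual_feats)
        beta = torch.sigmoid(self.gate(torch.cat([dec_state, lang_state], dim=-1)))
        ctx = (1.0 - beta) * v_ctx + beta * lang_state      # adaptive mixture
        return ctx, beta

# Example with random tensors standing in for real features
att = AdaptiveAttention()
ctx, beta = att(torch.randn(2, 1, 512), torch.randn(2, 49, 512), torch.randn(2, 1, 512))
print(beta.shape)                                           # torch.Size([2, 1, 1])
```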
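Training is described as cross-entropy optimization with the AdaMod optimizer. A minimal sketch of that loop follows, assuming a reference AdaMod implementation is available (for example the `adamod` package on PyPI) and a caption model that returns logits over the vocabulary; `model`, `loader`, `vocab_size`, and `pad_id` are placeholders, and the learning rate and `beta3` values are illustrative rather than the paper's settings.

```python
import torch
import torch.nn as nn
from adamod import AdaMod           # assumed: reference AdaMod implementation

def train_one_epoch(model, loader, vocab_size, pad_id, device="cuda"):
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)    # ignore padding tokens
    optimizer = AdaMod(model.parameters(), lr=5e-4, beta3=0.999)
    model.train()
    for images, captions in loader:
        images, captions = images.to(device), captions.to(device)
        # teacher forcing: predict token t+1 from tokens up to t
        logits = model(images, captions[:, :-1])             # (B, T-1, vocab)
        loss = criterion(logits.reshape(-1, vocab_size),
                         captions[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```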

History
  • Received: 2023-10-07
  • Revised: 2024-04-10
  • Accepted: 2024-04-16
  • Published online:
  • Publication date: