Abstract: To address problems in image captioning such as insufficient extraction of visual feature information and weak discrimination between visual and non-visual words, a Transformer-based image captioning method with visual-word adaptation is proposed. The method improves captioning with a channel self-attention module and an adaptive attention module that integrates the DeBERTa pre-trained language model, strengthening the expression of both visual and language signals. First, a ResNeXt-152 network serves as the backbone for image feature extraction; in this stage, the ECANet channel attention mechanism is combined with the self-attention in the Transformer encoder to form a channel self-attention module, which improves the accuracy of feature extraction and emphasizes relevant image regions. Second, an adaptive attention module based on the DeBERTa pre-trained language model is proposed to handle visual and non-visual words effectively by measuring the contributions of the visual signal and the language context to the generation of each caption word. Finally, the AdaMod optimizer is used to obtain optimal network parameters, and the model is trained with cross-entropy loss. Comparative experiments on the MS COCO benchmark dataset verify the effectiveness of the model through quantitative evaluation and qualitative analysis. The results show clear improvements in BLEU-1, BLEU-4, METEOR, ROUGE-L, and CIDEr scores, with CIDEr increasing by about 2%. Compared with a standard Transformer model, the proposed method produces more accurate and detailed descriptions of images.
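
As a concrete illustration of the channel self-attention idea summarized above, the following PyTorch sketch applies an ECA-style channel re-weighting to backbone grid features before standard multi-head self-attention. It is a minimal sketch under stated assumptions, not the paper's implementation: the class names, feature dimensions, and the 1-D convolution kernel size are illustrative choices, and the actual encoder details may differ.

    # Minimal sketch: ECA-style channel attention combined with Transformer
    # self-attention over grid features (names and sizes are assumptions).
    import torch
    import torch.nn as nn

    class ECA(nn.Module):
        """Efficient Channel Attention: 1-D conv over pooled channel descriptors."""
        def __init__(self, k: int = 3):
            super().__init__()
            self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
            self.sigmoid = nn.Sigmoid()

        def forward(self, x):                           # x: (B, C, H, W)
            y = x.mean(dim=(2, 3))                      # global average pooling -> (B, C)
            y = self.conv(y.unsqueeze(1)).squeeze(1)    # local cross-channel interaction
            w = self.sigmoid(y).unsqueeze(-1).unsqueeze(-1)
            return x * w                                # re-weight channels

    class ChannelSelfAttentionBlock(nn.Module):
        """ECA re-weighting followed by multi-head self-attention on grid tokens."""
        def __init__(self, d_model: int = 512, n_heads: int = 8):
            super().__init__()
            self.eca = ECA()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, feat):                        # feat: (B, d_model, H, W)
            feat = self.eca(feat)
            tokens = feat.flatten(2).transpose(1, 2)    # (B, H*W, d_model)
            out, _ = self.attn(tokens, tokens, tokens)
            return self.norm(tokens + out)              # residual + layer norm

    # Usage example: a hypothetical 7x7, 512-channel feature map from the backbone.
    x = torch.randn(2, 512, 7, 7)
    print(ChannelSelfAttentionBlock()(x).shape)         # torch.Size([2, 49, 512])
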