Weakly supervised temporal action localization method based on modal bias compensation

Author:

Affiliation:

1. School of Big Data and Software Engineering, Chongqing University; 2. Department of Computer Science and Engineering, Shanghai Jiao Tong University

CLC number:

TP389.1

Fund Project:

National Natural Science Foundation of China General Program (62176031); Chongqing Special Key Project for Technological Innovation and Application Development (CSTB2022TIAD-KPX0100).

Abstract:

Weakly supervised temporal action localization has become a research hotspot in video understanding, owing both to its application potential in fields such as intelligent surveillance and video retrieval and to the low annotation cost of its training data. Existing localization methods based on multimodal learning overlook the bias inherent in each modality, which leads to suboptimal localization performance. To address this, an RGB motion-subject information suppression module is constructed and an optical-flow dominant-influence suppression strategy is designed, aiming to eliminate the localization bias that each modality imposes on the trained model. Experimental results on the two benchmark datasets THUMOS14 and ActivityNet v1.2 show that the mean average precision, averaged over multiple temporal IoU thresholds, reaches 45.3% and 26.5%, respectively, and that the overall localization performance surpasses that of mainstream methods, demonstrating the effectiveness of the proposed approach. A key advantage of the method is that it explores and compensates for each modality's localization bias only at the coarse-grained modality level, which improves the baseline localization performance of multimodal temporal action localization models and keeps the method compatible with fine-grained localization approaches.
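To make the reported metric concrete, the sketch below illustrates how mean average precision is averaged over multiple temporal IoU thresholds, the evaluation protocol behind the 45.3% and 26.5% figures. This is not the authors' code: the helper names, the toy data, and the THUMOS14-style threshold grid are assumptions for illustration only.

```python
# Minimal sketch of mAP averaged over temporal IoU (tIoU) thresholds.
# Assumptions: threshold grid, function names, and toy data are illustrative.
from typing import List, Tuple

def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Intersection over union of two 1-D segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(preds: List[Tuple[float, float, float]],
                      gts: List[Tuple[float, float]],
                      tiou: float) -> float:
    """AP for one action class: preds are (score, start, end), gts are (start, end)."""
    preds = sorted(preds, key=lambda p: -p[0])      # rank predictions by confidence
    matched = [False] * len(gts)
    tp, precisions = 0, []
    for rank, (_, s, e) in enumerate(preds, start=1):
        # greedily match the highest-IoU unmatched ground-truth segment
        best_j, best_iou = None, 0.0
        for j, g in enumerate(gts):
            iou = temporal_iou((s, e), g)
            if not matched[j] and iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None and best_iou >= tiou:
            matched[best_j] = True
            tp += 1
            precisions.append(tp / rank)            # precision at each new recall step
    return sum(precisions) / len(gts) if gts else 0.0

# Average mAP over a tIoU grid (a common THUMOS14-style choice; the paper's
# exact grid may differ). Toy single-class example:
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
preds = [(0.9, 1.0, 3.0), (0.6, 7.5, 9.0)]
gts = [(1.2, 3.1), (7.0, 9.0)]
avg_map = sum(average_precision(preds, gts, t) for t in thresholds) / len(thresholds)
print(f"average mAP over tIoU {thresholds[0]}-{thresholds[-1]}: {avg_map:.3f}")
```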

History
  • Received: 2024-04-09
  • Revised: 2024-05-22
  • Accepted: 2024-07-31