Abstract: Weakly supervised temporal action localization has become a research hotspot in video understanding, owing to its application potential in intelligent surveillance, video retrieval, and related fields, as well as its low training-data annotation cost. Existing multimodal localization methods ignore the biases inherent in each modality, which degrades localization performance. To address this, we construct an RGB action-subject information compensation module and design an optical flow-based dominant-influence suppression strategy, aiming to eliminate the localization bias each modality imposes on the trained model. Experimental results on two benchmark datasets, THUMOS14 and ActivityNet v1.2, show that the mean average precision averaged over multiple temporal intersection-over-union thresholds reaches 45.3% and 26.5%, respectively; the overall localization performance surpasses several recent methods, demonstrating the effectiveness of the proposed approach. Our method improves the baseline performance of temporal action localization models by compensating for bias at a coarse-grained, modality level, and it remains compatible with fine-grained localization methods.