Abstract: Although few-shot action recognition based on the metric-learning paradigm has achieved significant success, existing methods fail to address two issues: 1) inadequate modeling of action relations and underutilization of multi-modal information; 2) difficulty in matching videos of different lengths and speeds and in aligning mismatched video sub-actions. To address these limitations, we propose a two-stream joint matching (TSJM) method based on mutual information, which consists of two modules: a multi-modal contrastive learning (MCL) module and a joint matching module (JMM). The MCL module extensively explores mutual-information relationships across modalities and fully exploits the information within each modality to strengthen the modeling of action relations. The JMM is designed to solve both of the aforementioned video matching problems simultaneously: by integrating dynamic time warping (DTW) with bipartite graph matching, it optimizes the matching process to produce the final alignment, thereby achieving high few-shot action recognition accuracy. We evaluate the proposed method on two widely used few-shot action recognition datasets, SSV2 and Kinetics, and conduct comprehensive ablation experiments to substantiate the efficacy of our approach.
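To make the mutual-information objective behind MCL concrete, the sketch below uses a symmetric InfoNCE loss, a standard contrastive lower bound on the mutual information between paired embeddings from two modalities. This is a minimal illustration under assumed conventions (batch-level positive pairing, cosine similarity, a `temperature` hyperparameter), not necessarily the exact MCL objective of TSJM.

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired multi-modal embeddings."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature      # (B, B) cosine-similarity matrix
    idx = np.arange(len(v))             # positive pairs lie on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # for numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[idx, idx].mean()

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (xent(logits) + xent(logits.T))

# Usage: a batch of 4 paired embeddings (random features for demonstration).
rng = np.random.default_rng(0)
loss = info_nce(rng.standard_normal((4, 128)), rng.standard_normal((4, 128)))
print(f"InfoNCE loss: {loss:.4f}")
```

Minimizing this loss pulls embeddings of the same action together across modalities while pushing apart mismatched pairs, which is the mechanism by which inter-modal mutual information is exploited.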
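As an illustration of how the two matching strategies in JMM can be combined, the following sketch computes a frame-level cost matrix between two videos, scores it with classic DTW (tolerant to differing lengths and speeds) and with optimal bipartite assignment via the Hungarian algorithm (tolerant to sub-action misalignment), and blends the two scores. The function names and the fusion weight `alpha` are assumptions for illustration, not the exact JMM formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pairwise_cost(query, support):
    """Cosine-distance cost matrix between two frame-embedding sequences."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    s = support / np.linalg.norm(support, axis=1, keepdims=True)
    return 1.0 - q @ s.T  # shape (len(query), len(support))

def dtw_cost(cost):
    """Classic DTW: cheapest monotonic alignment path through `cost`."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m] / (n + m)  # length-normalized path cost

def bipartite_cost(cost):
    """Optimal one-to-one frame assignment (Hungarian algorithm)."""
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

def joint_match(query, support, alpha=0.5):
    """Blend both distances; `alpha` is an assumed fusion weight."""
    cost = pairwise_cost(query, support)
    return alpha * dtw_cost(cost) + (1 - alpha) * bipartite_cost(cost)

# Usage: distance between an 8-frame query and a 10-frame support video.
rng = np.random.default_rng(0)
d = joint_match(rng.standard_normal((8, 64)), rng.standard_normal((10, 64)))
print(f"joint matching distance: {d:.4f}")
```

DTW enforces temporal order but can be misled when sub-actions occur out of sequence, while bipartite matching ignores order entirely; fusing the two scores is one plausible way to get robustness to both failure modes.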