Abstract: Cross-modal speech separation methods, which use audio and visual information simultaneously, achieve greater accuracy and stability than single-modal methods. However, most existing cross-modal speech separation methods are applicable only to scenarios with high-definition face images, which carries the risk of intruding on and disclosing personal privacy. To address this, we propose a multi-level feature fusion cross-modal speech separation method based on low-resolution images. The method constructs a visual feature extractor for low-resolution images that adopts a three-branch “fast-medium-slow” structure. Each branch processes video frames at a different rate, so that dynamic features of the face and lips are extracted at different levels. These features correspond to the phoneme-level, word-level, and discourse-level features related to the acoustic features, respectively, and are fused in stages during feature extraction. To verify the effectiveness of the proposed method, we conduct three types of experiments: a comparison with an audio-only method on different datasets, a comparison across different image resolutions on different datasets, and model-structure ablation experiments on LRS3. The results show that the proposed method not only achieves speech separation with high-resolution images but also obtains good separation performance with low-resolution images.
On LRS3, LRS2, and GRID, the network improves SI-SNRi by 4.3%, 10.6%, and 26.5%, respectively, and SDRi by 4.6%, 11.3%, and 21.8%, respectively, compared with the unimodal speech separation model under the same conditions.
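For readers unfamiliar with the SI-SNRi metric reported above, the following is a minimal sketch of the standard scale-invariant signal-to-noise ratio and its improvement over the unprocessed mixture. It is not the authors' evaluation code: the function names `si_snr` and `si_snr_improvement`, the `eps` stabilizer, and the plain-list signal representation are assumptions made for illustration, and the abstract does not state how the relative percentage gains were derived from the metric.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals.

    Hypothetical helper for illustration; `eps` only guards against
    division by zero and log of zero.
    """
    n = len(target)
    if n == 0 or len(estimate) != n:
        raise ValueError("signals must be non-empty and of equal length")

    # Remove the mean of each signal.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target to get the target component.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t)
    scale = dot / (t_energy + eps)
    s_target = [scale * b for b in t]

    # Whatever is left after the projection is treated as noise.
    e_noise = [a - s for a, s in zip(e, s_target)]

    num = sum(s * s for s in s_target)
    den = sum(x * x for x in e_noise)
    return 10.0 * math.log10((num + eps) / (den + eps))


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: SI-SNR of the separated output minus that of the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```

A higher SI-SNRi means the separated signal is closer to the clean target than the input mixture was, regardless of any overall gain applied to the estimate.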