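[Keywords]
[Abstract]
Cross-modal speech separation methods draw on both visual and auditory information and achieve considerably better accuracy and stability than single-modal methods. Most existing cross-modal speech separation methods work only with high-definition face images, which raises problems such as privacy intrusion and leakage of personal information. To address this problem, a multi-level feature fusion cross-modal speech separation method that uses low-resolution images is proposed. The method builds a visual feature extractor for low-resolution images that extracts visual features with a three-branch “fast-medium-slow” structure. Each branch processes video frames at a different rate and extracts dynamic features of the face and lips at a different level, corresponding to the phoneme-level, word-level, and utterance-level features associated with the acoustic features, and the features are fused in stages during extraction. To verify the effectiveness of the proposed method, three types of experiments were designed: comparisons with audio-only methods on different datasets, comparisons across different resolutions, and model structure ablation experiments on the LRS3 dataset. The experimental results show that the proposed method not only performs speech separation with high-resolution images but also maintains good separation performance at low resolution. On the LRS3, LRS2, and GRID datasets, compared with single-modal speech separation methods, the method improves SI-SNRi by 4.3%, 10.6%, and 26.5%, respectively, and SDRi by 4.6%, 11.3%, and 21.8%, respectively.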
[Key word]
[Abstract]
By exploiting audio and visual information simultaneously, cross-modal speech separation methods achieve considerably higher accuracy and stability than single-modal methods. However, most existing cross-modal methods are applicable only to scenarios with high-definition face images, which carries risks of privacy intrusion and disclosure of personal information. To address this problem, we propose a multi-level feature fusion cross-modal speech separation method based on low-resolution images. The method constructs a visual feature extractor for low-resolution images with a three-branch “fast-medium-slow” structure. Each branch processes video frames at a different rate and extracts dynamic features of the face and lips at a different level; these correspond to the phoneme-level, word-level, and utterance-level features associated with the acoustic features, and they are fused in stages during feature extraction. To verify the effectiveness of the proposed method, we conduct three types of experiments: comparisons with audio-only methods on different datasets, comparisons among different image resolutions on different datasets, and model structure ablation experiments on LRS3. The results show that the proposed method not only performs speech separation with high-resolution images but also maintains good separation performance with low-resolution images. On LRS3, LRS2, and GRID, the method improves SI-SNRi by 4.3%, 10.6%, and 26.5%, respectively, and SDRi by 4.6%, 11.3%, and 21.8%, respectively, compared with the unimodal speech separation model under the same conditions.
[CLC number]
TP181;TP183
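[Funding]
National Key Research and Development Program of China, "Big Data Analysis Technology for Precise Operation and Maintenance of Intelligent Equipment Based on Cloud-Edge Environments" (2021YFB1715000); Sichuan Province Key Research and Development Program, "Modeling and Learning of Cross-Modal Multi-Party Dialogue Robot Systems" (2021YFG0315).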