Cross-modal speech separation method based on multi-level feature fusion

Affiliation:

School of Information Engineering, Southwest University of Science and Technology

CLC number:

TP181; TP183

Fund Project:

National Key R&D Program of China, "Big Data Analysis Technology for Precise Operation and Maintenance of Intelligent Equipment in a Cloud-Edge Environment" (2021YFB1715000); Sichuan Science and Technology Program, "Modeling and Learning of a Cross-modal Multi-party Dialogue Robot System" (2021YFG0315)

Abstract:

Cross-modal speech separation methods exploit auditory and visual information simultaneously and achieve clear gains in accuracy and stability over single-modal methods. However, most existing cross-modal methods are applicable only to scenes with high-definition face images, which raises concerns about privacy intrusion and the leakage of personal information. To address this problem, we propose a multi-level feature fusion cross-modal speech separation method that uses low-resolution images. The method builds a visual feature extractor for low-resolution images with a three-branch "fast-medium-slow" structure: each branch processes the video frames at a different rate and extracts lip dynamics at a different level, corresponding respectively to the phoneme-level, word-level, and utterance-level features related to the acoustic features, and the branch features are fused in stages during extraction. To verify the effectiveness of the proposed method, three types of experiments are conducted: comparisons with an audio-only method on different datasets, comparisons across different image resolutions, and ablation studies of the model structure on the LRS3 dataset. The results show that the proposed method not only performs speech separation on high-resolution images but also maintains good separation performance at low resolutions. On the LRS3, LRS2, and GRID datasets, compared with a single-modal speech separation method, the proposed method improves SI-SNRi by 4.3%, 10.6%, and 26.5%, and SDRi by 4.6%, 11.3%, and 21.8%, respectively.
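The reported gains are given in terms of SI-SNRi and SDRi: SI-SNRi is the improvement in scale-invariant signal-to-noise ratio of the separated estimate over the unprocessed mixture, and SDRi is the analogous improvement in signal-to-distortion ratio. As a reminder, the standard definition of SI-SNR (assuming zero-mean signals, with \hat{s} the estimate, s the clean reference, and x the input mixture) is:

\[
  s_{\mathrm{target}} = \frac{\langle \hat{s},\, s \rangle}{\lVert s \rVert^{2}}\, s,
  \qquad
  e_{\mathrm{noise}} = \hat{s} - s_{\mathrm{target}},
\]
\[
  \mathrm{SI\text{-}SNR}(\hat{s}, s) = 10 \log_{10} \frac{\lVert s_{\mathrm{target}} \rVert^{2}}{\lVert e_{\mathrm{noise}} \rVert^{2}},
  \qquad
  \mathrm{SI\text{-}SNRi} = \mathrm{SI\text{-}SNR}(\hat{s}, s) - \mathrm{SI\text{-}SNR}(x, s).
\]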
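The three-branch "fast-medium-slow" front end described in the abstract lends itself to a compact sketch. The PyTorch code below is only a minimal illustration of that idea; the class names (Branch, MultiRateLipEncoder), layer sizes, temporal strides, and fusion points are assumptions for illustration and are not taken from the paper.

# Minimal sketch of a "fast-medium-slow" three-branch lip encoder with
# staged fusion. Input is assumed to be grayscale lip crops of shape
# (batch, channels=1, frames, height, width). All hyperparameters are
# illustrative, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Branch(nn.Module):
    """One temporal branch: subsample frames by `stride`, then encode."""

    def __init__(self, in_ch: int, out_ch: int, stride: int):
        super().__init__()
        self.stride = stride  # temporal subsampling rate of this branch
        self.encode = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # collapse the spatial dims
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        x = video[:, :, :: self.stride]   # keep every `stride`-th frame
        x = self.encode(x)                # (B, C, T/stride, 1, 1)
        return x.squeeze(-1).squeeze(-1)  # (B, C, T/stride)


class MultiRateLipEncoder(nn.Module):
    """Fast/medium/slow branches fused in stages along the time axis."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.fast = Branch(1, feat_dim, stride=1)    # ~phoneme-level dynamics
        self.medium = Branch(1, feat_dim, stride=2)  # ~word-level dynamics
        self.slow = Branch(1, feat_dim, stride=4)    # ~utterance-level dynamics
        self.fuse1 = nn.Conv1d(2 * feat_dim, feat_dim, kernel_size=1)
        self.fuse2 = nn.Conv1d(2 * feat_dim, feat_dim, kernel_size=1)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        f, m, s = self.fast(video), self.medium(video), self.slow(video)
        # Stage 1: bring the medium branch up to the fast frame rate and fuse.
        m_up = F.interpolate(m, size=f.shape[-1], mode="nearest")
        fused = self.fuse1(torch.cat([f, m_up], dim=1))
        # Stage 2: fuse the slow branch in the same way.
        s_up = F.interpolate(s, size=f.shape[-1], mode="nearest")
        return self.fuse2(torch.cat([fused, s_up], dim=1))  # (B, feat_dim, T)


if __name__ == "__main__":
    lips = torch.randn(2, 1, 32, 48, 48)  # 2 clips, 32 frames of 48x48 crops
    print(MultiRateLipEncoder()(lips).shape)  # torch.Size([2, 64, 32])

In this sketch the fast branch keeps every frame (roughly phoneme-scale motion), the medium and slow branches subsample by factors of 2 and 4 (roughly word- and utterance-scale motion), and each slower stream is upsampled back to the fast frame rate before a 1x1 convolution fuses it in, mirroring the staged fusion described in the abstract.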

History
  • Received: 2022-12-04
  • Revised: 2023-04-03
  • Accepted: 2023-05-04