Mandarin Lip Recognition Based on MSAF with Multimodal Tasks

Author:

Affiliation: 1. China Mobile Hangzhou Information Technology Co., Ltd.; 2. School of Intelligent Technology and Engineering, Chongqing University of Science and Technology, Chongqing 401331, P. R. China; 3. School of Automation, Chongqing University of Posts and Telecommunications, Chongqing 400065, P. R. China

CLC number: TN929

    Abstract:

    Multimodal lip reading aims to provide more accurate and robust speech recognition by combining lip-movement and speech information, and to help specific user groups understand and communicate more easily. However, most existing lip reading models are built for English datasets, and research on Chinese lip reading is still in its infancy. To address the open questions of how to process data features from different modalities, how to combine those features, and how to fuse the multimodal features fully, we propose the Multimodal Split Attention Fusion Audio Visual Recognition (MSAFVR) model. Experiments on the Chinese Mandarin Lip Reading (CMLR) dataset show that MSAFVR achieves 92.95% accuracy on Chinese lip reading, surpassing existing state-of-the-art Mandarin lip reading models.
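
    To make the fusion idea concrete, below is a minimal PyTorch sketch of split-attention-style fusion of pooled audio and visual features. It is an illustration only, not the authors' MSAFVR implementation: the module name, feature dimension, reduction ratio, and the omission of MSAF's channel-block splitting are all simplifying assumptions.

```python
# Illustrative sketch of split-attention-style audio-visual fusion
# (assumed shapes and names; simplified relative to the full MSAF module).
import torch
import torch.nn as nn


class SplitAttentionFusion(nn.Module):
    """Fuse per-modality feature vectors with channel-wise soft attention."""

    def __init__(self, dim: int = 512, reduction: int = 4, num_modalities: int = 2):
        super().__init__()
        # Shared bottleneck over the joint (summed) representation.
        self.squeeze = nn.Sequential(nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True))
        # One excitation branch per modality produces that modality's channel logits.
        self.excite = nn.ModuleList([nn.Linear(dim // reduction, dim) for _ in range(num_modalities)])

    def forward(self, feats):
        # feats: list of (batch, dim) tensors, one per modality (e.g. [audio, visual]).
        stacked = torch.stack(feats, dim=0)                   # (M, batch, dim)
        joint = stacked.sum(dim=0)                            # joint representation, (batch, dim)
        z = self.squeeze(joint)                               # (batch, dim // reduction)
        logits = torch.stack([b(z) for b in self.excite], 0)  # (M, batch, dim)
        attn = torch.softmax(logits, dim=0)                   # per-channel weights across modalities
        return (attn * stacked).sum(dim=0)                    # fused features, (batch, dim)


if __name__ == "__main__":
    audio = torch.randn(8, 512)   # pooled audio-branch features (batch of 8)
    visual = torch.randn(8, 512)  # pooled lip-region (visual) features
    fused = SplitAttentionFusion()([audio, visual])
    print(fused.shape)            # torch.Size([8, 512])
```

    The softmax across modalities lets the network decide, channel by channel, how much the fused representation draws from the audio branch versus the lip (visual) branch, which is the core intuition behind attention-based multimodal fusion.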

    References
    [1] Li Y K, Zhang X M. Lip landmark-based audio-visual speech enhancement with multimodal feature fusion network. Neurocomputing, 2023. https://doi.org/10.1016/j.neucom.2023.126432
    [2] Dupont S, Luettin J. Audio-visual speech modeling for continuous speech recognition. IEEE Transactions on Multimedia, 2000, 2: 141-151.
    [3] Zhang X B. Research and application of deep-learning-based lip reading and its model compression[D]. University of Electronic Science and Technology of China, 2023. DOI: 10.27005/d.cnki.gdzku.2022.000048.
    [4] He Q Y. Research on multimodal fusion recognition of lip reading and speech based on deep learning[D]. Yanshan University, 2023. DOI: 10.27440/d.cnki.gysdu.2022.000264.
    [5] Zhou T. Research on lip reading methods based on attention mechanism and temporal convolutional networks[D]. Chongqing University of Posts and Telecommunications, 2022. DOI: 10.27675/d.cnki.gcydx.2022.000166.
    [6] Zhang X, Song Y, Song T, et al. AKConv: Convolutional kernel with arbitrary sampled shapes and arbitrary number of parameters. arXiv preprint arXiv:2311.11587, 2023.
    [7] Zhao Y, Xu R, Song M. A cascade sequence-to-sequence model for Chinese Mandarin lip reading. Proceedings of the 1st ACM International Conference on Multimedia in Asia, 2019: 1-6.
    [8] Deng J, et al. RetinaFace: Single-stage dense face localisation in the wild. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 5203-5212.
    [9] Bulat A, Tzimiropoulos G. How far are we from solving the 2D & 3D face alignment problem? (And a dataset of 230,000 3D facial landmarks). Proceedings of the IEEE International Conference on Computer Vision, 2017.
    [10] Liu J Y, Chai P Q, Yao Q M. An effective solution to the polyphone problem in Chinese TTS systems[J]. Microcomputer Applications, 2005(4): 52-55, 66.
    [11] Chung J S, Senior A, Vinyals O, et al. Lip reading sentences in the wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Washington, USA: IEEE, 2017: 3444-3453.
    [12] Zhang X B, Gong H G, Dai X L, et al. Understanding pictograph with facial features: End-to-end sentence-level lip reading of Chinese. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 9211-9218.
    [13] Zhao Y, Xu R, Song M L. A cascade sequence-to-sequence model for Chinese Mandarin lip reading. Proceedings of the ACM Multimedia Asia. New York, USA: ACM, 2019. DOI: 10.1145/3338533.3366579.
    [14] Zhao Y, Xu R, Wang X, et al. Hearing lips: Improving lip reading by distilling speech recognizers. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(4): 6917-6924.
    [15] Watanabe S, et al. ESPnet: End-to-end speech processing toolkit. Proceedings of the 19th Annual Conference of the International Speech Communication Association (Interspeech), 2018: 2207-2211. Kingma D P, Ba J. Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
History
  • Received: 2024-06-14
  • Revised: 2025-01-06
  • Accepted: 2025-01-14