Mandarin lip recognition based on MSAF with multimodal task
Author:
Affiliation:

1.a. China Mobile Hangzhou Information Technology Co., Ltd.; 2.b. School of Intelligent Technology and Engineering, Chongqing University of Science and Technology, Chongqing 401331, P.R. China; 3.c. School of Automation, Chongqing University of Posts and Telecommunications, Chongqing 400065, P.R. China

CLC number: TN929

Abstract:

Multimodal lip reading aims to provide more accurate and robust speech recognition by combining lip movements with the audio signal, and to help specific user groups understand and communicate more easily. However, most existing lip reading models are built for English datasets, and research on Chinese lip reading is still at an early stage. To address the open questions in current Chinese lip reading models, namely how to process the features of each modality, how to combine features from different modalities, and how to fuse the multimodal features thoroughly, we propose the Multimodal Split Attention Fusion Audio Visual Recognition (MSAFVR) model. Experiments on the Chinese Mandarin Lip Reading (CMLR) dataset show that MSAFVR reaches 92.95% accuracy in Chinese lip reading, surpassing the existing state-of-the-art Mandarin lip reading models.
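As a rough illustration of the split-attention fusion idea the abstract refers to, the sketch below fuses an audio feature vector and a visual feature vector using channel-wise attention weights that are normalized across the two modalities. It is a minimal reconstruction under assumed settings (PyTorch, 512-dimensional utterance-level features, a squeeze-and-excite style bottleneck; the names SplitAttentionFusion, dim, and reduction are hypothetical), not the authors' MSAFVR implementation.

# Minimal sketch of a multimodal split-attention fusion block in PyTorch.
# Illustrative only, NOT the authors' released code: the two-modality setup,
# feature dimension, and reduction ratio are assumptions.
import torch
import torch.nn as nn


class SplitAttentionFusion(nn.Module):
    """Fuse audio and visual feature vectors with channel-wise attention
    weights normalized across the two modalities, so each channel is
    dominated by the modality that is more informative for it."""

    def __init__(self, dim: int = 512, reduction: int = 4):
        super().__init__()
        hidden = dim // reduction
        # Shared bottleneck summarizing the joint (summed) representation.
        self.squeeze = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(inplace=True),
        )
        # One excitation head per modality produces per-channel logits.
        self.excite_audio = nn.Linear(hidden, dim)
        self.excite_visual = nn.Linear(hidden, dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (batch, dim) utterance-level features.
        joint = self.squeeze(audio + visual)                  # (batch, hidden)
        logits = torch.stack(
            [self.excite_audio(joint), self.excite_visual(joint)], dim=1
        )                                                     # (batch, 2, dim)
        weights = torch.softmax(logits, dim=1)                # softmax over modalities
        fused = weights[:, 0] * audio + weights[:, 1] * visual
        return fused                                          # (batch, dim)


if __name__ == "__main__":
    fusion = SplitAttentionFusion(dim=512)
    a = torch.randn(8, 512)    # audio features (e.g. from an audio encoder)
    v = torch.randn(8, 512)    # visual features (e.g. from a lip-region encoder)
    print(fusion(a, v).shape)  # torch.Size([8, 512])

Normalizing the attention weights across modalities rather than within each modality is what lets the block arbitrate, channel by channel, between the audio and visual streams, which is the kind of cross-modal fusion the abstract attributes to split attention.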

Article history
  • Received: 2024-06-14
  • Revised: 2025-01-06
  • Accepted: 2025-01-14