Mandarin lip recognition based on MSAF with multimodal tasks
Affiliation:

1. China Mobile Hangzhou Information Technology Co., Ltd.
2. School of Intelligent Technology and Engineering, Chongqing University of Science and Technology, Chongqing 401331, P.R. China
3. School of Automation, Chongqing University of Posts and Telecommunications, Chongqing 400065, P.R. China

CLC Number: TN929

Abstract:

Multimodal lip recognition aims to improve the accuracy and robustness of speech recognition by integrating lip movements with speech information, and it can also help specific user groups communicate. However, existing lip-reading models focus predominantly on English datasets, leaving research on Chinese lip recognition at a nascent stage. To address the challenges of handling data features across different modalities, integrating those features, and achieving comprehensive fusion of multimodal features, we propose the Multimodal Split Attention Fusion Audio Visual Recognition (MSAFVR) model. In experiments on the Chinese Mandarin Lip Reading (CMLR) dataset, MSAFVR achieves 92.95% accuracy in Chinese lip reading, surpassing state-of-the-art Mandarin lip-reading models.
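
The abstract names a multimodal split attention fusion (MSAF) block as the core of MSAFVR but gives no implementation details. The following is a minimal PyTorch sketch of how split-attention fusion of audio and visual features is commonly structured: each modality's feature vector is split into equal channel blocks, the blocks are pooled into a joint summary, and a shared bottleneck drives per-modality gates that re-weight each modality before fusion. All names (SplitAttentionFusion, excite_audio, etc.), dimensions, and the sigmoid gating are illustrative assumptions, not the authors' released code.

    # Minimal sketch of a split-attention fusion block (assumed design, not the MSAFVR source).
    import torch
    import torch.nn as nn

    class SplitAttentionFusion(nn.Module):
        def __init__(self, channels: int, num_blocks: int = 4, reduction: int = 4):
            super().__init__()
            assert channels % num_blocks == 0, "channels must divide evenly into blocks"
            self.num_blocks = num_blocks
            self.block_dim = channels // num_blocks
            hidden = max(self.block_dim // reduction, 8)
            # Shared bottleneck over the joint (pooled) block representation.
            self.squeeze = nn.Sequential(nn.Linear(self.block_dim, hidden), nn.ReLU(inplace=True))
            # One attention head per modality, producing per-block channel weights.
            self.excite_audio = nn.Linear(hidden, self.block_dim)
            self.excite_visual = nn.Linear(hidden, self.block_dim)

        def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
            # audio, visual: (batch, channels) feature vectors from each encoder.
            b = audio.size(0)
            a = audio.view(b, self.num_blocks, self.block_dim)
            v = visual.view(b, self.num_blocks, self.block_dim)
            joint = (a + v).mean(dim=1)                 # (batch, block_dim) joint summary
            h = self.squeeze(joint)                     # shared bottleneck
            wa = torch.sigmoid(self.excite_audio(h)).unsqueeze(1)   # (batch, 1, block_dim)
            wv = torch.sigmoid(self.excite_visual(h)).unsqueeze(1)
            a = (a * wa).reshape(b, -1)                 # re-weighted audio features
            v = (v * wv).reshape(b, -1)                 # re-weighted visual features
            return a + v                                # fused representation

    # Usage: fuse 512-d audio and visual embeddings from two encoders.
    fusion = SplitAttentionFusion(channels=512, num_blocks=4)
    fused = fusion(torch.randn(2, 512), torch.randn(2, 512))
    print(fused.shape)  # torch.Size([2, 512])

Because the gates for both modalities are computed from the same joint summary, each modality's weighting depends on what the other modality observed, which is the property that distinguishes this kind of fusion from simple concatenation or per-modality squeeze-and-excitation.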

History
  • Received: June 14, 2024
  • Revised: January 6, 2025
  • Accepted: January 14, 2025