Abstract: Multimodal lip reading aims to improve the accuracy and robustness of speech recognition by integrating lip-movement and audio information, and it can also assist specific user groups in communication. However, existing lip-reading models focus predominantly on English datasets, and research on Chinese lip reading remains at an early stage. To address the challenges of extracting features from different modalities, integrating them, and achieving comprehensive multimodal fusion, we propose the Multimodal Split Attention Fusion Audio Visual Recognition (MSAFVR) model. In experiments on the Chinese Mandarin Lip Reading (CMLR) dataset, MSAFVR achieves 92.95% accuracy in Chinese lip reading, surpassing state-of-the-art Mandarin lip-reading models.