Voiceprint recognition based on knowledge distillation and ResNet

CLC number: TP751

Funding: Ministry of Education-China Mobile Research Fund Project (MCM20180404); National Natural Science Foundation of China (52272388).
    Abstract:

    To address channel mismatch in voiceprint recognition and the incomplete extraction of voiceprint features from short or noisy speech, this paper proposes a method that combines a traditional approach with deep learning: an I-Vector model serves as the teacher and transfers knowledge to a ResNet student model through knowledge distillation. A ResNet network based on metric learning is constructed, and an attentive statistics pooling layer is introduced to capture and emphasize the important information in voiceprint features, improving their distinguishability. A joint training loss function is designed that combines the mean square error (MSE) with a metric-learning loss, reducing computational complexity and strengthening the model's learning ability. Finally, the trained model is evaluated on voiceprint recognition and compared with several deep-learning-based voiceprint recognition models. The equal error rate (EER) is reduced by at least 8% and reaches 3.229%, indicating that the model performs voiceprint recognition more effectively.
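The joint training loss described in the abstract, which combines an MSE term with a metric-learning term, might look roughly like the sketch below. The abstract does not say which metric-learning loss is used, how the two terms are weighted, or whether the student embedding and the teacher i-vector share a dimension, so the cosine triplet loss, the weight `alpha`, the `margin`, and all function names here are illustrative assumptions rather than the authors' formulation.

```python
import math


def mse(student, teacher):
    """Mean squared error between a student embedding and a teacher i-vector.

    Assumes both vectors have the same dimension (an assumption of this sketch).
    """
    if len(student) != len(teacher):
        raise ValueError("embeddings must have the same dimension")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Cosine-distance triplet loss, a stand-in for the metric-learning term."""
    d_pos = 1.0 - cosine(anchor, positive)  # distance to same-speaker sample
    d_neg = 1.0 - cosine(anchor, negative)  # distance to other-speaker sample
    return max(0.0, d_pos - d_neg + margin)


def joint_loss(anchor, positive, negative, teacher_ivector,
               alpha=0.5, margin=0.2):
    """Weighted sum of the distillation (MSE) term and the metric term.

    alpha is a hypothetical weighting; the source does not give one.
    """
    distill = mse(anchor, teacher_ivector)
    metric = triplet_loss(anchor, positive, negative, margin)
    return alpha * distill + (1.0 - alpha) * metric


if __name__ == "__main__":
    # Toy 2-D embeddings: the anchor matches both the positive and the teacher.
    print(joint_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]))
```

In this toy call both terms vanish, since the student embedding equals the teacher vector and the negative sample is already further away than the margin. In a real training loop the same combination would normally be computed on batches with an automatic-differentiation framework.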

Cite this article

荣玉军,方昳凡,田鹏,程家伟. Voiceprint recognition based on knowledge distillation and ResNet[J]. Journal of Chongqing University, 2023, 46(1):113-124.

History
  • Received: 2021-07-12
  • Published online: 2023-02-06