Voiceprint recognition based on knowledge distillation and ResNet
CLC Number: TP751

    Abstract:

    To address channel mismatch in voiceprint recognition and the incomplete extraction of voiceprint features from short or noisy speech, a method that combines traditional methods with deep learning is proposed: a ResNet model serves as the student model and is trained by knowledge distillation from an I-Vector teacher model. We construct a ResNet network based on metric learning and introduce an attentive statistics pooling layer, which captures and emphasizes the important information in voiceprint features and improves their discriminability. The mean square error (MSE) loss is combined with a metric-learning loss to reduce computational complexity and enhance the model's learning capability. Finally, the trained model is tested on voiceprint recognition and compared with voiceprint recognition models built on a variety of deep learning methods. The equal error rate (EER) is reduced by at least 8% and reaches 3.229%, indicating that the model performs speaker verification more effectively.
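The combination of an MSE distillation term with a metric-learning term described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not specify the exact metric-learning loss, the weighting, or the embedding dimensions, so the cosine triplet loss, the `alpha` weight, the `margin` value, and the assumption that the student embedding and the teacher i-vector share the same dimension are all hypothetical choices made here for illustration, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row vector to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def mse_distillation_loss(student_emb, teacher_ivec):
    """Mean squared error between student embeddings and teacher i-vectors.

    Assumption: both arrays have shape (batch, dim) with the same dim.
    """
    return float(np.mean((student_emb - teacher_ivec) ** 2))

def triplet_cosine_loss(anchor, positive, negative, margin=0.2):
    """Hypothetical stand-in for the metric-learning loss: a cosine triplet loss.

    Penalizes triplets where the anchor is not closer (in cosine similarity)
    to the same-speaker sample than to the different-speaker sample by `margin`.
    """
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    pos_sim = np.sum(a * p, axis=-1)
    neg_sim = np.sum(a * n, axis=-1)
    return float(np.mean(np.maximum(0.0, margin - pos_sim + neg_sim)))

def combined_loss(anchor, positive, negative, teacher_ivec, alpha=1.0, margin=0.2):
    """Metric-learning loss plus alpha-weighted MSE distillation on the anchor."""
    metric = triplet_cosine_loss(anchor, positive, negative, margin)
    distill = mse_distillation_loss(anchor, teacher_ivec)
    return metric + alpha * distill

# Toy usage: random vectors standing in for ResNet embeddings and i-vectors.
rng = np.random.default_rng(0)
anchor = rng.normal(size=(4, 8))
positive = anchor + 0.05 * rng.normal(size=(4, 8))
negative = rng.normal(size=(4, 8))
teacher = anchor + 0.1 * rng.normal(size=(4, 8))
loss = combined_loss(anchor, positive, negative, teacher, alpha=0.5)
```

In a real training setup these terms would be computed on framework tensors so that gradients flow into the student network; the sketch only shows how the two terms are combined into one scalar.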

Get Citation

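Rong Yujun, Fang Yifan, Tian Peng, Cheng Jiawei. Voiceprint recognition based on knowledge distillation and ResNet[J]. Journal of Chongqing University, 2023, 46(1):113-124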

History
  • Received: July 12, 2021
  • Online: February 06, 2023