CLC number: TP391


Speaker diarization based on an attention residual network
Author:
Affiliation:

1. School of Automation, Chongqing University of Posts and Telecommunications; 2. School of Management Science and Real Estate, Chongqing University; 3. China Mobile Hangzhou Information Technology Co., Ltd.

Fund Project:

Supported by the Ministry of Education-China Mobile Research Fund (2018) R&D Project (MCM20180404).


Abstract:

Existing speaker feature extraction networks assign every speech frame the same weight, ignoring the differences among frames, which leaves speaker diarization performance short of ideal. To address this problem, and drawing on the strong performance of residual network structures in visual tasks and their scalable design, a model combining residual connections, asymmetric convolution, and an attention mechanism is proposed. A speaker diarization network built on the ResA2Net module is constructed to extract speaker features; its residual connections reduce computational complexity and strengthen the model's learning ability. An attention module is introduced to capture and emphasize the important information in the speaker features, improving their discriminability. Affinity propagation is then used to cluster the speakers. Finally, the trained model is tested on speaker diarization and compared with several other diarization models; its diarization error rate (DER) falls to 7.34%, lower than that of the other models, indicating that the proposed model performs speaker diarization effectively.
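The abstract's headline result is a diarization error rate (DER) of 7.34%. As a reminder of how that metric is defined (the standard NIST Rich Transcription formulation, not code from this paper), here is a minimal sketch in Python; the duration values below are hypothetical and chosen only to illustrate the formula:

```python
def der(false_alarm, missed, confusion, total_scored_speech):
    """Diarization Error Rate (NIST RT definition): the fraction of scored
    speech time that is misattributed, i.e. false-alarm speech time plus
    missed speech time plus speaker-confusion time, divided by the total
    scored speech time in the reference."""
    return (false_alarm + missed + confusion) / total_scored_speech

# Hypothetical durations in seconds, for illustration only:
print(der(false_alarm=12.0, missed=20.0, confusion=4.7,
          total_scored_speech=500.0))  # -> 0.0734, i.e. 7.34%
```

Note that all three error terms are measured in speech time, not in frames, so a system is penalized in proportion to how long it is wrong, regardless of frame rate.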

History
  • Received: 2021-09-25
  • Revised: 2021-11-14
  • Accepted: 2021-11-24