Speaker diarization based on attention residual network

Affiliation:

1. School of Automation, Chongqing University of Posts and Telecommunications; 2. School of Management Science and Real Estate, Chongqing University; 3. China Mobile Hangzhou Information Technology Co., Ltd.

CLC number:

TP391

Fund Project:

Supported by the Ministry of Education - China Mobile Research Fund (2018) R&D project (MCM20180404).

    Abstract:

    Speaker feature extraction networks typically ignore the differences between speech frames and assign every frame the same weight, which limits speaker diarization performance. To address this problem, and drawing on the strong performance of residual network structures in visual tasks and their scalable design, a model that combines residual connections, asymmetric convolution, and an attention mechanism is proposed. A speaker diarization network architecture built from ResA2Net modules extracts the speaker features; the residual connection structure reduces computational complexity and enhances the learning ability of the model. An attention module is introduced to capture and emphasize the important information in the speaker features, making them more discriminative. The affinity propagation algorithm is then used to cluster the speakers. Finally, the trained model is evaluated on a speaker diarization test and compared with several other speaker diarization models: its diarization error rate (DER) falls to 7.34%, lower than that of the other models, indicating that the proposed model performs speaker diarization effectively.
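The abstract does not specify how the attention module is built, so the following is only a minimal sketch of the general idea it describes: weighting frames unequally when pooling them into a single speaker feature. It assumes a plain softmax over per-frame scores; the function name `attentive_pooling` and the hand-picked scores are hypothetical and are not taken from the paper, where a learned layer inside the network would produce the scores.

```python
import math


def attentive_pooling(frames, scores):
    """Pool frame-level features into one utterance-level vector.

    frames: list of T feature vectors (each a list of D floats).
    scores: list of T unnormalised attention scores, one per frame
            (hypothetical here; a trained network would compute them).
    Returns the pooled vector and the softmax weights that were used.
    """
    if not frames or len(frames) != len(scores):
        raise ValueError("need at least one frame and one score per frame")
    # Softmax over time, shifted by the max score for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the frames, computed dimension by dimension.
    dim = len(frames[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


# Three toy 2-dimensional frames; the second frame gets the highest score.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attentive_pooling(frames, scores=[0.0, 2.0, 0.0])

# Equal scores reduce to plain averaging, i.e. every frame weighted the same.
uniform, uniform_weights = attentive_pooling(frames, scores=[0.0, 0.0, 0.0])
```

With equal scores the pooling collapses to a simple mean over frames, which is the equal-weight behaviour the abstract identifies as the limitation; unequal scores shift the pooled feature toward the frames judged more informative.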

History
  • Received: 2021-09-25
  • Last revised: 2021-11-14
  • Accepted: 2021-11-24