CLC number: TP391


Speaker diarization based on an attention residual network
Author:
Affiliation:

1. School of Automation, Chongqing University of Posts and Telecommunications; 2. School of Management Science and Real Estate, Chongqing University; 3. China Mobile Hangzhou Information Technology Co., Ltd.

Fund Project:

Supported by the Ministry of Education-China Mobile Research Fund (2018) R&D Project (MCM20180404).


Abstract:

Existing speaker feature extraction networks assign every speech frame the same weight, ignoring the differences among frames, which leaves speaker diarization performance short of ideal. To address this problem, and drawing on the strong performance of residual network structures in visual tasks and their scalable design, a model combining residual connections, asymmetric convolution, and an attention mechanism is proposed. A speaker diarization network built on the ResA2Net module is constructed to extract speaker features; its residual connections reduce computational complexity and strengthen the model's learning ability. An attention module is introduced to capture and emphasize the important information in the speaker features, improving their discriminability. Affinity propagation is then used to cluster the speakers. Finally, the trained model is tested on speaker diarization and compared with several other diarization models; its diarization error rate (DER) falls to 7.34%, lower than that of the other models, indicating that the proposed model performs speaker diarization effectively.
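The abstract's headline result is a diarization error rate (DER) of 7.34%. As a reminder of how that metric is defined (the standard NIST Rich Transcription formulation, not code from this paper), here is a minimal sketch in Python; the duration values below are hypothetical and chosen only to illustrate the formula:

```python
def der(false_alarm, missed, confusion, total_scored_speech):
    """Diarization Error Rate (NIST RT definition): the fraction of scored
    speech time that is misattributed, i.e. false-alarm speech time plus
    missed speech time plus speaker-confusion time, divided by the total
    scored speech time in the reference."""
    return (false_alarm + missed + confusion) / total_scored_speech

# Hypothetical durations in seconds, for illustration only:
print(der(false_alarm=12.0, missed=20.0, confusion=4.7,
          total_scored_speech=500.0))  # -> 0.0734, i.e. 7.34%
```

Note that all three error terms are measured in speech time, not in frames, so a system is penalized in proportion to how long it is wrong, regardless of frame rate.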

History
  • Received: 2021-09-25
  • Revised: 2021-11-14
  • Accepted: 2021-11-24