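[Keywords]
[Abstract]
Speaker feature extraction networks typically ignore the differences among speech frames and assign every frame the same weight, which leads to unsatisfactory speaker diarization. To address this problem, and drawing on the strong performance of residual network structures in vision tasks and scalable design, a model that combines residual connections, asymmetric convolution, and an attention mechanism is proposed. A speaker diarization network architecture based on the ResA2Net module is built to extract speaker features; the residual connection structure reduces computational complexity and strengthens the model's learning ability. An attention module is introduced to capture and emphasize the important information in the speaker features, improving their discriminability. The affinity propagation algorithm is used to cluster the speakers. Finally, the trained model is tested on speaker diarization and compared with several speaker diarization models: the diarization error rate (DER) falls to 7.34%, lower than that of the other models, showing that the model performs speaker diarization effectively.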
[Key words]
[Abstract]
Speaker feature extraction networks do not consider the differences among speech frames and give each frame the same weight, so speaker diarization results are not ideal. To address this problem, and building on the outstanding performance of the residual network structure in visual tasks and scalable design, a method that combines residual connections, asymmetric convolution, and an attention mechanism is proposed. A speaker diarization network architecture based on the ResA2Net module was constructed to extract speaker features. The residual connection structure was used to reduce computational complexity and enhance the model's learning ability. An attention module was introduced to capture and emphasize the critical information in the speaker features and to improve their distinguishability. The affinity propagation algorithm was used to cluster the speakers. Finally, the trained model was tested on speaker diarization. Compared with various speaker diarization models, the ResA2Net model reduces the diarization error rate (DER) to 7.34%, which is lower than that of the other models, indicating that the model performs speaker diarization effectively.
[CLC number]
TP391
[Funding]
Ministry of Education-China Mobile Research Fund (2018) R&D Project (MCM20180404).
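The abstract's central motivation, that frames should not all receive the same weight, can be illustrated with a minimal sketch of attention-weighted pooling over frame-level features. This is not the ResA2Net attention module itself, whose details the abstract does not give: the function names and the scoring vector `w` are hypothetical, and in a real model `w` would be learned rather than fixed by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(frames):
    """Baseline: every frame contributes with the same weight 1/T."""
    num_frames = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / num_frames for d in range(dim)]


def attentive_pool(frames, w):
    """Pool T frame vectors into one vector using per-frame attention weights.

    frames: list of T feature vectors, each of length D
    w:      hypothetical scoring vector of length D (assumed, not from the paper)
    Returns (pooled_vector, weights).
    """
    # One scalar relevance score per frame (dot product with w).
    scores = [sum(x * y for x, y in zip(frame, w)) for frame in frames]
    # Normalize the scores into weights that sum to 1.
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [
        sum(weights[t] * frames[t][d] for t in range(len(frames)))
        for d in range(dim)
    ]
    return pooled, weights


# Toy input: three 2-dimensional frames; w favours the first dimension.
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
w = [2.0, 0.0]
pooled, weights = attentive_pool(frames, w)
```

With the toy input, the first frame scores highest and therefore dominates the pooled vector, whereas `mean_pool` treats all three frames alike. Setting `w` to all zeros makes the attention weights uniform, and the result coincides with mean pooling.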