Abstract: Existing speaker feature extraction networks do not consider the differences between speech frames and assign every frame the same weight, which degrades speaker diarization results. To address this problem, and motivated by the strong performance and scalable design of residual network structures in visual tasks, we propose a method that combines residual connections, asymmetric convolution, and an attention mechanism. A speaker diarization network architecture based on the ResA2Net module is constructed to extract speaker features. The residual connection structure is used to reduce computational complexity and enhance the model's learning ability. An attention module is introduced to capture and emphasize the critical information in the speaker characteristics and to improve their distinguishability. The affinity propagation algorithm is then used to cluster the speakers. Finally, the trained model is evaluated on a speaker diarization test. Compared with various speaker diarization models, the ResA2Net model achieves a diarization error rate (DER) of 7.34%, which is lower than that of the other models, indicating that our model performs effectively on the speaker diarization task.
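
The contrast between uniform frame weighting and attention-based weighting can be illustrated with a minimal sketch. The abstract does not specify the form of the attention module, so the code below assumes a generic softmax-weighted pooling over frame features; it is not the ResA2Net attention module itself. The function names, the scoring vector `w`, and the toy frame values are all hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of per-frame scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(frames):
    # Baseline: every frame contributes with the same weight 1/T.
    t = len(frames)
    return [sum(col) / t for col in zip(*frames)]

def attentive_pool(frames, w):
    # Assumed scoring rule: dot product of each frame with a vector `w`
    # (learned in a real model, fixed by hand here), followed by softmax.
    scores = [sum(x * wi for x, wi in zip(frame, w)) for frame in frames]
    weights = softmax(scores)
    # Weighted sum of frames, computed feature by feature.
    pooled = [sum(a * x for a, x in zip(weights, col)) for col in zip(*frames)]
    return pooled, weights

# Toy input: T=3 frames with D=2 features each (illustrative values only).
frames = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
w = [1.0, 1.0]
embedding, weights = attentive_pool(frames, w)
```

With these toy values the third frame receives the largest score and therefore dominates the pooled embedding, whereas `mean_pool` treats all three frames identically. Setting `w` to all zeros makes every score equal, so the attentive pooling reduces to the uniform baseline.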