Spectrogram texture pattern guided vocal separation
Affiliation:

School of Microelectronics, Tianjin University

Fund Project:

National Natural Science Foundation of China (General Program, Key Program, Major Program); Key Program of the Natural Science Foundation of Tianjin

    Abstract:

    Vocals and accompaniment each exhibit distinctive spectral-line texture patterns, but on the spectrogram their spectral lines often overlap and intertwine, which makes separating vocals and accompaniment from monaural audio very difficult. To address this, a stacked hourglass network that integrates multi-resolution attention and multi-channel cross-attention is proposed to finely characterize the spectral-line texture patterns of vocals and accompaniment. First, because vocals and accompaniment differ in spectral-line density along the frequency dimension of the spectrogram, multi-resolution attention is applied to decoder features at different resolutions, so that each source's time-frequency texture pattern is represented at a suitable resolution. Second, multi-channel cross-attention is proposed to better represent the accompaniment's instantaneous time-frequency characteristics along the frequency dimension and its flat, sustained characteristics along the time dimension, so that the accompaniment's spectrogram features are extracted effectively. Experiments on the MIR-1K dataset show that, compared with the state-of-the-art SHN model, the proposed network reduces the parameter count by about 33% and improves the global normalized source-to-distortion ratio (GNSDR) by 1.35 dB for vocals and 0.89 dB for accompaniment. These results indicate that adequately representing the spectrogram features of different sources can further improve the separation of vocals and accompaniment.
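For readers unfamiliar with the metric reported above, the following is a minimal sketch of how GNSDR is commonly defined: the NSDR of a clip is the SDR of the separated estimate minus the SDR of the unprocessed mixture, and GNSDR is the clip-length-weighted mean of NSDR over the test set. The function names (`sdr_db`, `nsdr_db`, `gnsdr_db`) and the toy signals are illustrative assumptions, and the SDR used here is a plain energy ratio rather than the full BSS Eval decomposition, so it is not necessarily the evaluation code behind the figures in the abstract.

```python
import math


def sdr_db(reference, estimate):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Assumption: a plain energy ratio, not the full BSS Eval decomposition.
    """
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if signal_energy == 0.0 or error_energy == 0.0:
        raise ValueError("SDR is undefined for zero signal or zero error energy")
    return 10.0 * math.log10(signal_energy / error_energy)


def nsdr_db(reference, estimate, mixture):
    """Normalized SDR: gain of the estimate over using the raw mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


def gnsdr_db(clips):
    """Length-weighted mean NSDR over (reference, estimate, mixture) clips."""
    total_length = sum(len(ref) for ref, _, _ in clips)
    weighted_sum = sum(
        len(ref) * nsdr_db(ref, est, mix) for ref, est, mix in clips
    )
    return weighted_sum / total_length


# Toy data (hypothetical): a 4-sample clip with a good estimate and a
# 2-sample clip whose "estimate" is just the mixture (no improvement).
clip_a = ([1.0, -1.0, 1.0, -1.0], [1.1, -0.9, 1.1, -0.9], [2.0, 0.0, 2.0, 0.0])
clip_b = ([1.0, 1.0], [2.0, 2.0], [2.0, 2.0])
print(f"GNSDR over toy clips: {gnsdr_db([clip_a, clip_b]):.2f} dB")
```

Because the weights are clip lengths, longer clips contribute more to the final figure; the same computation is applied separately to the vocal and accompaniment outputs to obtain one GNSDR value per source.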

History
  • Received: 2024-07-20
  • Last revised: 2024-11-24
  • Accepted: 2025-02-13