Abstract: Although vocals and accompaniment each exhibit distinctive spectral textures, their spectral lines frequently overlap and intertwine on the spectrogram, which makes separating vocals and accompaniment from monaural audio very difficult. To finely characterize the texture features of vocal and accompaniment spectral lines, a stacked hourglass network that integrates multi-resolution attention and multi-channel cross-attention is proposed. First, to address the difference in spectral-line density between vocals and accompaniment along the frequency dimension of the spectrogram, multi-resolution attention is applied to decoder features at different resolutions, so that the time-frequency texture patterns of the vocals and accompaniment are each represented at an appropriate resolution. Second, multi-channel cross-attention is proposed to better capture the transient time-frequency characteristics of the accompaniment along the frequency dimension and its flat, sustained characteristics along the time dimension, thereby extracting the accompaniment's spectrogram features effectively. Experimental results on the MIR-1K dataset show that, compared with the current state-of-the-art model SHN, the proposed network reduces the number of parameters by about 33% and improves the global normalized source-to-distortion ratio (GNSDR) by 1.35 dB for vocals and 0.89 dB for accompaniment. These results demonstrate that fully representing the spectrogram features of the different sources can further improve the separation of vocals and accompaniment.
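To make the idea of attending along the two axes of a spectrogram concrete, the following is a minimal NumPy sketch, not the paper's implementation: a toy scaled dot-product self-attention applied once across frequency bins and once across time frames of a magnitude spectrogram. All shapes and function names here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(spec, axis):
    """Toy self-attention along one axis of a (freq, time) spectrogram.

    axis=0: frequency bins attend to each other (frequency-wise view);
    axis=1: time frames attend to each other (time-wise view).
    This is an illustrative sketch, not the model in the abstract.
    """
    x = spec if axis == 0 else spec.T          # rows = attending items
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)              # pairwise similarities
    out = softmax(scores, axis=-1) @ x         # weighted sum of values
    return out if axis == 0 else out.T

rng = np.random.default_rng(0)
spec = np.abs(rng.standard_normal((64, 128)))  # 64 freq bins, 128 frames
freq_attn = axis_attention(spec, axis=0)       # attends across frequency
time_attn = axis_attention(spec, axis=1)       # attends across time
print(freq_attn.shape, time_attn.shape)        # both (64, 128)
```

In the proposed model, views like these would be produced per channel and fused by cross-attention; the sketch only shows why the frequency-wise and time-wise views capture different structure in the same spectrogram.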