Abstract: Food image segmentation plays an important role in food volume estimation, but its performance still leaves much room for improvement due to the fine structure of food and challenges that arise during image capture, such as blurred boundaries and overexposure. To address these problems, a complementary-fusion RGB-D food image segmentation network based on an attention mechanism (RGB-D ABCFNet) is proposed. The network adopts a U-shaped structure consisting of an encoding stage and a decoding stage. In the encoding stage, the proposed Expand Head Channel Attention Module (EHCAM) extracts the channel features of the depth map that are most helpful for segmentation, so that the depth features complement the RGB feature maps through layer-by-layer addition. In the decoding stage, the proposed Multi-Head Spatial Attention Module (MHSAM) allows detailed and positional information to be well recovered, so that the extracted semantic features map more accurately to the segmentation results. In addition, a multi-class food semantic segmentation dataset, Nutrition-Pix, is constructed, and extensive comparison and ablation experiments on it show that the proposed model outperforms current methods, achieving an mIoU of 87.5%.
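
As a minimal illustrative sketch only (not the authors' EHCAM implementation, whose exact architecture is defined later in the paper), the general idea of reweighting depth-map channels with a learned attention vector and then additively fusing them with RGB features can be expressed as follows; the bottleneck-MLP weights `w1`, `w2` and the reduction ratio are hypothetical placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention_fusion(rgb_feat, depth_feat, w1, w2):
    """Weight depth channels by globally pooled attention, then add to RGB.

    rgb_feat, depth_feat: (C, H, W) feature maps at the same encoder level.
    w1: (C, C//r), w2: (C//r, C) -- a small bottleneck MLP standing in for
    the paper's channel-attention module (hypothetical, for illustration).
    """
    pooled = depth_feat.mean(axis=(1, 2))          # global average pool -> (C,)
    attn = sigmoid(pooled @ w1 @ w2)               # per-channel weights in (0, 1)
    weighted = depth_feat * attn[:, None, None]    # emphasize informative depth channels
    return rgb_feat + weighted                     # layer-by-layer additive fusion

# toy example with random features
rng = np.random.default_rng(0)
C, r = 8, 2
rgb = rng.normal(size=(C, 16, 16))
depth = rng.normal(size=(C, 16, 16))
fused = channel_attention_fusion(rgb, depth,
                                 rng.normal(size=(C, C // r)),
                                 rng.normal(size=(C // r, C)))
print(fused.shape)  # (8, 16, 16)
```

The fused map keeps the RGB feature shape, so this fusion can be repeated at every encoder level, matching the layer-by-layer addition described in the abstract.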