[Keywords]
[Abstract]
Sentence sentiment classification aims to mine the emotional semantics of text, and deep network models based on BERT currently achieve the best performance on this task. However, the performance of such models depends heavily on large amounts of high-quality labeled data, while labeled samples are often scarce in practice, so deep neural networks (DNNs) easily overfit on small training sets and struggle to capture the implicit sentiment features of sentences. Although existing semi-supervised models make effective use of the features of unlabeled samples, they still fail to handle the gradual accumulation of errors introduced by pseudo-labeled samples; moreover, after predicting on the test set, they do not re-evaluate or revise the previous labeling results and therefore cannot fully exploit the feature information in the test data. To address these problems, this paper proposes a new semi-supervised sentence sentiment classification model. First, a weighting mechanism based on the K-nearest-neighbor algorithm is proposed, which assigns higher weights to high-confidence samples so as to minimize the propagation of erroneous information during training. Second, a two-stage training strategy is adopted so that the model can promptly correct mispredicted samples in the test data. Finally, experiments on multiple datasets show that the proposed model achieves good performance even on small training sets.
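The abstract describes the K-nearest-neighbor weighting mechanism only at a high level. Purely as an illustration (not the paper's actual implementation), the Python sketch below assumes each sentence is represented by a BERT embedding and computes a weight for every pseudo-labeled sample as the fraction of its k nearest labeled neighbors that agree with its pseudo-label; the function name knn_confidence_weights and the use of scikit-learn's NearestNeighbors are assumptions made for this example.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def knn_confidence_weights(unlabeled_emb, unlabeled_pseudo, labeled_emb, labeled_y, k=5):
        # Illustrative sketch: weight each pseudo-labeled sample by how many of its
        # k nearest labeled neighbors (in embedding space) share its pseudo-label,
        # so that low-confidence samples contribute less to the training loss.
        nn = NearestNeighbors(n_neighbors=k).fit(labeled_emb)
        _, idx = nn.kneighbors(unlabeled_emb)            # (n_unlabeled, k) neighbor indices
        neighbor_labels = labeled_y[idx]                 # labels of the k nearest labeled samples
        agreement = (neighbor_labels == unlabeled_pseudo[:, None]).mean(axis=1)
        return agreement                                 # weight in [0, 1] per pseudo-labeled sample

    # Toy usage: 2-D vectors stand in for BERT sentence embeddings.
    labeled_emb = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
    labeled_y = np.array([0, 0, 1, 1])
    unlabeled_emb = np.array([[0.05, 0.02], [0.95, 1.05]])
    pseudo = np.array([0, 0])   # the second pseudo-label disagrees with its neighborhood
    print(knn_confidence_weights(unlabeled_emb, pseudo, labeled_emb, labeled_y, k=2))
    # -> [1.0, 0.0]: the consistent sample keeps full weight, the doubtful one is down-weighted

In this toy setting the weight is simply the neighborhood agreement rate; the actual mechanism proposed in the paper may combine this signal with model confidence or use a different weighting function.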
[CLC number]
TP311
[Foundation item]
National Natural Science Foundation of China