Abstract
To address the shortage of pun samples, this study proposes a pun detection model based on pseudo-labels and transfer learning (PDPTL). The model uses contextual semantics, phoneme vectors, and an attention mechanism to generate pseudo labels; transfer learning combined with confidence is then used to select usable pseudo labels; finally, the pseudo-labeled data and the real data are mixed and fed into the network for training, and the pseudo-labeling and mixed-training process is repeated. This alleviates, to some extent, the problem that pun samples are scarce and hard to obtain. Pun detection experiments with this model on the SemEval 2017 shared task 7 and Pun of the Day datasets show that it outperforms existing mainstream pun detection methods.
With the continuous development of social media, people create a large amount of humorous content online. The structure of humor is often complex and relies on real-world knowledge. In natural language, the pun, a common rhetorical device, is an important form of humor. A pun blurs the true meaning of a word so that the same sentence admits two or more readings, giving the text different degrees of suggestiveness. Puns are a standard rhetorical source of humor in literature, advertising, and speeches. For example, puns are often used in advertising as a humorous device: they prompt the audience to associate the latent meaning of the pun, which both attracts attention and triggers associations, deepening the memory of the advertisement.
Puns are classically divided into homophonic (heterographic) puns and semantic (homographic) puns, illustrated by the following examples.

| Pun type | Examples |
|---|---|
| Homographic (semantic) puns | 1. What’s the longest sentence in the world? Life sentence. 2. Better late than the late. |
| Heterographic (homophonic) puns | 3. Seven days without water makes one weak (week). 4. A bicycle can't stand on its own because it is two-tyred (too tired). |
With the development of deep neural networks, most existing pun detection algorithms are based on neural networks. For example, Diao Yufeng et al. proposed WECA, a WordNet-encoded collocation-attention network for recognizing homographic puns.
This paper proposes a pun detection model based on pseudo-labels and transfer learning (PDPTL). It uses the overlapping information in unlabeled data to find more general features within data of the same kind, combines transfer learning with confidence to select usable pseudo labels, and repeats the pseudo-labeling and mixed-training process, alleviating to some extent the scarcity of pun samples and the limited generalization ability of models. Experiments show that PDPTL clearly improves prediction on public datasets and outperforms currently known methods.
Pun-related tasks involve pun detection and pun generation; this study mainly applies pseudo-labeling and transfer learning techniques to provide a new way of tackling the pun detection task.
Pedersen approached pun detection from lexical semantics, applying word sense disambiguation to identify words that admit more than one reading.
Lee proposed the pseudo-label method, a simple and efficient semi-supervised learning technique for deep neural networks: the class with the highest predicted probability on unlabeled data is taken as its label, and the pseudo-labeled data are trained jointly with the labeled data.
Qizhe Xie of Google AI proposed Noisy Student self-training: a teacher model pseudo-labels unlabeled images, and a noised student model is trained on the mixture, improving ImageNet classification.
Transfer learning aims to improve the performance of a target learner on a target domain by transferring knowledge contained in different but related source domains, reducing the target learner's dependence on large amounts of target-domain data.
Model construction: the pun detection model based on pseudo-labels and transfer learning (PDPTL).
Following the task definition of Zhou et al., consider a text containing $N$ words $X = \{x_1, x_2, \dots, x_N\}$. Each word $x_i$ has $M_i$ phonemes and, according to its pronunciation, can be represented as $\{p_{i,1}, p_{i,2}, \dots, p_{i,M_i}\}$, where $p_{i,j}$ denotes the $j$-th phoneme of the $i$-th word in the text. The phonemes come from the CMU pronouncing dictionary.
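As a concrete illustration, these phoneme sequences can be obtained from the CMU pronouncing dictionary through nltk, which appears in the paper's experiment environment. The lookup below is a minimal sketch; the preprocessing (lowercasing, taking the first pronunciation variant) is an assumption, not the authors' exact pipeline.

```python
# Minimal sketch: look up the phoneme sequence p_{i,j} of each word x_i
# in the CMU pronouncing dictionary shipped with nltk.
import nltk

nltk.download("cmudict", quiet=True)
from nltk.corpus import cmudict

PRON = cmudict.dict()  # word -> list of possible phoneme sequences

def phonemes(word):
    """Return one phoneme sequence for `word`, or [] if unknown."""
    entries = PRON.get(word.lower())
    return entries[0] if entries else []

sentence = "Seven days without water makes one weak".split()
for i, word in enumerate(sentence):
    print(i, word, phonemes(word))
# e.g. 'weak' -> ['W', 'IY1', 'K'], the same sounds as 'week'
```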
Base model: PDPTL adopts PCPR as its base model. PCPR uses BERT to produce the contextual semantic vector $e_i^{C}$ of each word in the input text.
Each phoneme of a word is projected into a $d_P$-dimensional vector $u_{i,j}$ by a Keras Embedding layer, and a local-attention mechanism then aggregates the phoneme embeddings of each word:

$a_{i,j} = F_A(u_{i,j})$ ,  (1)

$\alpha_{i,j} = \dfrac{\exp(v_A^{\top} a_{i,j})}{\sum_{k=1}^{M_i} \exp(v_A^{\top} a_{i,k})}$ ,  (2)

$e_i^{P} = \sum_{j=1}^{M_i} \alpha_{i,j}\, u_{i,j}$ ,  (3)

where $F_A(\cdot)$ is a fully connected layer that outputs $d_A$-dimensional vectors; $\alpha_{i,j}$ is the importance score of $p_{i,j}$; $v_A$ is a $d_A$-dimensional vector used to evaluate the importance of each phoneme embedding; and $d_A$ is the size of the local-attention mechanism defined by the model.
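A minimal PyTorch sketch of this phoneme-level local attention follows; the module name, the dimensions $d_P = 64$ and $d_A = 32$, and the absence of padding masks are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class PhonemeLocalAttention(nn.Module):
    """Eqs. (1)-(3): embed phonemes, score them with a fully connected
    layer F_A and importance vector v_A, and return one pronunciation
    embedding e_i^P per word as the attention-weighted sum."""

    def __init__(self, num_phonemes, d_p=64, d_a=32):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, d_p)  # phoneme id -> u_{i,j}
        self.f_a = nn.Linear(d_p, d_a)                # Eq. (1): F_A
        self.v_a = nn.Parameter(torch.randn(d_a))     # importance vector v_A

    def forward(self, phoneme_ids):
        # phoneme_ids: (num_words, max_phonemes) tensor of phoneme indices
        u = self.embed(phoneme_ids)                   # (W, M, d_p)
        a = self.f_a(u)                               # (W, M, d_a), Eq. (1)
        alpha = torch.softmax(a @ self.v_a, dim=-1)   # (W, M),     Eq. (2)
        return (alpha.unsqueeze(-1) * u).sum(dim=1)   # (W, d_p),   Eq. (3)

layer = PhonemeLocalAttention(num_phonemes=70)
print(layer(torch.randint(0, 70, (5, 6))).shape)  # torch.Size([5, 64])
```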
The contextual semantic vector $e_i^{C}$ and the pronunciation embedding vector $e_i^{P}$ are then concatenated into a joint embedding $e_i = [\,e_i^{C} ; e_i^{P}\,]$ (a $d$-dimensional vector, $d = d_C + d_P$), and a self-attention mechanism is applied over the joint embeddings of the text:

$q_i = W_Q e_i, \quad k_i = W_K e_i, \quad v_i = W_V e_i$ ,  (4)

$s_{i,j} = \mathrm{Attention}(q_i, k_j) = \dfrac{q_i^{\top} k_j}{\sqrt{d}}$ ,  (5)

$\beta_{i,j} = \dfrac{\exp(s_{i,j})}{\sum_{k=1}^{N} \exp(s_{i,k})}$ ,  (6)

$z_i = \sum_{j=1}^{N} \beta_{i,j}\, v_j$ ,  (7)

where $\mathrm{Attention}(\cdot)$ is the function used to estimate attention, $\beta_{i,j}$ is the importance score of each word, and $\sqrt{d}$ is a scaling factor that avoids overly small gradients. Finally, the self-attention output $z$ and the joint embedding $e$ are concatenated to generate the overall feature of the input text, the pronunciation-joint contextual semantic vector $u$:

$u = [\, z \,;\, e \,]$ .  (8)
The predicted label is given by a fully connected layer with softmax activation:

$\hat{y} = \mathrm{softmax}(W_o u + b_o)$ ,  (9)

where $\hat{y}$ yields the values of the two classes in the binary classification.
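A compact PyTorch sketch of Eqs. (4)-(9) as reconstructed above; the module name, the mean pooling used to reduce word-level vectors to one sentence vector, and the bias-free projections are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PunClassifierHead(nn.Module):
    """Self-attention over joint embeddings e_i = [e_i^C ; e_i^P]
    (Eqs. 4-7), concatenation into an overall feature (Eq. 8), and a
    softmax classifier (Eq. 9). Mean pooling to a sentence vector is
    an assumption; the paper does not spell out this step."""

    def __init__(self, d):
        super().__init__()
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k = nn.Linear(d, d, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)
        self.out = nn.Linear(2 * d, 2)  # two classes: pun / no pun

    def forward(self, e):
        # e: (N, d) joint embeddings for the N words of one text
        q, k, v = self.w_q(e), self.w_k(e), self.w_v(e)  # Eq. (4)
        s = q @ k.T / (e.size(-1) ** 0.5)                # Eq. (5)
        beta = torch.softmax(s, dim=-1)                  # Eq. (6)
        z = beta @ v                                     # Eq. (7)
        u = torch.cat([z.mean(0), e.mean(0)])            # Eq. (8), pooled
        return torch.softmax(self.out(u), dim=-1)        # Eq. (9)

head = PunClassifierHead(d=96)
print(head(torch.randn(12, 96)))  # probabilities of the two classes
```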
Pseudo-labeling: earlier pseudo-label learning methods usually filter pseudo labels by selecting high-confidence samples. The strategy rests on the cluster assumption: samples with high confidence are likely to belong to the same class. Concretely, a confidence threshold confidence_coefficient is set, and the model adds a generated pseudo label to the training data only when its probability exceeds confidence_coefficient.
The probability is obtained from the classifier output:

$p = \max\big(\hat{y}_0,\, \hat{y}_1\big)$ .  (10)
This strategy, however, has two problems: the threshold must be tuned largely by manual experimentation, and it overlooks a hidden danger, the "high-confidence trap": samples the model regards as high-confidence are not necessarily reliable, so mislabeled high-confidence samples end up in the training process. To filter out more reliable samples, the model combines the high-confidence strategy with MMD (maximum mean discrepancy) from transfer learning.
MMD, proposed by Gretton et al., measures how well the distributions of two datasets match and is commonly used for two-sample testing. The value is the distance between the two distributions in a reproducing kernel Hilbert space (RKHS): the smaller the value, the closer the distance and the more similar the two distributions. MMD is computed as
$\mathrm{MMD}(X, Y) = \left\| \dfrac{1}{n} \sum_{i=1}^{n} \phi(x_i) - \dfrac{1}{m} \sum_{j=1}^{m} \phi(y_j) \right\|_{\mathcal{H}}$ .  (11)
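The following is a small sketch of an empirical (biased) MMD estimate; the Gaussian kernel and its bandwidth sigma are assumptions, as the paper does not state which kernel it uses.

```python
import torch

def mmd(x, y, sigma=1.0):
    """Biased empirical MMD between samples x (n, d) and y (m, d),
    computed in the RKHS induced by a Gaussian kernel
    k(a, b) = exp(-||a - b||^2 / (2 sigma^2)). The kernel and its
    bandwidth are illustrative assumptions."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2              # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))

    kxx = kernel(x, x).mean()                    # E[k(x, x')]
    kyy = kernel(y, y).mean()                    # E[k(y, y')]
    kxy = kernel(x, y).mean()                    # E[k(x, y)]
    return (kxx + kyy - 2 * kxy).clamp_min(0).sqrt()  # ||mu_x - mu_y||_H

# Feature vectors of labeled data vs. pseudo-labeled data
labeled = torch.randn(100, 96)
pseudo = torch.randn(80, 96) + 0.5
print(mmd(labeled, pseudo).item())               # smaller = more similar
```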
The model's pseudo-label selection strategy works as follows: given an initial value of the confidence threshold confidence_coefficient, the threshold grows by a fixed step (speed); for each candidate threshold, the MMD distance between the pseudo-labeled data (Pseudo_label_data) selected under that threshold and the training data (labeled_data) is computed, and the threshold with the smallest MMD distance is taken as the final confidence threshold. The pseudo-labeled data selected this way are labeled and added to training. To ensure that the model learns correct knowledge as far as possible and learns enough from the labeled data, a weighted loss function is used: the weight of the pseudo-labeled data is set to zero before batch $T_1$ and then increased slowly until batch $T_2$, after which it stays constant at weight:
$w(t) = \begin{cases} 0, & t < T_1 \\ \dfrac{t - T_1}{T_2 - T_1}\,\mathrm{weight}, & T_1 \le t < T_2 \\ \mathrm{weight}, & T_2 \le t \end{cases}$  (12)
The loss function is cross-entropy. The losses of the real training data (labeled_data) and of the pseudo-labeled data (Pseudo_label_data) are computed separately and then combined with the weight to give the final loss Loss:
$\mathrm{Loss} = L_{\mathrm{labeled}} + w(t)\, L_{\mathrm{pseudo}}$ .  (13)
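A sketch of the ramped weight and the combined loss follows; the piecewise-linear form of Eq. (12) is inferred from Lee's pseudo-label schedule, with t1 = 2, t2 = 4, and weight = 0.84 taken from the hyperparameters reported in the experiments.

```python
import torch
import torch.nn.functional as F

def pseudo_weight(t, t1=2, t2=4, weight=0.84):
    """Eq. (12): ramp the pseudo-label weight from 0 (before batch t1)
    up to `weight` (from batch t2 on)."""
    if t < t1:
        return 0.0
    if t < t2:
        return (t - t1) / (t2 - t1) * weight
    return weight

def total_loss(logits_l, y_l, logits_p, y_p, t):
    """Eq. (13): cross-entropy on the labeled and pseudo-labeled
    batches, combined with the ramped weight w(t)."""
    loss_labeled = F.cross_entropy(logits_l, y_l)
    loss_pseudo = F.cross_entropy(logits_p, y_p)
    return loss_labeled + pseudo_weight(t) * loss_pseudo

# toy batch: 8 labeled and 8 pseudo-labeled examples, 2 classes
print(total_loss(torch.randn(8, 2), torch.randint(0, 2, (8,)),
                 torch.randn(8, 2), torch.randint(0, 2, (8,)), t=3))
```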
PDPTL: the overall framework is shown in Fig. 1, and each round proceeds in three steps:

图1 PDPTL框架图
Fig. 1 The frame work of PDPTL
1) train the base model on the labeled data to obtain a trained model;
2) use the trained model to predict the unlabeled data, obtaining pseudo-labeled data;
3) mix the labeled data with the selected pseudo-labeled data, use the mixture as the new labeled data to retrain the base model, and enter the next round.
Following the description above, Algorithm 1 shows the overall flow of PDPTL.
Algorithm 1
/*
times: number of rounds of pseudo-label updating
Base_Model: the base model
num_train_epochs: number of training epochs
eval_data: unlabeled data
eval(): evaluation function; given the model and unlabeled data, returns pseudo-labeled data
confidence_coefficient: initial confidence threshold
Best_MMD: smallest MMD distance found
Best_confidence_coefficient: best threshold
speed: step by which the threshold grows
*/
for index <- 0 to times:  /* repeat the pseudo-label update for `times` rounds */
{
    init Base_Model  /* reinitialize the base model */
    for epoch <- 0 to num_train_epochs:
    {
        train Base_Model with train_data_with_label  /* train on the current labeled training data */
    }
    data_with_pseudo_labels <- eval(Base_Model, eval_data)  /* predict pseudo labels for the unlabeled data */
    init train_data_with_label  /* reset the training data, i.e., remove last round's pseudo-labeled data */
    Now_confidence_coefficient <- confidence_coefficient
    while Now_confidence_coefficient <= 1:
    {
        for data_with_pseudo_label in data_with_pseudo_labels:  /* iterate over every pseudo-labeled example */
        {
            if probability of data_with_pseudo_label > Now_confidence_coefficient:
                add data_with_pseudo_label to pseudo_data_with_label  /* keep examples above the threshold */
        }
        MMD <- getMMD(train_data_with_label, pseudo_data_with_label)  /* MMD between the pseudo set and the training set */
        if Now_confidence_coefficient == confidence_coefficient or MMD < Best_MMD:
        {
            /* first pass, or the distance decreased: update the best threshold and best pseudo set */
            Best_MMD <- MMD
            Best_confidence_coefficient <- Now_confidence_coefficient
            best_pseudo_data_with_label <- pseudo_data_with_label
        }
        init pseudo_data_with_label  /* clear the current pseudo set */
        Now_confidence_coefficient <- Now_confidence_coefficient + speed  /* grow the threshold by `speed` */
    }
    add best_pseudo_data_with_label to train_data_with_label
}
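For reference, below is an executable sketch of the threshold sweep at the core of Algorithm 1; the (feature, pseudo label, probability) data layout and the Gaussian-kernel MMD are illustrative assumptions.

```python
import torch

def gaussian_mmd(x, y, sigma=1.0):
    """Biased Gaussian-kernel MMD estimate (same form as the earlier sketch)."""
    k = lambda a, b: torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return (k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()).clamp_min(0).sqrt()

def select_pseudo_labels(pseudo_items, labeled_feats,
                         confidence_coefficient=0.9997, speed=0.0001):
    """Raise the confidence threshold by `speed` and keep the
    pseudo-labeled subset whose feature distribution is closest
    (smallest MMD) to the labeled training data."""
    best_mmd, best_subset = None, []
    threshold = confidence_coefficient
    while threshold <= 1.0:
        subset = [(f, y) for f, y, p in pseudo_items if p > threshold]
        if subset:
            feats = torch.stack([f for f, _ in subset])
            dist = gaussian_mmd(labeled_feats, feats).item()
            if best_mmd is None or dist < best_mmd:
                best_mmd, best_subset = dist, subset
        threshold += speed
    return best_subset  # merged into the training data for the next round
```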
This section presents the experimental setup and compares PDPTL with other classic algorithms on two public datasets.
Experimental datasets: the model is evaluated on the SemEval 2017 shared task 7 dataset (SemEval 2017) and the Pun of the Day dataset (PTD), summarized below.
| Dataset | SemEval, homographic | SemEval, heterographic | PTD |
|---|---|---|---|
| Examples containing puns | 1 607 | 1 271 | 2 423 |
| Examples without puns | 643 | 509 | 2 403 |

Note: "homographic" denotes semantic puns and "heterographic" denotes homophonic puns.
The PTD dataset contains 4 826 examples.

| Dataset | PTD |
|---|---|
| Humorous texts | 2 423 |
| Non-humorous texts | 2 403 |
Evaluation metrics: precision (P), recall (R), and the F1 score are used to compare PDPTL with the base model and the other baseline models, where TP denotes the number of pun-containing examples correctly classified by the model, TP + FP the number of examples the model predicts as containing puns, and TP + FN the number of examples that truly contain puns:
$P = \dfrac{TP}{TP + FP}$ ,  (14)

$R = \dfrac{TP}{TP + FN}$ ,  (15)

$F1 = \dfrac{2 \times P \times R}{P + R}$ .  (16)
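For completeness, a small helper that computes Eqs. (14)-(16) from binary predictions (1 = contains a pun):

```python
def pun_metrics(y_true, y_pred):
    """Precision, recall, and F1 for the positive (pun) class,
    per Eqs. (14)-(16)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(pun_metrics([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))  # (0.667, 0.667, 0.667)
```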
Baseline models: on the SemEval 2017 dataset, PDPTL is compared with JU_CSE_NLP, PunFields, Fermi, Duluth, CRF, Joint, and the base model PCPR; on PTD, it is compared with MCL, HAE, PAL, HUR, WECA, and PCPR.
Experimental details: the model's hyperparameters are weight = 0.84, $T_1$ = 2, $T_2$ = 4, times = 5, num_train_epochs = 7, confidence_coefficient = 0.999 7, and speed = 0.000 1; on the PTD dataset, however, times = 3 and num_train_epochs = 5. Experimental environment: pytorch-pretrained-bert==0.6.1, seqeval==0.0.5, torch==1.0.1.post2, tqdm==4.31.1, nltk==3.4.5; the GPU is a Tesla V100-SXM2, and the experiments run on Google's Colab platform.
Results on SemEval 2017 (%):

| Model | Homographic: precision | Homographic: recall | Homographic: F1 | Heterographic: precision | Heterographic: recall | Heterographic: F1 |
|---|---|---|---|---|---|---|
| JU_CSE_NLP | 72.51 | 90.79 | 68.84 | 73.67 | 94.02 | 71.74 |
| PunFields | 79.93 | 73.37 | 67.82 | 75.80 | 59.40 | 57.47 |
| Fermi | 90.24 | 89.70 | 85.33 | — | — | — |
| Duluth | 78.32 | 87.24 | 82.54 | 73.99 | 86.62 | 68.71 |
| CRF | 72.51 | 90.79 | 68.84 | 73.67 | 94.02 | 71.74 |
| Joint | 91.25 | 93.28 | 92.19 | 86.67 | 93.08 | 89.76 |
| PCPR | 94.18 | 94.21 | 92.79 | 93.35 | 95.04 | 94.19 |
| PDPTL | 96.26 | 96.21 | 96.23 | 95.79 | 96.85 | 96.31 |
Results on Pun of the Day (%):

| Model | Precision | Recall | F1 |
|---|---|---|---|
| MCL | 83.80 | 65.50 | 73.50 |
| HAE | 83.40 | 88.80 | 85.90 |
| PAL | 86.60 | 85.40 | 85.70 |
| HUR | 86.60 | 94.00 | 90.10 |
| WECA | 89.19 | 90.64 | 89.21 |
| PCPR | 98.12 | 99.34 | 98.73 |
| PDPTL | 98.61 | 99.54 | 99.08 |

图2 模型与基础模型在各个数据集的准确率与召回率
Fig. 2 The accuracy and recall rate of the model and the basic model in each data set

图3 模型与基础模型在各个数据集的F1值
Fig. 3 The F1 value of the model and the basic model in each dataset
To address the small sample sizes of existing pun datasets, this work proposes using pseudo-label techniques to assist model training; considering the difference in feature distribution between pseudo-labeled and real data, it combines transfer learning with confidence to build a new pun detection model. Pun detection experiments on the SemEval 2017 shared task 7 and Pun of the Day datasets show that PDPTL pulls the feature distributions of pseudo-labeled and truly labeled data closer together and outperforms existing mainstream pun detection methods.
References
Xu L H, Lin H F, Qi R H, et al. Homophonic advertisement generation based on features fusion[J]. Journal of Chinese Information Processing, 2018, 32(10): 109-117. (in Chinese)
Redfern W D. Guano of the mind: puns in advertising[J]. Language & Communication, 1982, 2(3): 269-276.
Diao Y F, Lin H F, Wu D, et al. WECA: a WordNet-encoded collocation-attention network for homographic pun recognition[C]//Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, 2018: 2507-2516.
Miller T, Hempelmann C F, Gurevych I. SemEval-2017 task 7: detection and interpretation of English puns[C]//Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Vancouver, Canada: Association for Computational Linguistics, 2017: 58-68.
Pedersen T. Puns upon a midnight dreary, lexical semantics for the weak and weary[C]//Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Vancouver, Canada: Association for Computational Linguistics, 2017: 416-420.
Pal A R, Saha D. Word sense disambiguation: a survey[J]. International Journal of Control Theory and Computer Modeling, 2015, 5(3): 1-16.
Oele D, Evang K. Global vs. local context for interpreting and locating homographic English puns with sense embeddings[C]//Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Vancouver, Canada: Association for Computational Linguistics, 2017: 444-448.
Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality[EB/OL]. [2021-06-10]. https://arxiv.org/abs/1310.4546.pdf.
Pennington J, Socher R, Manning C. GloVe: global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, 2014: 1532-1543.
Zhou Y C, et al. The boating store had its best sail ever: pronunciation-attentive contextualized pun recognition[EB/OL]. [2021-06-10]. https://arxiv.org/pdf/2004.14457.pdf.
Xiu Y L, et al. Using supervised and unsupervised methods to detect and locate English puns[C]//Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Vancouver, Canada: Association for Computational Linguistics, 2017: 453-456.
Doogan S, Ghosh A, Chen H, et al. Detection and interpretation of English puns[C]//Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Vancouver, Canada: Association for Computational Linguistics, 2017: 103-108.
Lee D. Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks[C]//ICML Workshop on Challenges in Representation Learning. Atlanta, Georgia: International Conference on Machine Learning, 2013.
Xie Q Z, Luong M T, Hovy E, et al. Self-training with noisy student improves ImageNet classification[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA: IEEE, 2020: 10684-10695.
Zhuang F Z, Qi Z Y, Duan K Y, et al. A comprehensive survey on transfer learning[J]. Proceedings of the IEEE, 2021, 109(1): 43-76.
Pan S J, Yang Q. A survey on transfer learning[J]. IEEE Transactions on Knowledge and Data Engineering, 2010, 22(10): 1345-1359.
Huang J Y, Smola A J, Gretton A, et al. Correcting sample selection bias by unlabeled data[M]//Advances in Neural Information Processing Systems 19. Cambridge, MA, USA: MIT Press, 2007: 601-608.
Sugiyama M, Suzuki T, Nakajima S, et al. Direct importance estimation for covariate shift adaptation[J]. Annals of the Institute of Statistical Mathematics, 2008, 60(4): 699-746.
Day O, Khoshgoftaar T M. A survey on heterogeneous transfer learning[J]. Journal of Big Data, 2017, 4(1): 1-42.
Devlin J, Chang M W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. [2021-06-10]. https://arxiv.org/abs/1810.04805.pdf.
Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate[EB/OL]. [2021-06-10]. https://arxiv.org/abs/1409.0473.pdf.
Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017: 5998-6008.
Gretton A, Borgwardt K, Rasch M, et al. A kernel two-sample test[J]. Journal of Machine Learning Research, 2012, 13: 723-773.
Zou Y Y, Lu W. Joint detection and location of English puns[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minnesota: Association for Computational Linguistics, 2019: 2117-2123.
Pramanick A, Das D. Employing rules to detect and interpret English puns[C]//Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Vancouver, Canada: Association for Computational Linguistics, 2017: 432-435.
Mikhalkova E, Karyakin Y. PunFields at SemEval-2017 task 7: employing Roget's thesaurus in automatic pun recognition and interpretation[C]//Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Vancouver, Canada: Association for Computational Linguistics, 2017.
Indurthi V, Oota S R. Fermi at SemEval-2017 task 7: detection and interpretation of homographic puns in English language[C]//Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Vancouver, Canada: Association for Computational Linguistics, 2017: 457-460.
Yang D Y, Lavie A, Dyer C, et al. Humor recognition and humor anchor extraction[C]//Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics, 2015: 2367-2376.
Mihalcea R, Strapparava C. Making computers laugh: investigations in automatic humor recognition[C]//Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP 2005). Vancouver, British Columbia, Canada: Association for Computational Linguistics, 2005.
Chen L, Lee C M. Predicting audience's laughter using convolutional neural network[EB/OL]. [2021-06-10]. https://arxiv.org/abs/1702.02584.pdf.
Chen P Y, Soo V W. Humor recognition using deep learning[C]//Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). New Orleans, Louisiana: Association for Computational Linguistics, 2018: 113-117.