Extraction of bilingual parallel sentence pairs constrained by consistency of structural features

CLC number: TP391

Funding:

Key Program of the National Natural Science Foundation of China (61732005); National Natural Science Foundation of China (61662041, 61761026, 61866019, 61972186); Key Project of the Yunnan Applied Basic Research Program (2019FA023); Yunnan Reserve Talents Program for Young and Middle-aged Academic and Technical Leaders (2019HB006).

    Abstract:

    Parallel sentence pair extraction is an effective way to address the shortage of parallel corpora in low-resource neural machine translation. Extraction methods based on Siamese neural networks decide whether two sentences are parallel from their cross-lingual semantic similarity, and they achieve strong results on similar language pairs. For English and Southeast Asian language pairs, however, the two languages differ considerably in both language space and sentence length. Considering only cross-lingual semantic similarity and ignoring sentence length features leads the model to misjudge sentence pairs that are related only by semantic inclusion but are not parallel. This paper proposes a bilingual parallel sentence pair extraction method constrained by consistency of structural features, which extends the Siamese-network-based extraction model. First, a multilingual BERT pre-trained language model encodes the two languages into the same semantic space at the embedding layer, reducing the differences between the languages in that space. Second, the length features of the sentences in the two languages are encoded separately and fused with the sentence semantic vectors produced by the Siamese network, which strengthens the representation of parallel sentence pairs in both semantic and structural features and reduces the misjudgment of semantically similar but non-parallel pairs. Experiments on an English-Burmese dataset show that, compared with the baseline model, the proposed method improves precision by 4.64%, recall by 2.52%, and F1 by 3.51%.
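
The fusion idea in the abstract can be illustrated with a toy sketch. This is not the authors' model: the semantic vectors here are hand-written stand-ins for the multilingual BERT and Siamese encoder outputs, the length features (`length_features`) are a hypothetical choice, and `parallel_score` is a hand-set rule standing in for the learned classifier. It only shows why adding a length signal can separate a truly parallel pair from a pair related by semantic inclusion.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def length_features(src_len, tgt_len):
    """Hypothetical structural features: length ratio and relative length gap."""
    longer, shorter = max(src_len, tgt_len), min(src_len, tgt_len)
    if longer == 0:
        return [1.0, 0.0]
    return [shorter / longer, (longer - shorter) / longer]


def fuse(sem_src, sem_tgt, src_len, tgt_len):
    """Concatenate both semantic vectors with the length features."""
    return list(sem_src) + list(sem_tgt) + length_features(src_len, tgt_len)


def parallel_score(sem_src, sem_tgt, src_len, tgt_len):
    """Toy stand-in for a classifier: semantic similarity scaled by length ratio."""
    ratio = length_features(src_len, tgt_len)[0]
    return cosine(sem_src, sem_tgt) * ratio


# Assumed toy inputs: identical semantic vectors, different token counts.
sem_en = [0.6, 0.8]
sem_my = [0.6, 0.8]

score_parallel = parallel_score(sem_en, sem_my, 10, 10)   # similar lengths
score_inclusion = parallel_score(sem_en, sem_my, 10, 25)  # one side much longer
fused = fuse(sem_en, sem_my, 10, 25)

print(score_parallel > score_inclusion)
print(len(fused))
```

With semantic similarity alone, both pairs would receive the same score. The length ratio lowers the score of the pair whose lengths diverge. A raw token-count ratio is a simplification, since the abstract notes that the two languages differ in sentence length even for parallel sentences, so a trained model would learn the length encoding rather than use a fixed ratio.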

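Cite this article

MAO Cunli, GAO Xu, YU Zhengtao, WANG Zhenhan, GAO Shengxiang, MAN Zhibo. Extraction of bilingual parallel sentence pairs constrained by consistency of structural features[J]. Journal of Chongqing University, 2021, 44(1): 46-56.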

History
  • Received: 2020-09-10
  • Published online: 2021-01-08