Extraction of bilingual parallel sentence pairs constrained by consistency of structural features

Authors: 毛存礼 (Mao Cunli), 高旭 (Gao Xu), 余正涛 (Yu Zhengtao), 王振晗 (Wang Zhenhan), 高盛祥 (Gao Shengxiang), 满志博 (Man Zhibo)

CLC number: TP391

Funding: Key Program of the National Natural Science Foundation of China (61732005); National Natural Science Foundation of China (61662041, 61761026, 61866019, 61972186); Key Project of the Yunnan Provincial Applied Basic Research Program (2019FA023); Yunnan Provincial Reserve Talent Project for Young and Middle-aged Academic and Technical Leaders (2019HB006)
Abstract:

Parallel sentence pair extraction is an effective way to alleviate the shortage of parallel corpora for low-resource neural machine translation. The core of Siamese-network-based extraction methods is to judge whether two sentences are parallel by their cross-lingual semantic similarity, which works remarkably well on closely related language pairs. For English-Southeast Asian language pair extraction, however, the two languages differ considerably in both semantic space and sentence length, so considering cross-lingual semantic similarity alone while ignoring sentence length leads the model to misjudge sentence pairs that merely stand in a semantic inclusion relation but are not parallel. This paper proposes a bilingual parallel sentence pair extraction method constrained by the consistency of structural features, an extension of the Siamese-network-based extraction model. First, a multilingual BERT pre-trained language model encodes the two languages into the same semantic space at the embedding layer, narrowing the gap between the languages in that space. Second, the length feature of each sentence is encoded separately and fused with the sentence semantic vector produced by the Siamese encoder, strengthening the representation of parallel pairs in both semantic and structural features and reducing misjudgments of semantically similar but non-parallel pairs. Experiments on an English-Burmese dataset show that, compared with the baseline model, the proposed method improves precision by 4.64%, recall by 2.52%, and F1 by 3.51%.
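The abstract describes the model only at a high level. Below is a minimal PyTorch sketch, not the authors' released implementation, of a length-aware Siamese extractor along the lines the abstract describes: a shared multilingual BERT encoder (assumed here to be the bert-base-multilingual-cased checkpoint) maps both sentences into one semantic space, a small feed-forward layer encodes the sentence length feature, the two are concatenated, and a classifier scores the pair as parallel or not. The mean pooling, layer sizes, and the [u; v; |u − v|] pairing scheme are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code) of a length-aware
# Siamese parallel sentence pair extractor as described in the abstract.
import torch
import torch.nn as nn
from transformers import AutoModel


class SiameseLengthAwareExtractor(nn.Module):
    def __init__(self, mbert_name="bert-base-multilingual-cased",
                 len_dim=16, hidden=256):
        super().__init__()
        # Shared (Siamese) multilingual BERT encoder: both languages are
        # embedded into the same semantic space by the same weights.
        self.encoder = AutoModel.from_pretrained(mbert_name)
        d = self.encoder.config.hidden_size
        # Sentence-length feature encoder: token count -> small dense vector.
        self.len_encoder = nn.Sequential(nn.Linear(1, len_dim), nn.ReLU())
        # Classifier over the fused pair representation [u; v; |u - v|].
        fused = d + len_dim
        self.classifier = nn.Sequential(
            nn.Linear(3 * fused, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def encode(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        mask = attention_mask.unsqueeze(-1).float()
        # Mean-pool token states into one sentence semantic vector.
        sem = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1)
        # Encode the sentence length (token count) and fuse it with the
        # semantic vector to form the structural-feature-aware representation.
        length = attention_mask.sum(1, keepdim=True).float()
        return torch.cat([sem, self.len_encoder(length)], dim=-1)

    def forward(self, src, tgt):
        u = self.encode(**src)  # e.g. English batch: input_ids, attention_mask
        v = self.encode(**tgt)  # e.g. Burmese batch
        pair = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        return self.classifier(pair).squeeze(-1)  # logit: parallel vs. not
```

At inference time, each candidate pair mined from a comparable corpus would be scored with a sigmoid over this logit and kept as parallel when the score exceeds a chosen threshold; the binary cross-entropy training objective and the exact fusion layout are likewise assumptions of this sketch.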

Cite this article:

Mao C L, Gao X, Yu Z T, Wang Z H, Gao S X, Man Z B. Extraction of bilingual parallel sentence pairs constrained by consistency of structural features[J]. Journal of Chongqing University, 2021, 44(1): 46-56. (in Chinese)
History
  • Received: 2020-09-10
  • Published online: 2021-01-08