Extraction of bilingual parallel sentence pairs constrained by consistency of structural features

Authors: 毛存礼 (Mao Cunli), 高旭 (Gao Xu), 余正涛 (Yu Zhengtao), 王振晗 (Wang Zhenhan), 高盛祥 (Gao Shengxiang), 满志博 (Man Zhibo)

CLC number: TP391

Funding: Key Program of the National Natural Science Foundation of China (61732005); National Natural Science Foundation of China (61662041, 61761026, 61866019, 61972186); Key Project of the Yunnan Provincial Applied Basic Research Program (2019FA023); Yunnan Provincial Reserve Talent Project for Young and Middle-aged Academic and Technical Leaders (2019HB006)
Abstract:

Parallel sentence pair extraction is an effective way to alleviate the shortage of parallel corpora for low-resource neural machine translation. The core of Siamese-network-based extraction methods is to judge whether two sentences are parallel by their cross-lingual semantic similarity, which works remarkably well on closely related language pairs. For English-Southeast Asian language pair extraction, however, the two languages differ considerably in both semantic space and sentence length, so considering cross-lingual semantic similarity alone while ignoring sentence length leads the model to misjudge sentence pairs that merely stand in a semantic inclusion relation but are not parallel. This paper proposes a bilingual parallel sentence pair extraction method constrained by the consistency of structural features, an extension of the Siamese-network-based extraction model. First, a multilingual BERT pre-trained language model encodes the two languages into the same semantic space at the embedding layer, narrowing the gap between the languages in that space. Second, the length feature of each sentence is encoded separately and fused with the sentence semantic vector produced by the Siamese encoder, strengthening the representation of parallel pairs in both semantic and structural features and reducing misjudgments of semantically similar but non-parallel pairs. Experiments on an English-Burmese dataset show that, compared with the baseline model, the proposed method improves precision by 4.64%, recall by 2.52%, and F1 by 3.51%.
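The abstract describes the model only at a high level. Below is a minimal PyTorch sketch, not the authors' released implementation, of a length-aware Siamese extractor along the lines the abstract describes: a shared multilingual BERT encoder (assumed here to be the bert-base-multilingual-cased checkpoint) maps both sentences into one semantic space, a small feed-forward layer encodes the sentence length feature, the two are concatenated, and a classifier scores the pair as parallel or not. The mean pooling, layer sizes, and the [u; v; |u − v|] pairing scheme are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code) of a length-aware
# Siamese parallel sentence pair extractor as described in the abstract.
import torch
import torch.nn as nn
from transformers import AutoModel


class SiameseLengthAwareExtractor(nn.Module):
    def __init__(self, mbert_name="bert-base-multilingual-cased",
                 len_dim=16, hidden=256):
        super().__init__()
        # Shared (Siamese) multilingual BERT encoder: both languages are
        # embedded into the same semantic space by the same weights.
        self.encoder = AutoModel.from_pretrained(mbert_name)
        d = self.encoder.config.hidden_size
        # Sentence-length feature encoder: token count -> small dense vector.
        self.len_encoder = nn.Sequential(nn.Linear(1, len_dim), nn.ReLU())
        # Classifier over the fused pair representation [u; v; |u - v|].
        fused = d + len_dim
        self.classifier = nn.Sequential(
            nn.Linear(3 * fused, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def encode(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        mask = attention_mask.unsqueeze(-1).float()
        # Mean-pool token states into one sentence semantic vector.
        sem = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1)
        # Encode the sentence length (token count) and fuse it with the
        # semantic vector to form the structural-feature-aware representation.
        length = attention_mask.sum(1, keepdim=True).float()
        return torch.cat([sem, self.len_encoder(length)], dim=-1)

    def forward(self, src, tgt):
        u = self.encode(**src)  # e.g. English batch: input_ids, attention_mask
        v = self.encode(**tgt)  # e.g. Burmese batch
        pair = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        return self.classifier(pair).squeeze(-1)  # logit: parallel vs. not
```

At inference time, each candidate pair mined from a comparable corpus would be scored with a sigmoid over this logit and kept as parallel when the score exceeds a chosen threshold; the binary cross-entropy training objective and the exact fusion layout are likewise assumptions of this sketch.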

Cite this article:

Mao C L, Gao X, Yu Z T, Wang Z H, Gao S X, Man Z B. Extraction of bilingual parallel sentence pairs constrained by consistency of structural features[J]. Journal of Chongqing University, 2021, 44(1): 46-56. (in Chinese)
History
  • Received: 2020-09-10
  • Published online: 2021-01-08