Abstract: Parallel sentence pair extraction is an effective way to alleviate the data shortage in low-resource neural machine translation. The dominant approach, based on Siamese neural networks, judges whether two sentences are parallel by their cross-lingual semantic similarity, and it has achieved remarkable results on similar language pairs. However, when extracting sentence pairs between English and Southeast Asian languages, the two sides differ greatly not only in language space but also in sentence length. Considering only cross-lingual semantic similarity while ignoring sentence length features causes non-parallel sentence pairs that share only a semantic inclusion relation to be misjudged as parallel. This paper therefore proposes a parallel sentence pair extraction method constrained by the consistency of structural features, which extends the Siamese-network model. First, multilingual BERT is used in the embedding layer to embed the two languages into the same semantic space, reducing the language differences in that space. Second, the length features of the two sentences are embedded separately and fused with the sentence semantic vectors encoded by the Siamese network, which strengthens the representation of parallel sentence pairs in both semantic and structural features and addresses the misjudgment problem. Experiments on English-Burmese data sets show that, compared with the baseline, precision increases by 4.64%, recall by 2.52%, and the F1 value by 3.51%.
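The idea of fusing a length feature with a semantic vector can be illustrated with a minimal sketch. It is not the paper's model: the semantic vectors here are hand-written stand-ins for Siamese encoder outputs, the one-hot length bucket stands in for a learned length embedding, and the function names, `bucket_size`, `num_buckets`, and the use of concatenation followed by cosine similarity are all assumptions made for illustration.

```python
import math


def length_bucket(num_tokens, bucket_size=5, num_buckets=8):
    """Map a token count to a bucket index (hypothetical bucketing scheme)."""
    return min(num_tokens // bucket_size, num_buckets - 1)


def fuse(semantic_vec, num_tokens, bucket_size=5, num_buckets=8):
    """Concatenate a semantic vector with a one-hot length feature.

    The one-hot vector is an assumed stand-in for a learned length embedding.
    """
    bucket = length_bucket(num_tokens, bucket_size, num_buckets)
    length_vec = [1.0 if i == bucket else 0.0 for i in range(num_buckets)]
    return list(semantic_vec) + length_vec


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy case: both sentences receive the same semantic vector, as could happen
# when one sentence merely contains the meaning of the other.
sem = [1.0, 0.0]

# Similar lengths (6 vs. 7 tokens) fall into the same bucket: score near 1.0.
similar_length_score = cosine(fuse(sem, 6), fuse(sem, 7))

# Very different lengths (6 vs. 30 tokens) fall into different buckets:
# the fused score drops to about 0.5 although the semantic vectors match.
different_length_score = cosine(fuse(sem, 6), fuse(sem, 30))
```

In this toy setting, a pair whose semantic vectors coincide is still scored lower when the sentence lengths disagree, which is the kind of structural constraint the abstract describes. A trained model would learn both the length embedding and the fusion rather than rely on fixed buckets.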