Journal: ACM Transactions on Asian Language Information Processing

Leveraging Additional Resources for Improving Statistical Machine Translation on Asian Low-Resource Languages




Phrase-based machine translation (MT) systems require large bilingual corpora for training. Such corpora are unavailable for most of the world's language pairs, which creates a bottleneck for MT development. The Asian language pairs studied here, Japanese, Indonesian, and Malay paired with Vietnamese, are no exception: no large bilingual corpora exist for these low-resource pairs. Furthermore, although these languages are widely used, there is no prior work on MT for them, which further hinders development. In this article, we conduct an empirical study of leveraging additional resources to improve MT for these low-resource pairs, translating from Japanese, Indonesian, and Malay into Vietnamese. Our approach combines two strategies: building bilingual corpora from comparable data, and phrase pivot translation over existing bilingual corpora that pair each language with English. Bilingual corpora were built from Wikipedia bilingual titles to augment the bilingual data available for the low-resource languages. We also introduce a model that combines these additional resources to improve MT on the Asian low-resource languages. Experimental results show improvements of +2 to +7 BLEU points. This work contributes to the development of MT for low-resource languages and opens a promising direction for MT on Asian language pairs.
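
The abstract does not give the exact formulation of phrase pivot translation, so the following is only a minimal sketch of one common variant (triangulation), in which a source-to-target phrase table is induced by marginalizing over shared English pivot phrases: p(t|s) = sum over e of p(e|s) * p(t|e). The function name `triangulate`, the nested-dictionary table layout, and all probabilities below are hypothetical and chosen purely for illustration; they are not taken from the article.

```python
from collections import defaultdict


def triangulate(src_pivot, pivot_tgt):
    """Induce a source->target phrase table through a pivot language.

    Both inputs map a phrase to {translation: probability}.
    Assumed scoring: p(t|s) = sum_e p(e|s) * p(t|e).
    """
    src_tgt = defaultdict(lambda: defaultdict(float))
    for s, pivots in src_pivot.items():
        for e, p_e_given_s in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for t, p_t_given_e in pivot_tgt.get(e, {}).items():
                src_tgt[s][t] += p_e_given_s * p_t_given_e
    return {s: dict(ts) for s, ts in src_tgt.items()}


# Toy Japanese->English and English->Vietnamese tables (made-up numbers).
ja_en = {"猫": {"cat": 0.9, "kitty": 0.1}}
en_vi = {"cat": {"con mèo": 0.8, "mèo": 0.2}, "kitty": {"mèo con": 1.0}}

ja_vi = triangulate(ja_en, en_vi)
for target, prob in sorted(ja_vi["猫"].items(), key=lambda kv: -kv[1]):
    print(f"{target}\t{prob:.2f}")
```

In this sketch, a source phrase receives a target translation only when both tables share at least one English pivot phrase, which is one reason a pivoted table might be combined with other resources such as directly extracted bilingual data.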


