
Cleaning StackOverflow for Machine Translation


Abstract

Generating source code API sequences from an English query using Machine Translation (MT) has attracted considerable interest in recent years. Any MT model must be trained on a parallel corpus. In this paper we clean StackOverflow, one of the most popular online discussion forums for programmers, to generate a parallel English-Code corpus from Android posts. We contrast three data cleaning approaches: standard NLP, title only, and software task extraction. We evaluate the quality of each corpus for MT. To indicate how useful each corpus will be for machine translation, we provide researchers with measurements of the corpus size, the percentage of unique tokens, and the per-word maximum likelihood alignment entropy. We have used these corpus cleaning approaches to translate between English and Code [22, 23], to compare existing SMT approaches from word mapping to neural networks [24], and to re-examine the "natural software" hypothesis [29]. After cleaning and aligning the data, we create a simple maximum likelihood MT model to show that English words in the corpus map to a small number of specific code elements. This model provides a basis for the success of using StackOverflow for search and other tasks in the software engineering literature, and it paves the way for MT. Our scripts and corpora are publicly available on GitHub [1] as well as at https://search.datacite.org/works/10.5281/zenodo.2558551.
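The two corpus indicators named in the abstract, the percentage of unique tokens and the per-word alignment entropy, can be illustrated with a minimal Python sketch. This is not the authors' script: the abstract does not describe the alignment procedure, so the sketch assumes plain co-occurrence counting within each English-Code pair as a stand-in for alignment. The toy `pairs` corpus and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def unique_token_percentage(sentences):
    """Percentage of distinct tokens among all tokens in a tokenized corpus."""
    tokens = [tok for sent in sentences for tok in sent]
    if not tokens:
        return 0.0
    return 100.0 * len(set(tokens)) / len(tokens)


def cooccurrence_counts(pairs):
    """Count how often each English word appears alongside each code element.

    ASSUMPTION: co-occurrence within a pair stands in for a real word
    alignment, which the abstract does not specify.
    """
    counts = defaultdict(Counter)
    for english, code in pairs:
        for word in set(english):
            for element in set(code):
                counts[word][element] += 1
    return counts


def per_word_entropy(counts):
    """Entropy (in bits) of the maximum likelihood distribution P(element | word)."""
    entropies = {}
    for word, elements in counts.items():
        total = sum(elements.values())
        entropies[word] = sum(
            (c / total) * math.log2(total / c) for c in elements.values()
        )
    return entropies


def most_likely_element(counts, word):
    """Maximum likelihood 'translation': the element seen most often with word."""
    return counts[word].most_common(1)[0][0]


# Hypothetical toy corpus of (English tokens, code elements) pairs.
pairs = [
    (["start", "activity"], ["Intent", "startActivity"]),
    (["start", "service"], ["Intent", "startService"]),
    (["show", "toast"], ["Toast.makeText", "Toast.show"]),
]

counts = cooccurrence_counts(pairs)
entropies = per_word_entropy(counts)
english_unique = unique_token_percentage([english for english, _ in pairs])
```

A low entropy for a word means its probability mass is concentrated on a few code elements, which is the property the abstract reports for the cleaned corpora. In this toy corpus, "start" co-occurs with `Intent` in both of its pairs, so `most_likely_element(counts, "start")` selects `Intent`.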
