Published in: IEEE/ACM International Conference on Mining Software Repositories

Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow

Abstract

Data-driven models have shown great promise for tasks such as code synthesis from natural language, code retrieval, and code summarization. However, creating these models requires parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source for such a data set: the questions are diverse, and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited in both their coverage and the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features that consider the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model, based on neural networks, to capture the correlation between NL and code. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly improves coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data.
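
The pipeline described in the abstract (structural features plus a correspondence score, fed into a classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not list the actual features or name the classifier, so the `Candidate` fields, the three structural features, the plain logistic-regression trainer, and the toy data below are all assumptions made for illustration. In particular, `correspondence` is a placeholder float standing in for the score that the paper obtains from a neural probabilistic model.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One mined (intent, snippet) pair. All fields are hypothetical."""
    intent: str               # NL intent, e.g. the question title
    snippet: str              # candidate code taken from an answer
    in_accepted_answer: bool  # assumed structural signal
    is_full_block: bool       # snippet is a whole code block, not a sub-span
    correspondence: float     # placeholder for the neural model's score, in [0, 1]


def features(c: Candidate) -> List[float]:
    """Map a candidate to a feature vector (bias, structure, correspondence)."""
    non_empty = [ln for ln in c.snippet.splitlines() if ln.strip()]
    return [
        1.0,                          # bias term
        float(c.in_accepted_answer),  # hand-crafted structural feature
        float(c.is_full_block),       # hand-crafted structural feature
        float(len(non_empty) == 1),   # hand-crafted: single-line snippet
        c.correspondence,             # correspondence feature
    ]


def predict(w: List[float], c: Candidate) -> float:
    """Estimated probability that the pair is a good NL-code alignment."""
    z = sum(wi * xi for wi, xi in zip(w, features(c)))
    return 1.0 / (1.0 + math.exp(-z))


def train(data: List[Candidate], labels: List[int],
          epochs: int = 1000, lr: float = 0.5) -> List[float]:
    """Fit logistic-regression weights with plain stochastic gradient descent."""
    w = [0.0] * len(features(data[0]))
    for _ in range(epochs):
        for c, y in zip(data, labels):
            err = predict(w, c) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, features(c))]
    return w


# Toy labeled examples (invented for this sketch, not taken from the paper).
good_1 = Candidate("sort a list in reverse", "sorted(xs, reverse=True)",
                   True, True, 0.9)
good_2 = Candidate("read a file into a string", "text = open(path).read()",
                   False, False, 0.8)
bad_1 = Candidate("sort a list in reverse", "import os\nprint(os.getcwd())",
                  True, True, 0.1)
bad_2 = Candidate("read a file into a string", "x = 1",
                  False, False, 0.2)

weights = train([good_1, good_2, bad_1, bad_2], [1, 1, 0, 0])

# Rank two unseen candidates that differ only in the correspondence score.
strong = Candidate("reverse a string", "s[::-1]", False, True, 0.95)
weak = Candidate("reverse a string", "s.upper()", False, True, 0.05)
ranked = sorted([weak, strong], key=lambda c: predict(weights, c), reverse=True)
```

Keeping the correspondence score as one feature among several mirrors the abstract's description of two feature sets feeding a single classifier. A linear model is used here only because it is easy to train on a handful of labeled examples.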