Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow

Abstract

For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models requires parallel data between natural language (NL) and code with fine-grained alignments. StackOverflow (SO) is a promising source to create such a data set: the questions are diverse and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited both in their coverage and the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features considering the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the correlation between NL and code using neural networks. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly expands coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data.
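The pipeline described in the abstract (structural features plus a correspondence score, fed into a classifier) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The feature set, the function names, and the toy training pairs are all invented for illustration, and a simple token-overlap ratio stands in for the neural correspondence model, which the abstract does not specify in detail.

```python
import ast
import math
import re


def tokenize(text):
    """Lowercased alphanumeric tokens; dots and underscores act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def parses_as_python(snippet):
    """Structural cue: does the candidate snippet parse as Python source?"""
    try:
        ast.parse(snippet)
        return True
    except (SyntaxError, ValueError):
        return False


def extract_features(intent, snippet, is_accepted_answer):
    """Hypothetical feature vector for one (NL intent, code snippet) candidate."""
    lines = [ln for ln in snippet.splitlines() if ln.strip()]
    intent_tokens = tokenize(intent)
    # Stand-in for the learned correspondence feature (assumption, not the paper's model).
    overlap = len(intent_tokens & tokenize(snippet)) / max(len(intent_tokens), 1)
    return [
        1.0,                                         # bias term
        1.0 if parses_as_python(snippet) else 0.0,   # hand-crafted: parseable
        1.0 if is_accepted_answer else 0.0,          # hand-crafted: accepted answer
        1.0 if len(lines) <= 3 else 0.0,             # hand-crafted: short snippet
        overlap,                                     # correspondence stand-in
    ]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(weights, features):
    """Estimated probability that the pair is a valid NL-code alignment."""
    return sigmoid(sum(w * v for w, v in zip(weights, features)))


def train(examples, labels, epochs=500, lr=0.5):
    """Logistic regression fitted with plain stochastic gradient ascent."""
    weights = [0.0] * len(examples[0])
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            error = y - score(weights, x)
            weights = [w + lr * error * v for w, v in zip(weights, x)]
    return weights


# Invented toy annotations: (intent, candidate snippet, from accepted answer, label).
TOY_PAIRS = [
    ("How to check if a file exists", "os.path.exists(path)", True, 1),
    ("How to check if a file exists",
     'Traceback (most recent call last):\n  File "x.py", line 1', False, 0),
    ("Sort a list of dicts by key", "sorted(rows, key=lambda r: r['key'])", True, 1),
    ("Sort a list of dicts by key", "$ pip install pandas", False, 0),
]

X = [extract_features(intent, code, accepted) for intent, code, accepted, _ in TOY_PAIRS]
y = [label for _, _, _, label in TOY_PAIRS]
weights = train(X, y)
```

Calling `score(weights, extract_features(intent, snippet, accepted))` then gives a probability that can be thresholded to decide which mined pairs to keep. With only four toy pairs the learned weights mean little; the point is the shape of the pipeline, not the numbers.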

Bibliographic details

Source
IEEE/ACM International Conference on Mining Software Repositories | 2018 | xxxiii, 596 p. | 11 pages
Authors
Pengcheng Yin; Bowen Deng; Edgar Chen; Bogdan Vasilescu; Graham Neubig
Original format: PDF
Chinese Library Classification: TP311.5-532
Keywords
Data mining; Python; Feature extraction; Java; Natural languages; Training
