Semantic Source Code Models Using Identifier Embeddings

Abstract

The emergence of online open source repositories in recent years has led to an explosion in the volume of openly available source code, coupled with metadata that relate to a variety of software development activities. As a result, and in line with recent advances in machine learning research, software maintenance activities are shifting from symbolic formal methods to data-driven methods. In this context, the rich semantics hidden in source code identifiers provide opportunities for building semantic representations of code that can assist code search and reuse tasks. To this end, we deliver distributed code representations, in the form of pretrained vector space models, for six popular programming languages: Java, Python, PHP, C, C++, and C#. The models are produced using fastText, a state-of-the-art library for learning word representations. Each model is trained on data from a single programming language; the code mined for producing all models amounts to over 13,000 repositories. We indicate dissimilarities between natural language and source code, as well as variations in coding conventions among the programming languages we processed. We describe how these heterogeneities guided our data preprocessing decisions and the selection of training parameters for the released models. Finally, we propose potential applications of the models and discuss their limitations.

Bibliographic record

Source: IEEE/ACM International Conference on Mining Software Repositories, 2019, xxxiv, 606 p. (5 pages)
Authors: Vasiliki Efstathiou; Diomidis Spinellis
Original format: PDF
Chinese Library Classification: security and confidentiality
Keywords: C language; C++ language; data mining; Java; learning (artificial intelligence); natural languages; programming languages; Python; software maintenance; source code (software); text analysis
