Conference: IEEE/ACM International Conference on Mining Software Repositories
The Software Heritage Graph Dataset: Public Software Development Under One Roof

Abstract

Software Heritage is the largest existing public archive of software source code and accompanying development history: it currently spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects. This paper introduces the Software Heritage graph dataset: a fully-deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset's contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, providing timestamps about when and where all archived source code artifacts have been observed in the wild. The Software Heritage graph dataset is available in multiple formats, including downloadable CSV dumps and Apache Parquet files for local use, as well as a public instance on Amazon Athena interactive query service for ready-to-use powerful analytical processing. Source code file contents are cross-referenced at the graph leaves, and can be retrieved through individual requests using the Software Heritage archive API.
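
The "fully-deduplicated Merkle DAG" idea in the abstract can be illustrated with a minimal sketch. It assumes a toy in-memory store in which each node's identifier is a SHA-256 hash of its type and payload. The class `MerkleStore`, its method names, and the payload layout are hypothetical and chosen for illustration only; they are not the Software Heritage data model or its identifier scheme.

```python
import hashlib


class MerkleStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        # Maps node identifier -> (kind, payload). Identical nodes share one entry.
        self.nodes = {}

    def _put(self, kind: str, payload: bytes) -> str:
        # The identifier depends only on the node's kind and payload, so
        # storing the same node twice yields the same identifier.
        node_id = hashlib.sha256(kind.encode() + b"\0" + payload).hexdigest()
        self.nodes.setdefault(node_id, (kind, payload))
        return node_id

    def add_content(self, data: bytes) -> str:
        """Store a file content (a leaf of the graph)."""
        return self._put("content", data)

    def add_directory(self, entries: dict) -> str:
        """Store a directory given a mapping of entry name -> child identifier."""
        # Sort entries so the identifier does not depend on insertion order.
        lines = [f"{name} {child}" for name, child in sorted(entries.items())]
        return self._put("directory", "\n".join(lines).encode())

    def add_revision(self, directory_id: str, parents: list, message: str) -> str:
        """Store a commit pointing at a root directory and its parent commits."""
        lines = [f"tree {directory_id}"]
        lines += [f"parent {p}" for p in parents]
        lines += ["", message]
        return self._put("revision", "\n".join(lines).encode())


store = MerkleStore()

# Two "projects" that contain the same file in the same directory layout.
dir_a = store.add_directory({"README": store.add_content(b"hello\n")})
dir_b = store.add_directory({"README": store.add_content(b"hello\n")})

# Two distinct commits over the shared directory.
rev1 = store.add_revision(dir_a, [], "initial import")
rev2 = store.add_revision(dir_b, [rev1], "no-op change")

print(dir_a == dir_b)    # identical directories collapse to one node
print(len(store.nodes))  # one content, one directory, two revisions
```

Because identifiers are derived from content, the shared file and directory are stored once even though they were added twice, while the two commits remain distinct nodes that both point at the same directory.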