International Conference on Very Large Data Bases

Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff

Abstract

The relative ease of collaborative data science and analysis has led to a proliferation of many thousands or millions of versions of the same datasets in many scientific and commercial domains, acquired or constructed at various stages of data analysis across many users, and often over long periods of time. Managing, storing, and recreating these dataset versions is a non-trivial task. The fundamental challenge here is the storage-recreation trade-off: the more storage we use, the faster it is to recreate or retrieve versions, while the less storage we use, the slower it is to recreate or retrieve versions. Despite the fundamental nature of this problem, there has been surprisingly little work on it. In this paper, we study this trade-off in a principled manner: we formulate six problems under various settings, trading off these quantities in various ways; demonstrate that most of the problems are intractable; and propose a suite of inexpensive heuristics, drawing on techniques from the delay-constrained scheduling and spanning tree literature, to solve these problems. We have built a prototype version management system that aims to serve as a foundation for our DataHub system for facilitating collaborative data science. We demonstrate, via extensive experiments, that our proposed heuristics provide efficient solutions in practical dataset versioning scenarios.
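The two extremes of the storage-recreation trade-off can be illustrated with a small sketch. It is not the paper's formulation or any of its heuristics; it assumes a hypothetical version graph in which a dummy root node 0 connects to each version with a weight equal to that version's full size, deltas between versions are symmetric, and a single weight stands for both storage cost and recreation cost. Under those assumptions, a minimum spanning tree (Prim's algorithm) gives the smallest total storage, while a shortest-path tree (Dijkstra's algorithm) gives the smallest recreation cost for every version. The names `build_tree` and `plan_costs` and all numbers are made up for illustration.

```python
import heapq


def build_tree(n, edges, mode):
    """Return {node: parent} for a spanning tree rooted at dummy node 0.

    mode="storage":    Prim's algorithm (minimum total edge weight).
    mode="recreation": Dijkstra's algorithm (minimum root-to-node path).
    """
    adj = {v: [] for v in range(n + 1)}
    for u, v, w in edges:  # assumption: deltas are symmetric (undirected)
        adj[u].append((v, w))
        adj[v].append((u, w))
    parent, done = {}, set()
    heap = [(0, 0, None)]  # (key, node, parent)
    while heap:
        key, v, p = heapq.heappop(heap)
        if v in done:
            continue
        done.add(v)
        parent[v] = p
        for nxt, w in adj[v]:
            if nxt not in done:
                # Prim keys on the edge alone; Dijkstra keys on the whole path.
                new_key = w if mode == "storage" else key + w
                heapq.heappush(heap, (new_key, nxt, v))
    return parent


def plan_costs(parent, edges):
    """Return (total storage, {version: recreation cost}) for a tree."""
    weight = {}
    for u, v, w in edges:
        weight[(u, v)] = weight[(v, u)] = w
    storage = sum(weight[(p, v)] for v, p in parent.items() if p is not None)
    recreation = {}
    for v in parent:
        if v == 0:
            continue
        total, node = 0, v
        while parent[node] is not None:  # walk the delta chain up to the root
            total += weight[(parent[node], node)]
            node = parent[node]
        recreation[v] = total
    return storage, recreation


# Hypothetical example: three versions of full size 100, chained by deltas of 10.
edges = [(0, 1, 100), (0, 2, 100), (0, 3, 100), (1, 2, 10), (2, 3, 10)]
min_storage = plan_costs(build_tree(3, edges, "storage"), edges)
min_recreation = plan_costs(build_tree(3, edges, "recreation"), edges)
print("minimum-storage plan:   ", min_storage)
print("minimum-recreation plan:", min_recreation)
```

In this toy instance the minimum-storage plan keeps one full version plus two deltas, so later versions take longer to rebuild, while the minimum-recreation plan materializes every version in full. The problems studied in the paper sit between these two extremes, for example by bounding one quantity while minimizing the other.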
