On-demand big data integration: A hybrid ETL approach for reproducible scientific research

Kathiravelu Pradeeban; Sharma Ashish; Galhardas Helena; Van Roy Peter; Veiga Luis


Abstract

Scientific research requires accessing, analyzing, and sharing data that is distributed across various heterogeneous data sources at the scale of the Internet. An eager extract, transform, and load (ETL) process constructs an integrated data repository as its first step, integrating and loading the data in its entirety from the data sources. Bootstrapping this process is inefficient for scientific research that requires access to data from very large and typically numerous distributed data sources. A lazy ETL process loads only the metadata, but still does so eagerly. Lazy ETL therefore bootstraps faster. However, queries on the integrated data repository of eager ETL run faster, because the entire data set is available beforehand. In this paper, we propose a novel ETL approach for scientific data integration: a hybrid of the eager and lazy ETL approaches, applied to both data and metadata. This way, hybrid ETL supports incremental integration and loading of metadata and data from the data sources. We incorporate a human-in-the-loop approach to enhance the hybrid ETL, with selective data integration driven by user queries and sharing of integrated data between users. We implement our hybrid ETL approach in a prototype platform, bidos, and evaluate it in the context of data sharing for medical research. bidos outperforms both the eager ETL and lazy ETL approaches for scientific research data integration and sharing, through its selective loading of data and metadata, while storing the integrated data in a scalable integrated data repository.
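
The idea of query-driven, incremental loading described in the abstract can be illustrated with a small sketch. This is not the bidos implementation; the in-memory `SOURCES` dictionary, the `HybridRepository` class, and its `fetches` counter are all hypothetical names chosen for illustration. Under these assumptions, metadata and records are pulled from a source only when a query first touches that source, and the loaded records are kept in a shared repository so later queries (for example, from another user) reuse them.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory "data sources": source name -> {record id -> record}.
SOURCES: Dict[str, Dict[str, dict]] = {
    "clinic_a": {"p1": {"age": 61, "modality": "MRI"},
                 "p2": {"age": 47, "modality": "CT"}},
    "clinic_b": {"p3": {"age": 70, "modality": "MRI"}},
}


class HybridRepository:
    """Toy integrated repository that loads metadata and data on demand."""

    def __init__(self, sources: Dict[str, Dict[str, dict]]) -> None:
        self.sources = sources
        self.metadata: Dict[str, List[str]] = {}      # source -> record ids
        self.data: Dict[Tuple[str, str], dict] = {}   # (source, id) -> record
        self.fetches = 0                              # records pulled so far

    def _load_metadata(self, source: str) -> List[str]:
        # Metadata for a source is loaded the first time a query names it.
        if source not in self.metadata:
            self.metadata[source] = sorted(self.sources[source])
        return self.metadata[source]

    def _load_record(self, source: str, rid: str) -> dict:
        # A record is copied into the repository once, then reused.
        key = (source, rid)
        if key not in self.data:
            self.data[key] = dict(self.sources[source][rid])
            self.fetches += 1
        return self.data[key]

    def query(self, source_names: List[str],
              predicate: Callable[[dict], bool]) -> List[str]:
        # Integrate only the sources this query touches, then filter.
        results = []
        for source in source_names:
            for rid in self._load_metadata(source):
                if predicate(self._load_record(source, rid)):
                    results.append(rid)
        return results


repo = HybridRepository(SOURCES)
mri_ids = repo.query(["clinic_a"], lambda r: r["modality"] == "MRI")
young_ids = repo.query(["clinic_a"], lambda r: r["age"] < 50)  # reuses loaded data
```

In this sketch, selectivity works at the granularity of whole sources: `clinic_b` is never contacted until a query names it. An eager variant would instead copy every record from every source in `__init__`, trading slower bootstrapping for faster first queries.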

Bibliographic record

Source
Distributed and Parallel Databases | 2019, Issue 2 | pp. 273-295 | 23 pages
Authors
Kathiravelu Pradeeban; Sharma Ashish; Galhardas Helena; Van Roy Peter; Veiga Luis
Author affiliations

Emory Univ Sch Med Atlanta GA 30322 USA|Univ Lisbon Inst Super Tecn INESC ID Lisboa Lisbon Portugal|Catholic Univ Louvain Louvain La Neuve Belgium;

Emory Univ Sch Med Atlanta GA 30322 USA;

Univ Lisbon Inst Super Tecn INESC ID Lisboa Lisbon Portugal;

Catholic Univ Louvain Louvain La Neuve Belgium;

Univ Lisbon Inst Super Tecn INESC ID Lisboa Lisbon Portugal;

Original format: PDF
Language: English
Keywords
Data integration; Scientific research; ETL (extract, transform, and load); Big data

