∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data

Abstract

Near duplicate detection algorithms have been proposed and implemented to detect and eliminate duplicate entries from massive datasets. Because data representation (such as measurement units) differs across data sources, potential duplicates may not be textually identical even though they refer to the same real-world entity. As data warehouses typically contain data from several heterogeneous data sources, detecting near duplicates in a data warehouse requires considerable memory and processing power. Traditionally, near duplicate detection algorithms are sequential and operate on a single computer. While parallel and distributed frameworks have recently been exploited to scale the existing algorithms to larger datasets, these efforts often focus on distributing a few chosen algorithms using frameworks such as MapReduce. A common distribution strategy and framework to parallelize the execution of the existing similarity join algorithms is still lacking. In-Memory Data Grids (IMDG) offer distributed storage and execution, giving the illusion of a single large computer over multiple computing nodes in a cluster. This paper presents the research, design, and implementation of ∂u∂u, a distributed near duplicate detection framework, with preliminary evaluations measuring its performance and achieved speedup. ∂u∂u leverages the distributed shared memory and execution model provided by IMDG to execute existing near duplicate detection algorithms in a parallel and multi-tenanted environment. As a unified near duplicate detection framework for big data, ∂u∂u efficiently distributes the algorithms over utility computers in research labs and private clouds and grids.
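The general idea in the abstract, comparing records that are not textually identical and spreading the comparisons across several workers, can be illustrated with a minimal sketch. This is not the ∂u∂u implementation: the Jaccard token-set similarity, the 0.5 threshold, the sample records, and all function names are assumptions chosen for illustration, and a local thread pool merely stands in for the nodes of an in-memory data grid.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def tokens(record: str) -> frozenset:
    """Lower-case a record and split it into a set of word tokens."""
    return frozenset(record.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def compare_chunk(chunk, token_sets, threshold):
    """Return the index pairs in one chunk whose similarity reaches the threshold."""
    return [(i, j) for i, j in chunk
            if jaccard(token_sets[i], token_sets[j]) >= threshold]


def near_duplicates(records, threshold=0.5, workers=4):
    """Find near-duplicate pairs, splitting the pair comparisons across workers."""
    token_sets = [tokens(r) for r in records]
    pairs = list(combinations(range(len(records)), 2))
    # Round-robin partitioning: each worker stands in for one cluster node
    # (an assumption of this sketch, not the paper's distribution strategy).
    chunks = [pairs[w::workers] for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda c: compare_chunk(c, token_sets, threshold), chunks)
        return sorted(pair for chunk_result in results for pair in chunk_result)


# Hypothetical records: the first two describe the same item in different units.
records = [
    "Acme Steel Pipe 2 m length",
    "acme steel pipe 200 cm length",
    "Globex Copper Wire 50 m",
]
print(near_duplicates(records))  # [(0, 1)]
```

The first two records share four of eight distinct tokens, so their similarity is exactly 0.5 and they are reported as a candidate pair, while the third record shares at most one token with either and is not. A real deployment would replace the all-pairs comparison with a similarity join algorithm and the thread pool with a distributed executor.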

Bibliographic details

Source: International Conference on Cooperative Information Systems, 2015, 20 pages
Authors: Pradeeban Kathiravelu; Helena Galhardas; Luis Veiga
Original format: PDF
Chinese Library Classification: TP393-53
Keywords: Near Duplicate Detection (NDD); In-Memory Data Grid (IMDG); MapReduce
