Conference: International Conference on Cooperative Information Systems
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data

Abstract

Near duplicate detection algorithms have been proposed and implemented to detect and eliminate duplicate entries from massive datasets. Because data representation (such as measurement units) differs across data sources, potential duplicates may not be textually identical even though they refer to the same real-world entity. As data warehouses typically contain data from several heterogeneous sources, detecting near duplicates in a data warehouse requires considerable memory and processing power. Traditionally, near duplicate detection algorithms are sequential and operate on a single computer. While parallel and distributed frameworks have recently been used to scale existing algorithms to larger datasets, such efforts often focus on distributing a few chosen algorithms using frameworks such as MapReduce. A common distribution strategy and framework for parallelizing the execution of existing similarity join algorithms is still lacking. In-Memory Data Grids (IMDG) offer distributed storage and execution, giving the illusion of a single large computer over multiple computing nodes in a cluster. This paper presents the research, design, and implementation of ∂u∂u, a distributed near duplicate detection framework, with preliminary evaluations measuring its performance and achieved speedup. ∂u∂u leverages the distributed shared memory and execution model provided by IMDG to execute existing near duplicate detection algorithms in a parallel, multi-tenanted environment. As a unified near duplicate detection framework for big data, ∂u∂u efficiently distributes the algorithms over utility computers in research labs and over private clouds and grids.
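The general idea of splitting a near duplicate search across workers can be illustrated with a minimal sketch. It is not the framework's implementation: token-set Jaccard similarity stands in for whichever similarity join algorithm is plugged in, and a local thread pool stands in for the nodes of an in-memory data grid. All function names, the threshold, and the partitioning scheme are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def tokens(record: str) -> frozenset:
    """Lowercase a record and split it into a set of whitespace tokens."""
    return frozenset(record.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0  # two empty records are treated as identical
    return len(a & b) / len(a | b)


def pairs_for_partition(part, n_parts, records, threshold):
    """Check only the candidate pairs assigned to one partition.

    Each partition plays the role of one (hypothetical) grid node.
    Pairs are dealt out round-robin by their enumeration index.
    """
    sets = [tokens(r) for r in records]
    found = []
    for k, (i, j) in enumerate(combinations(range(len(records)), 2)):
        if k % n_parts == part and jaccard(sets[i], sets[j]) >= threshold:
            found.append((i, j))
    return found


def near_duplicates(records, threshold=0.6, n_parts=4):
    """Run every partition in a local thread pool and merge the results."""
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        futures = [
            pool.submit(pairs_for_partition, p, n_parts, records, threshold)
            for p in range(n_parts)
        ]
        results = [f.result() for f in futures]
    return sorted(pair for part in results for pair in part)
```

Calling `near_duplicates` on a list of strings returns the index pairs whose similarity meets the threshold. In an actual data grid the records would live in distributed memory and each partition would run on the node holding its share of the data; the thread pool here only mimics that division of work.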