Proceedings of the 2011 ACM/IEEE Joint Conference on Digital Libraries

Eliminating the Redundancy in Blocking-based Entity Resolution Methods

Abstract

Entity resolution is the task of identifying entities that refer to the same real-world object. It has important applications in the context of digital libraries, such as citation matching and author disambiguation. Blocking is an established methodology for efficiently addressing this problem: it clusters similar entities together and compares only the entities inside each cluster. To deal effectively with today's large, noisy and heterogeneous data collections, novel blocking methods that rely on redundancy have been introduced: they associate each entity with multiple blocks in order to increase recall, which also increases the computational cost. In this paper, we introduce novel techniques that remove the superfluous comparisons from any redundancy-based blocking method. They improve the time-efficiency of the latter without any impact on the end result. We present the optimal solution to this problem, which discards all redundant comparisons at the cost of quadratic space complexity. For applications with space limitations, we also present an alternative, lightweight solution that operates at the abstract level of blocks in order to discard a significant part of the redundant comparisons. We evaluate our techniques on two large, real-world data sets and verify the significant improvements they bring when integrated into existing blocking methods.
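To illustrate what a redundant comparison is, here is a minimal Python sketch. It is not the paper's algorithm: it is a naive baseline that assumes blocks are given as lists of entity identifiers, and the function name `distinct_comparisons` and the sample identifiers are hypothetical. It remembers every pair already compared, so a pair that co-occurs in several blocks is emitted only once. The set of remembered pairs can grow quadratically in the number of entities, which is the kind of space cost the abstract mentions for discarding all redundant comparisons.

```python
from itertools import combinations
from typing import Iterable, Iterator, Set, Tuple


def distinct_comparisons(
    blocks: Iterable[Iterable[str]],
) -> Iterator[Tuple[str, str]]:
    """Yield each pair of co-occurring entities exactly once.

    Assumption: a block is simply a collection of entity identifiers.
    The `seen` set holds every pair emitted so far, so memory use is
    quadratic in the number of entities in the worst case.
    """
    seen: Set[Tuple[str, str]] = set()
    for block in blocks:
        # Sort the identifiers so that (a, b) and (b, a) map to the same key.
        for pair in combinations(sorted(set(block)), 2):
            if pair not in seen:
                seen.add(pair)
                yield pair


# Hypothetical blocks in which entities e2 and e3 share two blocks,
# as do e1 and e3.
blocks = [["e1", "e2", "e3"], ["e2", "e3", "e4"], ["e1", "e3"]]

# Comparisons a blocking method would run without any redundancy check.
naive_count = sum(len(list(combinations(set(b), 2))) for b in blocks)
pairs = list(distinct_comparisons(blocks))
print(naive_count, len(pairs))  # prints "7 5"
```

In this toy input, two of the seven block-level comparisons repeat a pair that was already compared, so only five are carried out. The set of matches found is unchanged, since every pair is still compared once.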
