2011 IEEE Conference on Open Systems

A threshold-based similarity measure for duplicate detection

Abstract

Data integration is essential for extracting useful information and recognizing patterns in large volumes of data stored across databases with different formats. However, data integration may lead to duplication: because the data are available in different formats, some records may refer to the same entity. Duplicate detection, also called record linkage, is a technique for detecting and matching the duplicate records generated during data integration. Most approaches have concentrated on string similarity measures for comparing records, but these fail to identify records that share semantic information. This study therefore proposes a threshold-based method that takes both string and semantic similarity measures into account when comparing record pairs. The method is evaluated on a real-world dataset, Restaurant, and its effectiveness is measured with several standard evaluation metrics. The experimental results indicate that the proposed similarity method, which combines string and semantic similarity measures, outperforms the individual similarity measures, reaching an F-measure of 99.1% on the Restaurant dataset. Semantic similarity should therefore be considered alongside string similarity in order to detect duplicate records more effectively.
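The general idea of combining a string measure with a semantic measure under a threshold can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' method: the abstract does not name the specific measures, the combination rule, or the threshold value. Here the string measure is Python's `difflib.SequenceMatcher` ratio, the semantic measure is a Jaccard overlap over a small hypothetical synonym table (`SYNONYMS`), the two scores are combined by taking the maximum, and the threshold of 0.8 is arbitrary.

```python
from difflib import SequenceMatcher

# Hypothetical synonym groups standing in for a real semantic resource.
# The abstract does not say which resource the authors used.
SYNONYMS = [
    {"st", "street"},
    {"ave", "avenue"},
    {"restaurant", "eatery", "diner"},
]


def string_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (assumed string measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def _concept(token: str) -> str:
    """Map a token to its synonym-group id, or to itself if it has none."""
    for index, group in enumerate(SYNONYMS):
        if token in group:
            return f"#{index}"
    return token


def semantic_similarity(a: str, b: str) -> float:
    """Jaccard overlap of concept sets in [0, 1] (assumed semantic measure)."""
    concepts_a = {_concept(t) for t in a.lower().split()}
    concepts_b = {_concept(t) for t in b.lower().split()}
    if not concepts_a or not concepts_b:
        return 0.0
    return len(concepts_a & concepts_b) / len(concepts_a | concepts_b)


def is_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Flag a pair as duplicate when either score reaches the threshold.

    Taking the maximum of the two scores is an assumption made for this
    sketch; a weighted sum would be another plausible choice.
    """
    score = max(string_similarity(a, b), semantic_similarity(a, b))
    return score >= threshold


if __name__ == "__main__":
    print(is_duplicate("Joe's Diner", "Joe's Eatery"))
    print(is_duplicate("Joe's Diner", "Blue Lagoon Sushi"))
```

In this sketch, "Joe's Diner" and "Joe's Eatery" differ too much at the character level to pass the threshold on string similarity alone, but the synonym table maps "diner" and "eatery" to the same concept, so the pair is still flagged. That is the kind of case the abstract says string-only measures miss.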
