ACM Transactions on Information Systems

Analysis of Lexical Signatures for Improving Information Persistence on the World Wide Web


Abstract

A lexical signature (LS) consisting of several key words from a Web document is often sufficient information for finding the document later, even if its URL has changed. We conduct a large-scale empirical study of nine methods for generating lexical signatures, including Phelps and Wilensky's original proposal (PW), seven of our own static variations, and one new dynamic method. We examine their performance on the Web over a 10-month period, and on a TREC data set, evaluating their ability to both (1) uniquely identify the original (possibly modified) document, and (2) locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but acts almost identically to DF on the Web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. The term-frequency inverse-document-frequency- (TFIDF-) based method and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates among static methods for generating effective lexical signatures. We propose a dynamic LS generator called Test & Select (TS) to mitigate LS conflict. TS outperforms all eight static methods in terms of both extracting the desired document and finding relevant information, over three different search engines. All LS methods show significant performance degradation as documents in the corpus are edited.
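The three basic static selection rules named in the abstract (TF, DF, and TFIDF) can be illustrated with a minimal sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the naive tokenizer, the smoothed IDF `log((N + 1) / (df + 1))`, the default signature length `k=5`, and the alphabetical tie-break. The function names are hypothetical and are not taken from the paper. PW, the hybrid methods, and Test & Select are not sketched because the abstract does not describe them in enough detail.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters; a deliberately naive tokenizer (assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def document_frequencies(corpus):
    """For each term, count how many documents in the corpus contain it."""
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    return df


def lexical_signature(doc, corpus, method="tfidf", k=5):
    """Return the k top-scoring terms of `doc` under the chosen rule.

    method="tf":    most frequent terms in the document first
    method="df":    terms that are rarest across the corpus first
    method="tfidf": term frequency times a smoothed inverse document frequency
    """
    tf = Counter(tokenize(doc))
    df = document_frequencies(corpus)
    n = len(corpus)

    scorers = {
        "tf": lambda t: tf[t],
        "df": lambda t: -df[t],
        # Add-one smoothing (assumption) avoids division by zero when the
        # document itself is not part of the corpus.
        "tfidf": lambda t: tf[t] * math.log((n + 1) / (df[t] + 1)),
    }
    if method not in scorers:
        raise ValueError(f"unknown method: {method}")
    score = scorers[method]

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(tf, key=lambda t: (-score(t), t))
    return ranked[:k]


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
print(lexical_signature(corpus[0], corpus, method="tf", k=2))
print(lexical_signature(corpus[0], corpus, method="df", k=2))
print(lexical_signature(corpus[0], corpus, method="tfidf", k=2))
```

On this toy corpus the TF rule picks the common word "the" first, while the DF and TFIDF rules favor "mat", which occurs in only one document. This mirrors the trade-off the abstract describes between terms that identify a document uniquely and terms that are merely frequent. A real TF-based generator would presumably need stop-word filtering, which this sketch omits.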
