Improving Web retrieval by mining the HTML tags for keywords and exploring the hyperlink structures of Web pages.

Abstract

The increasing amount of data stored on the World Wide Web (WWW) demands efficient techniques for information retrieval. Search engines often answer a query with millions of URLs, some of which are not directly related to the query. We explore different aspects of the Web to improve the quality of retrieval results.

We show how to derive a numerical score for a given page from three types of links to it, based on the page's "prestige". Using this score, we are able to rank the importance of the URLs returned by a search engine.

Similarities among Web documents can be employed to cluster and classify Web pages. We define a similarity measure among Web pages, and among sets of Web pages, using their hyperlink relationships, and then demonstrate how to use this measure to study clustering within a set of pages. Additionally, the locations of keywords in the structure of HTML documents are used to find pages similar to a given set of HTML documents. Our findings are used to re-rank the results obtained from popular search engines.

Keywords are used to index Web pages and facilitate search. However, not every document explicitly states its keywords; therefore, an algorithm is needed to discover the keywords from an HTML source file. We claim that there are relationships between the locations of keywords and HTML tags, and we employ data-mining techniques to discover association rules on such relationships; these rules can then be used to discover keywords hidden in documents.
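The abstract does not give the mining procedure itself, so the following is only a minimal sketch of the general idea of relating keyword locations to HTML tags. It assumes a small set of pages whose keywords are already known, records the innermost tag enclosing each word, and estimates the confidence of the rule "a word inside tag T is a keyword". The names `TagWordCollector` and `tag_keyword_confidence` are hypothetical and are not taken from the thesis.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagWordCollector(HTMLParser):
    """Record, for every word in a page, the innermost enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.pairs = []   # (tag, word) observations

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        tag = self.stack[-1] if self.stack else "(none)"
        for word in re.findall(r"[a-z]+", data.lower()):
            self.pairs.append((tag, word))


def tag_keyword_confidence(pages):
    """Estimate P(word is a keyword | word appears inside tag) per tag.

    `pages` is an iterable of (html_source, known_keywords) pairs, where
    known_keywords is a set of lowercase words (an assumed labelled sample).
    """
    total = Counter()   # words seen inside each tag
    hits = Counter()    # of those, words that are known keywords
    for html, keywords in pages:
        collector = TagWordCollector()
        collector.feed(html)
        collector.close()
        for tag, word in collector.pairs:
            total[tag] += 1
            if word in keywords:
                hits[tag] += 1
    return {tag: hits[tag] / total[tag] for tag in total}
```

Tags with high confidence on the labelled sample (for instance a title or heading tag) could then be used to propose candidate keywords for pages that do not declare any. A real rule miner would also apply minimum-support thresholds and consider combinations of tags, which this sketch leaves out.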
