Computer Networks

Focused crawling: a new approach to topic-specific Web resource discovery


Abstract

The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. To achieve such goal-directed crawling, we designed two hypertext mining programs that guide our crawler: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, and a distiller that identifies hypertext nodes that are great access points to many relevant pages within a few links. We report on extensive focused-crawling experiments using several topics at different levels of specificity. Focused crawling acquires relevant pages steadily while standard crawling quickly loses its way, even though they are started from the same root set. Focused crawling is robust against large perturbations in the starting set of URLs. It discovers largely overlapping sets of resources in spite of these perturbations. It is also capable of exploring out and discovering valuable resources…
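The crawl loop the abstract describes, a frontier ordered by the classifier's relevance judgments so that links found on relevant pages are expanded first and off-topic regions are pruned, can be sketched concretely. Below is a minimal, hypothetical Python illustration under loud assumptions: relevance() is a crude keyword-overlap stand-in for the trained classifier, PAGES is a toy in-memory link graph replacing real HTTP fetching, the distiller is omitted, and every name is invented for illustration rather than taken from the authors' system.

```python
import heapq

# Toy in-memory "Web": url -> (page text, outlinks). A stand-in for real fetching.
PAGES = {
    "seed": ("recreational cycling clubs and touring routes", ["a", "b"]),
    "a":    ("mountain biking trails and bicycle repair",     ["c", "d"]),
    "b":    ("stock market quotes and financial news",        ["e"]),
    "c":    ("cycling races, gear reviews, bike maintenance", []),
    "d":    ("weather forecast for the weekend",              []),
    "e":    ("currency exchange rates",                       []),
}

# Topic terms, as if distilled from the user's exemplary documents (an assumption).
TOPIC = {"cycling", "bicycle", "bike", "biking"}

def relevance(text: str) -> float:
    """Keyword-overlap stand-in for the paper's classifier: share of topic terms seen."""
    return len(set(text.split()) & TOPIC) / len(TOPIC)

def focused_crawl(seed: str, threshold: float = 0.1):
    # Max-heap frontier keyed by the relevance of the page where each link was found,
    # so links discovered on relevant pages are expanded first (one plausible policy).
    frontier = [(-1.0, seed)]
    visited, harvested = set(), []
    while frontier:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        text, outlinks = PAGES[url]
        score = relevance(text)
        if score >= threshold:                  # expand only pages judged on-topic
            harvested.append((url, round(score, 2)))
            for link in outlinks:
                heapq.heappush(frontier, (-score, link))
    return harvested

print(focused_crawl("seed"))  # [('seed', 0.25), ('a', 0.5), ('c', 0.5)]
```

Running this harvests seed, a, and c; pages b and d are fetched but judged irrelevant, so b's outlink e is never even queued. That pruning of whole off-topic regions is the source of the hardware and network savings the abstract claims.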
