Computer Networks

Focused crawling: a new approach to topic-specific Web resource discovery


Abstract

The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. To achieve such goal-directed crawling, we designed two hypertext mining programs that guide our crawler: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, and a distiller that identifies hypertext nodes that are great access points to many relevant pages within a few links. We report on extensive focused-crawling experiments using several topics at different levels of specificity. Focused crawling acquires relevant pages steadily while standard crawling quickly loses its way, even though they are started from the same root set. Focused crawling is robust against large perturbations in the starting set of URLs. It discovers largely overlapping sets of resources in spite of these perturbations. It is also capable of exploring out and discovering valuable resources…
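The crawl loop the abstract describes, a frontier ordered by the classifier's relevance judgments so that links found on relevant pages are expanded first and off-topic regions are pruned, can be sketched concretely. Below is a minimal, hypothetical Python illustration under loud assumptions: relevance() is a crude keyword-overlap stand-in for the trained classifier, PAGES is a toy in-memory link graph replacing real HTTP fetching, the distiller is omitted, and every name is invented for illustration rather than taken from the authors' system.

```python
import heapq

# Toy in-memory "Web": url -> (page text, outlinks). A stand-in for real fetching.
PAGES = {
    "seed": ("recreational cycling clubs and touring routes", ["a", "b"]),
    "a":    ("mountain biking trails and bicycle repair",     ["c", "d"]),
    "b":    ("stock market quotes and financial news",        ["e"]),
    "c":    ("cycling races, gear reviews, bike maintenance", []),
    "d":    ("weather forecast for the weekend",              []),
    "e":    ("currency exchange rates",                       []),
}

# Topic terms, as if distilled from the user's exemplary documents (an assumption).
TOPIC = {"cycling", "bicycle", "bike", "biking"}

def relevance(text: str) -> float:
    """Keyword-overlap stand-in for the paper's classifier: share of topic terms seen."""
    return len(set(text.split()) & TOPIC) / len(TOPIC)

def focused_crawl(seed: str, threshold: float = 0.1):
    # Max-heap frontier keyed by the relevance of the page where each link was found,
    # so links discovered on relevant pages are expanded first (one plausible policy).
    frontier = [(-1.0, seed)]
    visited, harvested = set(), []
    while frontier:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        text, outlinks = PAGES[url]
        score = relevance(text)
        if score >= threshold:                  # expand only pages judged on-topic
            harvested.append((url, round(score, 2)))
            for link in outlinks:
                heapq.heappush(frontier, (-score, link))
    return harvested

print(focused_crawl("seed"))  # [('seed', 0.25), ('a', 0.5), ('c', 0.5)]
```

Running this harvests seed, a, and c; pages b and d are fetched but judged irrelevant, so b's outlink e is never even queued. That pruning of whole off-topic regions is the source of the hardware and network savings the abstract claims.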
