Crawling the Infinite Web: Five Levels Are Enough


Abstract

A large number of publicly available Web pages are generated dynamically upon request, and contain links to other dynamically generated pages. This usually produces Web sites which can create arbitrarily many pages. In this article, several probabilistic models for browsing "infinite" Web sites are proposed and studied. We use these models to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited. The proposed models are validated against real data on page views in several Web sites, showing that, in both theory and practice, a crawler needs to download just a few levels, no more than 3 to 5 "clicks" away from the start page, to reach 90% of the pages that users actually visit.
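
As a rough illustration of the kind of estimate the abstract describes, the sketch below assumes a deliberately simple browsing model that is not taken from the paper: a user starts at depth 0 and, after each page, follows a link one level deeper with probability q or leaves the site otherwise. Under that assumption the share of page views at depth d or less is 1 - q^(d+1). The function names and the parameter q are hypothetical, chosen only for this example.

```python
def coverage(q: float, depth: int) -> float:
    """Share of page views at depth <= `depth` under the assumed model.

    Assumption (not from the paper): views at depth k are proportional
    to q**k, so the cumulative share up to `depth` is 1 - q**(depth + 1).
    """
    return 1.0 - q ** (depth + 1)


def depth_for_coverage(q: float, target: float = 0.9) -> int:
    """Smallest crawl depth whose coverage reaches `target`."""
    if not 0.0 <= q < 1.0:
        raise ValueError("q must be in [0, 1)")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    depth = 0
    # Coverage grows toward 1 as depth increases, so this loop terminates.
    while coverage(q, depth) < target:
        depth += 1
    return depth


if __name__ == "__main__":
    for q in (0.5, 0.6):
        print(q, depth_for_coverage(q))
```

With these assumed values of q, a target of 90% is reached at depth 3 (q = 0.5) and depth 4 (q = 0.6). The paper's own models and validation data may differ from this toy model.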


Bibliographic details

Source
International Workshop on Algorithms and Models for the Web-Graph (WAW 2004), 16 October 2004, Rome (IT) | 2004 | pp. 156-167 | 12 pages
Conference location: Rome (IT)
Authors
Ricardo Baeza-Yates; Carlos Castillo
Author affiliation

Center for Web Research, DCC, Universidad de Chile

Original format: PDF
Language: English
Chinese Library Classification: Computer networks

Similar literature

1. GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources [J]. Chih-Yuan Huang, Hao Chang. ISPRS International Journal of Geo-Information, 2016, No. 8.
2. Medical informatics labor market analysis using web crawling, web scraping, and text mining [J]. Schedlbauer Jurgen, Raptis Georgios, Ludwig Bernd. International Journal of Medical Informatics, 2021, No. Juna.
3. Optimal Web Page Download Scheduling Policies for Green Web Crawling [J]. Vassiliki Hatzi, B. Barla Cambazoglu, Iordanis Koutsopoulos. IEEE Journal on Selected Areas in Communications, 2016, No. 5.
4. Crawling the Infinite Web: Five Levels Are Enough [C]. Ricardo Baeza-Yates, Carlos Castillo. International Workshop on Algorithms and Models for the Web-Graph, 2004.
5. Crawling the Web: Discovery and maintenance of large-scale Web data [D]. Cho, Junghoo. 2002.
6. An Efficient Approach for Web Indexing of Big Data through Hyperlinks in Web Crawling [O]. R. Suganya Devi, D. Manjula, R. K. Siddharth. 2015.
7. Crawling the infinite Web: five levels are enough [O]. Ricardo Baeza-Yates, Carlos Castillo. 2004.
