Journal: International Journal of Intelligent Information and Database Systems

Mining the web with hierarchical crawlers - a resource sharing based crawling approach



Abstract

An important component of any web search engine is its crawler, also known as a robot or spider. An efficient set of crawlers makes a search engine more powerful, alongside its other performance factors such as its ranking algorithm, storage mechanism, and indexing techniques. In this paper, we propose an extended technique for crawling the World Wide Web (WWW) on behalf of a search engine. The approach combines multiple crawlers working in parallel with the mechanism of focused crawling (Chakrabarti et al., 1999a, 2002; Mukhopadhyay et al., 2006). The structure of a website is divided into several levels, based on its hyperlink structure, for downloading web pages from that website (Chakrabarti et al., 1999b; Mukhopadhyay and Singh, 2004). The number of crawlers at each level is not fixed but dynamic: it is determined on demand at execution time by a threaded program, based on the number of hyperlinks on a specific web page. This paper also proposes a focused hierarchical crawling technique, in which crawlers are created dynamically at runtime for different domains and crawl web pages while sharing resources.
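
The level-wise idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the in-memory `SITE` graph, the `MAX_CRAWLERS` cap, and the names `fetch_links` and `crawl_by_level` are all hypothetical, and the rule used here (one crawler thread per page in the current level, up to the cap) is only one plausible reading of "determined by the number of hyperlinks". The shared `visited` set stands in for the resource sharing among crawlers.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory website: page -> outgoing hyperlinks.
# It stands in for real HTTP fetching and HTML link extraction.
SITE = {
    "/": ["/a", "/b", "/c"],
    "/a": ["/a1", "/a2"],
    "/b": ["/a1", "/b1"],
    "/c": [],
    "/a1": [],
    "/a2": ["/"],
    "/b1": [],
}

MAX_CRAWLERS = 8  # assumed upper bound on threads per level


def fetch_links(page):
    """Return the hyperlinks found on a page (simulated lookup)."""
    return SITE.get(page, [])


def crawl_by_level(root, max_crawlers=MAX_CRAWLERS):
    """Crawl level by level; return a list of (pages, crawler_count) per level."""
    visited = {root}          # shared among all crawler threads
    lock = threading.Lock()   # guards the shared visited set

    def crawl_page(page):
        # Each crawler claims unseen links under the lock, so a page
        # linked from several parents is scheduled only once.
        new_links = []
        for link in fetch_links(page):
            with lock:
                if link not in visited:
                    visited.add(link)
                    new_links.append(link)
        return new_links

    levels = []
    frontier = [root]
    while frontier:
        # The crawler count is chosen at runtime from the size of the level.
        crawlers = min(len(frontier), max_crawlers)
        levels.append((sorted(frontier), crawlers))
        with ThreadPoolExecutor(max_workers=crawlers) as pool:
            results = list(pool.map(crawl_page, frontier))
        # Links discovered at this level form the next level.
        frontier = [link for links in results for link in links]
    return levels
```

Calling `crawl_by_level("/")` returns one entry per level of the link hierarchy, each pairing the pages in that level with the number of crawler threads used for it. A real crawler would replace `fetch_links` with network requests and add politeness delays, error handling, and a topic filter for focused crawling.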
