An effective and efficient Web content extractor for optimizing the crawling process


Abstract

Classical Web crawlers use only hyperlink information in the crawling process. Focused crawlers, however, are intended to download only Web pages that are relevant to a given topic, using word information before downloading each page. Web pages nevertheless contain additional information that can be useful for the crawling process. We have developed a crawler, iCrawler (intelligent crawler), the backbone of which is a Web content extractor that automatically pulls content out of seven different blocks of a Web page: menus, links, main texts, headlines, summaries, additional necessaries, and unnecessary texts. The extraction process consists of two steps, which invoke each other to obtain information from the blocks. The first step learns which HTML tags refer to which blocks using the decision tree learning algorithm. Being guided by numerous sources of information, the crawler becomes considerably effective. It achieved a relatively high accuracy of 96.37% in our block extraction experiments. In the second step, the crawler extracts content from the blocks using string matching functions. These functions, along with the mapping between tags and blocks learned in the first step, provide iCrawler with considerable time and storage efficiency. More specifically, iCrawler runs 14 times faster in the second step than in the first step. Furthermore, iCrawler reduces storage costs by 57.10% compared with the texts obtained through classical HTML stripping.
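The second step described in the abstract (pulling block content out of a page with plain string matching, once a tag-to-block mapping is known) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the string matching functions, so `extract_block`, `extract_blocks`, the block labels, and the example `TAG_TO_BLOCK` mapping are all hypothetical, and the mapping is hard-coded here rather than learned by a decision tree.

```python
from typing import Dict, Optional

# Hypothetical mapping of opening tags to block labels. In the paper this
# mapping is learned in the first step; here it is hard-coded for illustration.
TAG_TO_BLOCK: Dict[str, str] = {
    '<div class="menu">': "menu",
    '<h1 class="title">': "headline",
    '<div class="body">': "main_text",
}


def extract_block(html: str, open_tag: str) -> Optional[str]:
    """Return the inner HTML of the first element that starts with open_tag.

    Uses only str.find, counting nested elements of the same name so that
    the matching closing tag is found. Returns None if the tag is absent
    or the markup is unbalanced.
    """
    start = html.find(open_tag)
    if start == -1:
        return None
    name = open_tag[1:].split()[0].rstrip(">")  # '<div class="x">' -> 'div'
    open_prefix = "<" + name   # simplification: would also match e.g. '<divx'
    close_tag = "</" + name + ">"
    content_start = start + len(open_tag)
    depth, pos = 1, content_start
    while depth > 0:
        next_close = html.find(close_tag, pos)
        if next_close == -1:
            return None  # no closing tag left: unbalanced markup
        next_open = html.find(open_prefix, pos)
        if next_open != -1 and next_open < next_close:
            depth += 1  # a nested element of the same name opens first
            pos = next_open + len(open_prefix)
        else:
            depth -= 1
            pos = next_close + len(close_tag)
    return html[content_start:pos - len(close_tag)]


def extract_blocks(html: str, tag_to_block: Dict[str, str]) -> Dict[str, str]:
    """Apply extract_block for every known tag and key results by block label."""
    blocks: Dict[str, str] = {}
    for open_tag, label in tag_to_block.items():
        content = extract_block(html, open_tag)
        if content is not None:
            blocks[label] = content
    return blocks


page = (
    '<div class="menu"><div>Home</div><div>About</div></div>'
    '<h1 class="title">Hello</h1>'
    '<div class="body"><p>Text</p></div>'
)
blocks = extract_blocks(page, TAG_TO_BLOCK)
```

Because the lookup needs no DOM parse, only substring searches for tags already known to mark a block, a sketch like this suggests why a string matching pass could be cheaper than the learning pass; the 14-fold speedup reported in the abstract is the authors' measurement and is not reproduced here.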

Bibliographic record

  • Source
    Software, 2014, No. 10, pp. 1181-1199 (19 pages)
  • Author affiliations

    Department of Computer Engineering, Corlu Engineering Faculty, Namik Kemal University, Corlu Tekirdag, Turkey;

    Department of Computer Engineering, Campus of Ahmet Karadeniz, Trakya University, Edirne, Turkey;

    Department of Computer Engineering, Campus of Ahmet Karadeniz, Trakya University, Edirne, Turkey;

    Department of Computer Engineering, Campus of Ahmet Karadeniz, Trakya University, Edirne, Turkey;

    Department of Computer Engineering, Campus of Ahmet Karadeniz, Trakya University, Edirne, Turkey;

  • Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
  • Original format: PDF
  • Text language: English
  • Chinese Library Classification
  • Keywords

    Web content extraction; Web crawling; classification; intelligent systems;

