Robots Exclusion and Guidance Protocol

         

Abstract

With the rapid development of the Internet, general-purpose web crawlers are increasingly unable to meet users' individual needs because they are no longer efficient enough to fetch deep web pages. The presence of many deep web pages on websites and the widespread use of Ajax make it difficult for general-purpose web crawlers to fetch information quickly and efficiently. Building on the original Robots Exclusion Protocol (REP), this paper proposes a Robots Exclusion and Guidance Protocol (REGP) that integrates the independent, scattered extensions of the original Robots Protocol developed by major search engine companies. The protocol extends the file format and command set of the REP, as well as two labels of the Sitemap Protocol. Through the protocol, websites can express their requirements for restricting and guiding visiting crawlers, give crawlers general-purpose fast access to deep web pages and Ajax pages, and make it easier for crawlers to obtain the open data on websites. Finally, the paper presents a specific application scenario in which both a website and a crawler work with support from the protocol. A series of experiments is also conducted to demonstrate the efficiency of the proposed protocol.
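
For orientation, the baseline that REGP builds on can be sketched with Python's standard `urllib.robotparser` module. This is only an illustration of the original REP together with two widely deployed search-engine extensions (`Crawl-delay` and `Sitemap`); it does not show any REGP directive, since the abstract does not list them. The robots.txt content, the `ExampleBot` user agent, and the `example.com` URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration,
# not taken from the paper). Disallow is part of the original REP;
# Crawl-delay and Sitemap are later search-engine extensions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Restriction: may this crawler fetch a given URL?
blocked = rules.can_fetch("ExampleBot", "https://example.com/private/data.html")
allowed = rules.can_fetch("ExampleBot", "https://example.com/public/index.html")

# Guidance: how long to wait between requests, and where the sitemap is.
delay = rules.crawl_delay("ExampleBot")
sitemaps = rules.site_maps()  # requires Python 3.8 or later

print(blocked, allowed, delay, sitemaps)
```

The split between "may I fetch this?" and "how should I crawl?" mirrors the exclusion and guidance roles named in the protocol's title, although the paper's own command set may differ from these directives.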

Bibliographic information

  • Source
    Journal of Tsinghua University (English Edition) | 2016, Issue 6 | pp. 643-659 | 17 pages
  • Authors

    Dajie Ge; Zhijun Ding;

  • Author affiliations

    Department of Computer Science and Technology, Tongji University, Shanghai 201804, China;

    Department of Computer Science and Technology, Tongji University, Shanghai 201804, China;

  • Original format: PDF
  • Text language: English

