Journal: The Scientific World Journal

An Efficient Approach for Web Indexing of Big Data through Hyperlinks in Web Crawling

Abstract

Web crawling has gained considerable importance in recent years, in step with the rapid growth of the World Wide Web. The sheer volume of available web documents poses new challenges for web search engines and makes the retrieved results less useful to analysers. Current web crawling, however, focuses solely on obtaining the links of the corresponding documents. Various algorithms and software tools exist for crawling links from the web, but the crawled links must be processed further before they can be used, which increases the analyser's workload. This paper concentrates on crawling the links and retrieving all information associated with them to facilitate easy processing for other uses. First, the links are crawled from a specified uniform resource locator (URL) using a modified version of the depth-first search algorithm, which allows complete hierarchical scanning of the corresponding web links. Each link is then accessed through its source code, and metadata such as the title, keywords, and description are extracted. This content is essential for any analysis carried out on the Big Data obtained through web crawling.
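The two steps named in the abstract (depth-first link crawling, then extraction of title, keywords, and description) can be illustrated with a short Python sketch. The abstract does not say how the authors modified depth-first search, so this sketch uses a plain iterative depth-first traversal with a depth limit and should not be read as the paper's algorithm. The names `PageParser`, `fetch_html`, `crawl`, and `max_depth` are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects hyperlinks plus title, keywords, and description from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.meta = {"title": "", "keywords": "", "description": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] += data.strip()


def fetch_html(url):
    """Default fetcher (hypothetical helper): download a page as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_url, fetch=fetch_html, max_depth=3):
    """Depth-first crawl from seed_url; returns {url: metadata} in visit order."""
    index = {}
    stack = [(seed_url, 0)]
    while stack:
        url, depth = stack.pop()
        if url in index or depth > max_depth:
            continue
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        parser = PageParser()
        parser.feed(html)
        index[url] = parser.meta
        # Push in reverse so the first link on the page is explored first.
        for href in reversed(parser.links):
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in index:
                stack.append((absolute, depth + 1))
    return index
```

Passing the fetcher as a parameter keeps the traversal testable without network access: any callable that maps a URL to an HTML string can stand in for `fetch_html`. A real crawler would also need to respect robots.txt and rate limits, which this sketch omits.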