Conference: IEEE International Conference on Web Services

Method for Fast and Accurate Extraction of Key Information from Webpages


Abstract

As the World Wide Web continues to grow without bound, users expect intelligent processing and accurate coverage of all its domains. To meet this expectation, we present a novel approach for identifying and extracting key information from webpages. Where available, we extract important information such as the Title, Main Image, Description, Keywords and FavIcon of a webpage using only the HTML response, without any explicit webpage rendering. The algorithm is designed to be fast without compromising accuracy; it is fully automatic and language independent, and it runs without human supervision or training. We test the algorithm on over one hundred thousand webpages and successfully extract the key information for 97% of them, with an average extraction time of less than 500 milliseconds per webpage.
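
The abstract does not describe the algorithm itself, so the following is only a minimal sketch of what render-free extraction of these five fields can look like, not the authors' method. It assumes the page exposes common metadata conventions (the `<title>` element, `description` and `keywords` meta tags, Open Graph `og:title` and `og:image`, and a `<link rel="icon">`), and the names `KeyInfoParser` and `extract_key_info` are hypothetical. It uses only Python's standard `html.parser` on the raw HTML string.

```python
from html.parser import HTMLParser


class KeyInfoParser(HTMLParser):
    """Hypothetical helper: collects common metadata tags from raw HTML.

    Assumption: key information is exposed through standard head tags.
    The paper's actual heuristics are not given in the abstract.
    """

    def __init__(self):
        super().__init__()
        self.info = {"title": None, "image": None, "description": None,
                     "keywords": [], "favicon": None}
        self.title_parts = []      # text chunks of the first <title> element
        self._in_title = False
        self._title_done = False   # ignore later <title> tags (e.g. inside SVG)

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "title" and not self._title_done:
            self._in_title = True
        elif tag == "meta":
            key = (a.get("property") or a.get("name") or "").lower()
            content = a.get("content", "").strip()
            if not content:
                return
            if key == "og:title":
                self.info["title"] = content
            elif key == "og:image" and self.info["image"] is None:
                self.info["image"] = content
            elif key in ("description", "og:description") and self.info["description"] is None:
                self.info["description"] = content
            elif key == "keywords":
                self.info["keywords"] = [k.strip() for k in content.split(",") if k.strip()]
        elif tag == "link":
            # rel may hold several tokens, e.g. "shortcut icon"
            if "icon" in a.get("rel", "").lower().split() and self.info["favicon"] is None:
                self.info["favicon"] = a.get("href") or None

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._title_done = True

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_key_info(html):
    """Return a dict of title, image, description, keywords and favicon."""
    parser = KeyInfoParser()
    parser.feed(html)
    parser.close()
    if parser.info["title"] is None:
        # Fall back to the <title> text when no og:title was found.
        parser.info["title"] = "".join(parser.title_parts).strip() or None
    return parser.info
```

Calling `extract_key_info(html)` on a fetched response body returns a dictionary in which any field the page does not expose stays `None` (or an empty list for keywords). A sketch like this covers only pages with explicit metadata; pages without it would need content-based heuristics, which the abstract mentions only in outline.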
