Venue: European Conference on Machine Learning and Knowledge Discovery in Databases

Bootstrapping Information Extraction from Semi-structured Web Pages



Abstract

We consider the problem of extracting structured records from semi-structured web pages with no human supervision required for each target web site. Previous work on this problem has either required significant human effort for each target site or used brittle heuristics to identify semantic data types. Our method only requires annotation for a few pages from a few sites in the target domain. Thus, after a tiny investment of human effort, our method allows automatic extraction from potentially thousands of other sites within the same domain. Our approach extends previous methods for detecting data fields in semi-structured web pages by matching those fields to domain schema columns using robust models of data values and contexts. Annotating 2-5 pages for 4-6 web sites yields an extraction accuracy of 83.8% on job offer sites and 91.1% on vacation rental sites. These results significantly outperform a baseline approach.
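The matching step described in the abstract (assigning detected data fields to domain schema columns from their values and surrounding context) can be illustrated with a minimal sketch. This is not the authors' model: the coarse value "shape" feature, the bag-of-words context feature, the cosine scoring, and all names (`shape`, `features`, `match_field`, the example columns `salary` and `location`) are assumptions chosen only for illustration.

```python
import re
from collections import Counter


def shape(value: str) -> str:
    """Collapse a value to a coarse shape: letter runs -> 'a', digit runs -> '9'."""
    s = re.sub(r"[A-Za-z]+", "a", value)
    return re.sub(r"\d+", "9", s)


def features(values, context: str) -> Counter:
    """Build a feature bag from a field's values and its nearby context words."""
    feats = Counter()
    for v in values:
        feats["shape=" + shape(v)] += 1
    for word in context.lower().split():
        feats["ctx=" + word] += 1
    return feats


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse feature bags."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_field(field_values, field_context, column_profiles):
    """Return the best-scoring schema column and all per-column scores."""
    f = features(field_values, field_context)
    scores = {col: cosine(f, prof) for col, prof in column_profiles.items()}
    return max(scores, key=scores.get), scores


# Hypothetical column profiles, standing in for the few annotated pages.
profiles = {
    "salary": features(["$50,000", "$72,500"], "salary pay"),
    "location": features(["Austin, TX", "Boston, MA"], "location city"),
}

# A field detected on an unseen site, matched against the profiles.
best, scores = match_field(["$61,000", "$48,250"], "annual salary", profiles)
print(best, scores)
```

In this toy setup the unseen field shares both its value shape and one context word with the `salary` profile, so it scores highest there. A realistic system would need far richer value and context models than this sketch.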
