Journal: 《模式识别与人工智能》 (Pattern Recognition and Artificial Intelligence)

Multi-Record Webpage Extraction Based on DOM Tree Hierarchical Features

Abstract

Existing multi-record webpage extraction methods usually analyze the document object model (DOM) tree as a whole along its vertical (depth-wise) structure. The structural similarity computed this way is generally low, so record regions cannot be identified correctly. This paper proposes a method named data record extraction based on DOM tree hierarchical feature (DEBHF). It analyzes the DOM tree transversely (level by level), exploiting the different roles that nodes play at different levels, and thereby converts the problem of finding similar subtrees into the problem of finding similar sub-blocks within node blocks. Finally, a two-way expanding search for non-overlapping repeated sub-blocks segments the records. Experiments show that the method can extract pages that existing extractors cannot handle, and extraction results on multiple data sources confirm its effectiveness.
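The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of finding non-overlapping repeated sub-blocks, not the DEBHF algorithm itself. It assumes that the children of one DOM node have already been reduced to a flat list of tag names, that sub-blocks must match exactly, and that a forward scan from every start position can stand in for the two-way expanding search. The names `find_repeated_block` and `split_records` are hypothetical.

```python
from typing import List, Optional, Tuple


def find_repeated_block(
    tags: List[str], min_repeats: int = 2
) -> Optional[Tuple[int, int, int]]:
    """Return (start, block_length, repeats) for the consecutive,
    non-overlapping repeated sub-block that covers the most nodes,
    or None if no sub-block repeats at least `min_repeats` times."""
    best: Optional[Tuple[int, int, int]] = None
    n = len(tags)
    for length in range(1, n // min_repeats + 1):
        for start in range(0, n - length * min_repeats + 1):
            block = tags[start:start + length]
            repeats = 1
            pos = start + length
            # Extend forward while the next non-overlapping window matches.
            while tags[pos:pos + length] == block:
                repeats += 1
                pos += length
            if repeats >= min_repeats:
                covered = repeats * length
                # Strict '>' keeps the shortest, earliest block on ties.
                if best is None or covered > best[1] * best[2]:
                    best = (start, length, repeats)
    return best


def split_records(tags: List[str], min_repeats: int = 2) -> List[Tuple[int, int]]:
    """Return half-open (begin, end) index ranges, one per detected record."""
    found = find_repeated_block(tags, min_repeats)
    if found is None:
        return []
    start, length, repeats = found
    return [(start + i * length, start + (i + 1) * length) for i in range(repeats)]


# Hypothetical sibling tags under one container node.
siblings = ["h2", "div", "p", "div", "p", "div", "p", "footer"]
print(split_records(siblings))  # [(1, 3), (3, 5), (5, 7)]
```

Exact matching on tag names is the simplest possible similarity test. Real pages would likely need a tolerant similarity measure, which this sketch does not attempt.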
