Conference: Mobile Business, 2009. ICMB 2009

HTML Tree Parsing Algorithm Based on Pre-extracted Data

Abstract

This paper proposes a new method for extracting an HTML tree from web pages. The main idea is to handle the parts of a page that are hard to parse, including tags and attributes, in advance; the remaining parts are then tidied and parsed, and finally both of the previously extracted parts are deposited in the tree. By integrating the tidying process and the parsing process, the method both preserves the integrity of the web data and reduces the complexity of the algorithms. Tests show that it can parse all kinds of web pages and provides concrete fault-tolerance mechanisms.
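
The three stages named in the abstract (pre-extract, tidy and parse, deposit) can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not say which parts count as hard to parse, so the sketch assumes they are comments and `<script>`/`<style>` blocks. The names `HARD`, `Node`, `TreeBuilder`, `parse`, `restore`, and the `x-pre` placeholder element are all hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumption: comments and <script>/<style> blocks are the "hard" parts.
HARD = re.compile(r"<!--.*?-->|<(script|style)\b[^>]*>.*?</\1\s*>", re.S | re.I)
VOID = {"br", "hr", "img", "input", "link", "meta"}  # elements with no end tag


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs or []), parent
        self.children = []  # child Node objects or text strings


class TreeBuilder(HTMLParser):
    """Builds a simple tree and tolerates unclosed or stray tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        n = self.cur
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.cur = n.parent

    def handle_data(self, data):
        if data.strip():
            self.cur.children.append(data)


def restore(node, stash):
    # Stage 3: swap each placeholder element for the raw text it stands for.
    for child in node.children:
        if isinstance(child, Node):
            if child.tag == "x-pre":
                child.tag = "#raw"
                child.children = [stash[int(child.attrs.pop("data-id"))]]
            else:
                restore(child, stash)


def parse(html):
    stash = []

    def hold(match):
        stash.append(match.group(0))
        return f'<x-pre data-id="{len(stash) - 1}"></x-pre>'

    rest = HARD.sub(hold, html)   # Stage 1: pre-extract the hard parts
    builder = TreeBuilder()
    builder.feed(rest)            # Stage 2: tidy and parse the remainder
    builder.close()
    restore(builder.root, stash)  # Stage 3: deposit extracted parts in the tree
    return builder.root


page = '<div><p>Hi</p><script>if (a < b) document.write("</div>");</script><b>unclosed</div>'
root = parse(page)
```

In this sketch the `</div>` inside the script never reaches the parser, so it cannot close the outer `div` early, and the unclosed `<b>` is closed implicitly when the real `</div>` arrives.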