Journal: ACM Transactions on the Web

Classification of Layout vs. Relational Tables on the Web: Machine Learning with Rendered Pages


Abstract

Table mining on the web is an open problem, and none of the previously proposed techniques provides a complete solution. Most research focuses on the structure of the HTML document, but because of the nature and structure of the web, detecting relational tables remains challenging. The Web Content Accessibility Guidelines (WCAG) also include a wide range of recommendations for making tables accessible, but our previous work shows that these recommendations are not followed either; tables therefore remain inaccessible to disabled people and to automated processing. We propose a new approach to table mining that looks not at the HTML structure but at the pages as rendered by the browser. The first task in table mining on the web is to classify relational vs. layout tables, and we propose two alternative approaches for that task. We first introduce our dataset, which includes 725 web pages with 9,957 extracted tables. Our first approach extracts features from a page after it has been rendered by the browser, then applies several machine learning algorithms to classify layout vs. relational tables. The best result is obtained with Random Forest, with an accuracy of 97.2% (F1-score: 0.955) under 10-fold cross-validation. Our second approach classifies tables from images taken from the same sources using a Convolutional Neural Network (CNN), which gives an accuracy of 95% (F1-score: 0.95). Our work shows that the web's true essence emerges after it goes through a browser, and that by using the rendered pages and tables, the classification is more accurate than in the literature and paves the way for making tables more accessible.
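The abstract reports accuracy and F1-score under 10-fold cross-validation. As a rough illustration of how such figures are typically computed, the following is a minimal sketch and not the authors' code. The helper names `k_fold_indices` and `accuracy_and_f1` are hypothetical, and treating the relational class as the positive label (1) is an assumption, since the abstract does not say which class the F1-score refers to. Shuffling and stratification, which a real evaluation would normally include, are omitted for brevity.

```python
from typing import List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k (train, test) index pairs of near-equal size.

    Hypothetical helper: contiguous folds, no shuffling or stratification.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for i in range(k):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


def accuracy_and_f1(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary labels.

    Assumption: `positive` marks the relational-table class.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = (2 * tp / denom) if denom else 0.0
    return accuracy, f1
```

In use, a classifier would be trained on each `train` index set and evaluated on the matching `test` set, with predictions pooled or per-fold scores averaged. Because layout and relational tables need not be equally frequent, accuracy alone can be misleading, which is one reason an F1-score is usually reported alongside it.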
