Conference: Annual neural information processing systems conference

Improving a Page Classifier with Anchor Extraction and Link Analysis


Abstract

Most text categorization systems use simple models of documents and document collections. In this paper we describe a technique that improves a simple web page classifier's performance on pages from a new, unseen web site, by exploiting link structure within a site as well as page structure within hub pages. On real-world test cases, this technique significantly and substantially improves the accuracy of a bag-of-words classifier, reducing error rate by about half, on average. The system uses a variant of co-training to exploit unlabeled data from a new site. Pages are labeled using the base classifier; the results are used by a restricted wrapper-learner to propose potential "main-category anchor wrappers"; and finally, these wrappers are used as features by a third learner to find a categorization of the site that implies a simple hub structure, but which also largely agrees with the original bag-of-words classifier.
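The three stages named in the abstract (base classifier, wrapper proposal, final relabeling) can be illustrated with a toy sketch. This is not the authors' implementation. The site data, the keyword classifier, the `(hub, path)` representation of a wrapper, and every function name below are hypothetical. The third stage is also deliberately simplified: instead of a learner that uses wrappers as features, it keeps the single wrapper whose extracted pages agree best (by F1) with the base classifier's labels.

```python
from collections import defaultdict

# Hypothetical toy site: page URL -> bag of words (all data is invented).
PAGES = {
    "/people/alice": {"professor", "research", "publications"},
    "/people/bob": {"professor", "teaching", "research"},
    "/people/carol": {"welcome", "photos", "hobbies"},  # atypical text
    "/news/1": {"research", "award", "announcement"},
    "/contact": {"address", "phone"},
}

# Anchors found on hub pages: (hub URL, HTML path of the anchor, target URL).
LINKS = [
    ("/people", "ul/li/a", "/people/alice"),
    ("/people", "ul/li/a", "/people/bob"),
    ("/people", "ul/li/a", "/people/carol"),
    ("/people", "div/p/a", "/contact"),
    ("/index", "div/a", "/news/1"),
    ("/index", "div/a", "/contact"),
]

# Assumption: a keyword set stands in for a trained bag-of-words model.
KEYWORDS = {"professor", "publications", "teaching"}


def base_classify(words):
    """Stage 1: toy bag-of-words classifier (positive if any keyword occurs)."""
    return len(words & KEYWORDS) >= 1


def propose_wrappers(links):
    """Stage 2: group anchors by (hub, path); each group is a candidate wrapper."""
    extracted = defaultdict(set)
    for hub, path, target in links:
        extracted[(hub, path)].add(target)
    return dict(extracted)


def agreement(extracted, positives):
    """F1 between a wrapper's extracted pages and the base classifier's positives."""
    overlap = len(extracted & positives)
    if overlap == 0:
        return 0.0
    precision = overlap / len(extracted)
    recall = overlap / len(positives)
    return 2 * precision * recall / (precision + recall)


def relabel(pages, links):
    """Stage 3 (simplified): keep the wrapper that best agrees with the base labels."""
    positives = {url for url, words in pages.items() if base_classify(words)}
    wrappers = propose_wrappers(links)
    best = max(wrappers, key=lambda w: agreement(wrappers[w], positives))
    # A page is positive in the final labeling iff the chosen wrapper extracts it.
    return best, {url: url in wrappers[best] for url in pages}


best_wrapper, final_labels = relabel(PAGES, LINKS)
```

In this toy data the base classifier misses `/people/carol` because its text contains none of the keywords, but the page sits in the same list on the `/people` hub as the two pages the classifier did find, so the chosen wrapper relabels it as positive. That is the intuition behind exploiting hub structure: a regular anchor pattern on a hub page can correct errors made on individual page text.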
