M-InfoSift: A graph-based approach for multiclass document classification.

Abstract

With the amount of data added to the Internet growing daily, managing these large volumes of data has become an unavoidable problem. Document classification has been examined, explored, and experimented with as a technique for organizing and managing vast repositories of electronic documents such as emails, text, and web pages. Over the past decade, several approaches drawn from machine learning, data mining, information retrieval, and other areas have been proposed for classifying electronic documents. While a majority of these techniques rely on extracting high-frequency keywords, they ignore groups of related keywords. Additionally, they fail to capture the salient relationships among keywords and their inherent structure, which can prove decisive in classifying specific types of documents (e.g., web pages). To this end, InfoSift was proposed; its design incorporates graph mining techniques for document classification using a supervised learning model. Perhaps for the first time, it was shown how the structure within a document can be used for classification. It was also shown that the techniques can be applied to different types of documents, such as text, email, and web pages. The framework focused on identifying representative substructures using a graph mining approach and on classifying an incoming unknown document into a folder using a ranking mechanism.

However, in the real world, documents are categorized into multiple folders based on varied characteristics (such as multiple folders for different emails or multiple classes for documents). Existing approaches have not used structural relationships within a document for classification and are based on the occurrence of words. Adopting these approaches within the InfoSift framework does not lead to a feasible solution, because InfoSift considers groups of keywords and their relationships with other words. To bridge this gap between the strengths of InfoSift and the issues of multi-folder classification, a different technique needs to be investigated.

Hence, in this thesis, we introduce a new approach that extends InfoSift to support multiple categories (folders). We present a ranking technique that orders the representative (common and recurring) structures generated from pre-classified documents in order to categorize new incoming documents. This approach is based on a global ranking model that incorporates several factors relevant to document classification and overcomes numerous problems that arise when existing approaches are used for multiple-folder classification in the InfoSift system. A number of parameters that influence the generation of representative substructures in single-folder classification are analyzed, re-examined, and adapted to multiple folders. Additional graph representations have been analyzed, and their use has been validated experimentally. Exhaustive experiments substantiating the selection of parameters for classifying unknown documents into multiple folders have been conducted for text, emails, and web pages.
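The general idea of ranking representative structures across several folders can be illustrated with a small toy sketch. This is not the thesis's algorithm: the abstract does not give the graph representation, the mining procedure, or the ranking factors, so everything below is an assumption made for illustration. Here a document "graph" is simply the set of edges between adjacent words, a "representative substructure" is a single edge that recurs in enough documents of a folder, and the only ranking factor is that edge's support. The names `doc_to_edges`, `rank_substructures`, `classify`, and `min_support` are hypothetical.

```python
from collections import Counter


def doc_to_edges(text):
    """Toy graph representation (assumption): one undirected edge per pair of adjacent words."""
    words = text.lower().split()
    return {frozenset(pair) for pair in zip(words, words[1:]) if pair[0] != pair[1]}


def rank_substructures(folders, min_support=0.5):
    """Build one global ranking of recurring edges across all folders.

    An edge is kept for a folder if it occurs in at least `min_support`
    of that folder's documents. Ranking here uses support only; the
    thesis's global ranking model combines several factors not modeled here.
    """
    ranked = []
    for name, docs in folders.items():
        counts = Counter(edge for doc in docs for edge in doc_to_edges(doc))
        for edge, n in counts.items():
            support = n / len(docs)
            if support >= min_support:
                ranked.append((support, name, edge))
    # Highest support first; ties broken by folder name for determinism.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return ranked


def classify(text, ranked):
    """Assign the folder of the highest-ranked substructure found in the document."""
    edges = doc_to_edges(text)
    for _support, name, edge in ranked:
        if edge in edges:
            return name
    return None  # no representative substructure matched


# Pre-classified documents, grouped by folder (made-up examples).
folders = {
    "sports": ["the team won the match", "the team lost the match"],
    "finance": ["the bank raised interest rates", "the bank cut interest rates"],
}
ranked = rank_substructures(folders)
print(classify("our team won again", ranked))
print(classify("interest rates are rising", ranked))
```

Using one ranked list for all folders, rather than a separate list per folder, is what makes the ranking "global" in this sketch: a new document is compared against substructures from every folder in a single pass.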

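Bibliographic record

  • Author

    Venkatachalam, Aravind

  • Author affiliation

    The University of Texas at Arlington, Computer Science & Engineering

  • Degree-granting institution: The University of Texas at Arlington, Computer Science & Engineering
  • Discipline: Computer Science
  • Degree: M.S.
  • Year: 2007
  • Pages: 109 p.
  • Total pages: 109
  • Original format: PDF
  • Language: English
  • Chinese Library Classification: Automation technology, computer technology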