Text Document Categorization using Enhanced Sentence Vector Space Model and Bi-Gram Text Representation Model Based on Novel Fusion Techniques

Abdisa Demissie Amensisa

Abstract

Text document classification falls under the automatic classification (also known as pattern recognition) problem in machine learning and text mining. Large text documents need to be classified into specific classes to make them clear and simple to search; classified data are also easy for users to browse. The central issue in conventional text document classification is how to represent the features used to assign an unknown document to predefined categories. Classifiers are combined (fused) to increase the classification accuracy for a single text document. This paper presents a novel fusion approach that classifies text documents using the ES-VSM and bigram representation models. ES-VSM (Enhanced Sentence Vector Space Model) is an advanced form of the sentence-based vector space model and an extension of the simple VSM; it is considered for the constructive representation of text documents. The main objective of the study is to boost the accuracy of text classification by accounting for the features extracted from the text document. The proposed system concatenates two different representation models of the text documents, used for designing two different classifiers, and feeds them as one input to the classifier. An enhanced S-VSM and an interval-valued representation model are considered for the effective representation of text documents. A word-level neural network bigram representation of text documents is proposed to capture the semantic information present in the text data effectively. The proposed approach improves the overall accuracy of text document classification to a significant extent.
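The idea of concatenating two representations into a single classifier input can be illustrated with a minimal sketch. It is not the authors' method: plain word counts stand in for ES-VSM, plain bigram counts stand in for the word-level neural network bigram model, and every function name (`tokenize`, `bigrams`, `build_vocab`, `vectorize`, `fused_vector`) and the two sample documents are hypothetical, chosen only for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def bigrams(tokens):
    """Return adjacent word pairs, each joined with a single space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def build_vocab(docs, extractor):
    """Map each distinct feature found in docs to a column index."""
    features = sorted({f for d in docs for f in extractor(tokenize(d))})
    return {f: i for i, f in enumerate(features)}


def vectorize(doc, vocab, extractor):
    """Count vector of doc over vocab; features outside vocab are ignored."""
    counts = Counter(extractor(tokenize(doc)))
    vec = [0] * len(vocab)
    for feat, n in counts.items():
        if feat in vocab:
            vec[vocab[feat]] = n
    return vec


def fused_vector(doc, word_vocab, bigram_vocab):
    """Concatenate the word-level and bigram-level vectors into one input."""
    word_part = vectorize(doc, word_vocab, lambda tokens: tokens)
    bigram_part = vectorize(doc, bigram_vocab, bigrams)
    return word_part + bigram_part


# Hypothetical two-document training collection.
docs = ["stock markets fell sharply", "the team won the match"]
word_vocab = build_vocab(docs, lambda tokens: tokens)
bigram_vocab = build_vocab(docs, bigrams)

# One fused feature vector, ready to pass to any single classifier.
x = fused_vector("the team won", word_vocab, bigram_vocab)
print(len(x), x)
```

The fused vector has one column per word plus one column per bigram, so a single downstream classifier sees both views of the document at once. The interval-valued representation mentioned in the abstract is not modelled here.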

