Journal: Quality Control, Transactions

Document-Level Text Classification Using Single-Layer Multisize Filters Convolutional Neural Network

Abstract

The rapid growth of electronic documents is producing large volumes of unstructured data, which makes searching for a relevant document more time-consuming and laborious. Text Document Classification (TDC), in which unstructured documents are organized into pre-defined classes, is of great significance in information processing and retrieval. Urdu is a favored research language among South Asian languages because of its complex morphology, unique features, and lack of linguistic resources such as standard datasets. Compared with short-text tasks such as sentiment analysis, long-text classification needs more time and effort because of its large vocabulary, greater noise, and redundant information. Machine Learning (ML) and Deep Learning (DL) models have been widely used in text processing. Despite the major limitations of ML models, such as learning directed features, they remain the favored methods for Urdu TDC. To the best of our knowledge, this is the first study of Urdu TDC using a DL model. In this paper, we design a large multi-purpose, multi-format dataset that contains more than ten thousand documents organized into six classes. We use a Single-layer Multisize Filters Convolutional Neural Network (SMFCNN) for classification and compare its performance with sixteen ML baseline models on three imbalanced datasets of various sizes. Further, we analyze the effects of preprocessing methods on SMFCNN performance. SMFCNN outperformed the baseline classifiers and achieved accuracies of 95.4%, 91.8%, and 93.3% on the medium, large, and small datasets, respectively. The designed dataset will be publicly and freely available in different formats for future research in Urdu text processing.
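
The abstract names the architecture but does not give its details, so the following is only a minimal sketch of what a single convolutional layer with filters of several widths might look like: each filter slides over the embedded tokens, a ReLU and max-over-time pooling reduce it to one feature, and a softmax layer maps the pooled features to six classes. The class name `MultiSizeFilterCNN`, the filter widths (3, 4, 5), the embedding size, the toy vocabulary, and the random untrained weights are all assumptions made for illustration; only the single-layer, multi-size-filter idea and the six classes come from the abstract.

```python
import math
import random


def conv_relu_maxpool(embedded, filt, bias):
    """Slide one filter (k rows x d cols) over a document (n x d),
    apply ReLU, and keep the maximum activation (max-over-time pooling).
    Returns 0.0 if the document is shorter than the filter."""
    k = len(filt)
    best = 0.0  # ReLU floor: pooled activations are never negative
    for start in range(len(embedded) - k + 1):
        s = bias
        for i in range(k):
            for x, w in zip(embedded[start + i], filt[i]):
                s += x * w
        best = max(best, s)
    return best


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class MultiSizeFilterCNN:
    """Hypothetical forward pass only; weights are random, not trained."""

    def __init__(self, vocab_size, embed_dim, filter_sizes,
                 filters_per_size, num_classes, seed=0):
        rng = random.Random(seed)

        def rand():
            return rng.uniform(-0.1, 0.1)

        self.embedding = [[rand() for _ in range(embed_dim)]
                          for _ in range(vocab_size)]
        # One convolutional layer holding filters of several widths.
        self.filters = [
            [[rand() for _ in range(embed_dim)] for _ in range(k)]
            for k in filter_sizes
            for _ in range(filters_per_size)
        ]
        self.filter_bias = [0.0] * len(self.filters)
        self.out_w = [[rand() for _ in self.filters]
                      for _ in range(num_classes)]
        self.out_b = [0.0] * num_classes

    def features(self, token_ids):
        """One pooled feature per filter, concatenated across widths."""
        embedded = [self.embedding[t] for t in token_ids]
        return [conv_relu_maxpool(embedded, f, b)
                for f, b in zip(self.filters, self.filter_bias)]

    def predict_proba(self, token_ids):
        feats = self.features(token_ids)
        scores = [sum(w * x for w, x in zip(row, feats)) + b
                  for row, b in zip(self.out_w, self.out_b)]
        return softmax(scores)


# Toy usage: a 10-token "document" of made-up token ids, six classes.
model = MultiSizeFilterCNN(vocab_size=50, embed_dim=8,
                           filter_sizes=(3, 4, 5), filters_per_size=2,
                           num_classes=6)
doc = [3, 17, 42, 8, 8, 21, 5, 30, 11, 2]
probs = model.predict_proba(doc)
```

Because every filter is pooled to a single value, the feature vector has the same length regardless of document length, which is one reason this kind of layer is convenient for long documents. A practical version would train the weights with a deep learning framework instead of using random values.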