Journal: PLoS One

The influence of preprocessing on text classification using a bag-of-words representation



Abstract

Text classification (TC) is the task of automatically assigning documents to a fixed number of categories. TC is an important component of many text applications, and many of these applications perform preprocessing. There are different types of text preprocessing, e.g., conversion of uppercase letters into lowercase letters, HTML tag removal, stopword removal, punctuation mark removal, lemmatization, correction of commonly misspelled words, and reduction of replicated characters. We hypothesize that applying different combinations of preprocessing methods can improve TC results. Therefore, we performed an extensive and systematic set of TC experiments (this is our main research contribution) to explore the impact of all possible combinations of five or six basic preprocessing methods on four benchmark text corpora (the full corpora, not samples of them), using three machine learning (ML) methods and separate training and test sets. The general conclusion (at least for the datasets examined) is that it is always advisable to test an extensive and systematic variety of preprocessing combinations in TC experiments, because doing so helps improve TC accuracy. For all the tested datasets, there was always at least one combination of basic preprocessing methods that could be recommended to significantly improve TC using a bag-of-words (BOW) representation. For three datasets, stopword removal was the only single preprocessing method that yielded a significant improvement over the baseline result obtained with a bag of 1,000 word unigrams. For some of the datasets, there was minimal improvement when we removed HTML tags, performed spelling correction, removed punctuation marks, or reduced replicated characters. For the fourth dataset, however, stopword removal was not beneficial. Instead, conversion of uppercase letters into lowercase letters was the only single preprocessing method that demonstrated a significant improvement over the baseline result. The best result for this dataset was obtained when we performed spelling correction and conversion into lowercase letters. In general, for all the datasets processed, at least one combination of basic preprocessing methods could be recommended to improve accuracy when using a bag-of-words representation.
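
The experimental design described in the abstract (every combination of basic preprocessing methods, each followed by a bag-of-words representation) can be illustrated with a minimal sketch. It is not the authors' code. The function names, the tiny stopword list, the fixed order in which steps are applied, and the two sample documents are all assumptions made for illustration. Spelling correction is left out because it needs an external dictionary, and no classifier is trained; the sketch only enumerates the combinations and builds the unigram count vectors that a classifier would consume.

```python
import itertools
import re
import string
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the list used in the study).
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def remove_html_tags(text):
    return re.sub(r"<[^>]+>", " ", text)

def lowercase(text):
    return text.lower()

def reduce_replicated_chars(text):
    # Collapse runs of 3 or more identical characters into one, e.g. "sooo" -> "so".
    return re.sub(r"(.)\1{2,}", r"\1", text)

def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

# Steps are applied in this fixed order (an assumption): HTML removal must
# run before punctuation removal, otherwise "<p>" would degrade to "p".
STEPS = [
    ("html", remove_html_tags),
    ("lower", lowercase),
    ("dedup", reduce_replicated_chars),
    ("punct", remove_punctuation),
    ("stop", remove_stopwords),
]

def all_combinations(steps):
    """Yield every subset of steps, starting with the empty baseline."""
    for r in range(len(steps) + 1):
        yield from itertools.combinations(steps, r)

def preprocess(text, combo):
    for _name, fn in combo:
        text = fn(text)
    return text

def bag_of_words(docs, vocab_size=1000):
    """Return the most frequent unigrams and one count vector per document."""
    counts = Counter(w for d in docs for w in d.split())
    vocab = [w for w, _ in counts.most_common(vocab_size)]
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        vec = [0] * len(vocab)
        for w in d.split():
            if w in index:
                vec[index[w]] += 1
        vectors.append(vec)
    return vocab, vectors

# Hypothetical sample documents, for illustration only.
docs = ["<p>The movie is GREAT!!!</p>", "the plot is sooo boring..."]
for combo in all_combinations(STEPS):
    name = "+".join(n for n, _ in combo) or "baseline"
    cleaned = [preprocess(d, combo) for d in docs]
    vocab, vectors = bag_of_words(cleaned)
    print(f"{name}: {len(vocab)} distinct unigrams")
```

With five steps the loop visits 2^5 = 32 combinations, including the unprocessed baseline. In a full experiment, each set of vectors would be passed to a classifier and scored on a held-out test set so that every combination can be compared against the baseline.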
