The influence of preprocessing on text classification using a bag-of-words representation

Yaakov HaCohen-Kerner; Daniel Miller; Yair Yigal

Abstract

Text classification (TC) is the task of automatically assigning documents to a fixed number of categories. TC is an important component of many text applications, and many of these applications perform preprocessing. There are different types of text preprocessing, e.g., conversion of uppercase letters into lowercase letters, HTML tag removal, stopword removal, punctuation mark removal, lemmatization, correction of commonly misspelled words, and reduction of replicated characters. We hypothesize that applying different combinations of preprocessing methods can improve TC results. Therefore, we performed an extensive and systematic set of TC experiments (this is our main research contribution) to explore the impact of all possible combinations of five/six basic preprocessing methods on four benchmark text corpora (the full corpora, not samples of them) using three machine learning (ML) methods and training and test sets. The general conclusion (at least for the datasets verified) is that it is always advisable to test an extensive and systematic variety of preprocessing methods in combination with TC experiments, because doing so contributes to improving TC accuracy. For all the tested datasets, there was always at least one combination of basic preprocessing methods that could be recommended to significantly improve TC using a bag-of-words (BOW) representation. For three datasets, stopword removal was the only single preprocessing method that enabled a significant improvement compared to the baseline result using a bag of 1,000-word unigrams. For some of the datasets, there was minimal improvement when we removed HTML tags, performed spelling correction or removed punctuation marks, and reduced replicated characters. For the fourth dataset, however, stopword removal was not beneficial. Instead, the conversion of uppercase letters into lowercase letters was the only single preprocessing method that demonstrated a significant improvement compared to the baseline result. The best result for this dataset was obtained when we performed spelling correction and conversion into lowercase letters. In general, for all the datasets processed, there was always at least one combination of basic preprocessing methods that could be recommended to improve the accuracy results when using a bag-of-words representation.
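
To make the experimental setup described in the abstract more concrete, the following is a minimal sketch of how all combinations of a few basic preprocessing methods could be enumerated and fed into a bag-of-words representation. It is not the authors' implementation. The function names, the tiny `STOPWORDS` set, the fixed order in which steps are applied, and the choice of exactly five steps are all assumptions made for illustration; spelling correction is left out because it would need an external dictionary.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately tiny stopword list (assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def remove_html_tags(text):
    # Replace anything that looks like an HTML tag with a space.
    return re.sub(r"<[^>]+>", " ", text)

def lowercase(text):
    return text.lower()

def remove_punctuation(text):
    # Replace every character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", " ", text)

def reduce_replicated_chars(text):
    # Collapse runs of three or more identical characters to two.
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

# Steps are always applied in this fixed order (an assumption of this sketch).
STEPS = [
    ("html", remove_html_tags),
    ("lower", lowercase),
    ("punct", remove_punctuation),
    ("repeat", reduce_replicated_chars),
    ("stop", remove_stopwords),
]

def all_combinations():
    """Yield every subset of step names, including the empty baseline."""
    names = [name for name, _ in STEPS]
    for size in range(len(names) + 1):
        yield from combinations(names, size)

def preprocess(text, enabled):
    """Apply only the enabled steps, in the fixed order defined by STEPS."""
    for name, func in STEPS:
        if name in enabled:
            text = func(text)
    return text

def bag_of_words(docs, vocab_size=1000):
    """Build count vectors over the vocab_size most frequent unigrams."""
    counts = Counter(word for doc in docs for word in doc.split())
    vocab = [word for word, _ in counts.most_common(vocab_size)]
    vectors = []
    for doc in docs:
        doc_counts = Counter(doc.split())
        vectors.append([doc_counts[word] for word in vocab])
    return vocab, vectors
```

With five steps, `all_combinations()` yields 32 configurations, the empty one being the no-preprocessing baseline. Each configuration would then be passed to `preprocess` for every document, the result vectorized with `bag_of_words`, and a classifier trained and evaluated on the resulting vectors; the classifier itself is outside the scope of this sketch.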

Bibliographic record

Source
PLoS One, 2020, Issue 5, 22 pages
Authors
Yaakov HaCohen-Kerner; Daniel Miller; Yair Yigal
Original format
PDF
Chinese Library Classification
Medicine, Health
