Journal: Knowledge-Based Systems

Tutorial on practical tips of the most influential data preprocessing algorithms in data mining

Abstract

Data preprocessing is a major and essential stage whose main goal is to obtain final data sets that can be considered correct and useful for further data mining algorithms. This paper summarizes the most influential data preprocessing algorithms according to their usage, popularity and extensions proposed in the specialized literature. For each algorithm, we provide a description, a discussion of its impact, and a review of current and further research on it. These most influential algorithms cover missing values imputation, noise filtering, dimensionality reduction (including feature selection and space transformations), instance reduction (including selection and generation), discretization and treatment of data for imbalanced preprocessing. They are all among the most important topics in data preprocessing research and development. This paper emphasizes the most well-known preprocessing methods and their practical study, selected after a recent, generic book on data preprocessing that does not cover them in depth. This manuscript also presents an illustrative study, in two sections with different data sets, that provides useful tips for the use of preprocessing algorithms. First, we graphically present the effects of the preprocessing methods on two benchmark data sets. The reader may find useful insights into the different characteristics and outcomes generated by them. Second, we use a real-world problem presented in the ECBDL'2014 Big Data competition to provide a thorough analysis of the application of some preprocessing techniques, their combination and their performance. As a result, five different cases are analyzed, providing tips that may be useful for readers. (C) 2015 Elsevier B.V. All rights reserved.