Journal: Expert Systems with Applications

Study of Hellinger Distance as a splitting metric for Random Forests in balanced and imbalanced classification datasets


Abstract

Hellinger Distance (HD) is a splitting metric that has been shown to perform excellently on imbalanced classification problems for methods based on Bagging of trees, while also performing well on balanced problems. Given that Random Forests (RF) use Bagging as one of two fundamental techniques to create diversity in the ensemble, HD could be expected to be effective for this ensemble method as well. The main aim of this article is to carry out an extensive investigation of important aspects of the use of HD in RF, including handling of multi-class problems, hyper-parameter optimization, metric comparison, probability estimation, and metric combination. In particular, HD is compared to other commonly used splitting metrics (Gini and Gain Ratio) in several contexts: balanced/imbalanced and two-class/multi-class. Two aspects of classification problems are assessed: classification itself and probability estimation. HD is defined for two-class problems, but there are several ways to extend it to multi-class problems, and this article studies the performance of the available options. Finally, even though HD can be used as an alternative to other splitting metrics, there is no reason to limit RF to just one of them. Therefore, the final study of this article determines whether selecting the splitting metric by cross-validation on the training data can improve results further. Results show HD to be a robust measure for RF, with some weakness on balanced multi-class datasets (especially for probability estimation). Combining metrics yields more robust performance. However, experiments with HD on text datasets show Gini to be more suitable than HD for this kind of problem. (C) 2020 Elsevier Ltd. All rights reserved.
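
The abstract does not state the formula for HD, so the following is only a minimal sketch under an assumption: it uses the common two-class formulation from the Hellinger-distance decision tree literature, where a binary split is scored by the distance between the two class-conditional distributions over the branches. The function name and argument names are hypothetical and are not taken from the article.

```python
from math import sqrt


def hellinger_split_value(pos_left, neg_left, pos_right, neg_right):
    """Score a binary split of a two-class node with the Hellinger distance.

    Assumed formulation (not given in the abstract):
        HD = sqrt( sum over branches b of
                   (sqrt(pos_b / pos_total) - sqrt(neg_b / neg_total))^2 )
    Arguments are the class counts that fall into each branch.
    """
    n_pos = pos_left + pos_right
    n_neg = neg_left + neg_right
    if n_pos == 0 or n_neg == 0:
        # Only one class is present in the node, so there is nothing to separate.
        return 0.0
    total = 0.0
    for pos, neg in ((pos_left, neg_left), (pos_right, neg_right)):
        # Each term compares within-class proportions, not raw counts.
        total += (sqrt(pos / n_pos) - sqrt(neg / n_neg)) ** 2
    return sqrt(total)


# Hypothetical imbalanced node: 10 positives, 990 negatives.
candidate = hellinger_split_value(9, 90, 1, 900)
# Same split with ten times as many negatives in every branch.
rescaled = hellinger_split_value(9, 900, 1, 9000)
print(candidate, rescaled)
```

Because each term depends only on the fraction of each class sent to a branch, multiplying the size of one class leaves the score unchanged up to floating-point rounding. Under this assumed formulation, that is one way to read the abstract's claim that HD suits imbalanced problems: a split that separates the classes completely scores sqrt(2), and a split that sends the same fraction of both classes to each branch scores 0.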
