Journal: Information Systems Frontiers

A comparison of improving multi-class imbalance for internet traffic classification


Abstract

To date, most research on class imbalance has focused on two-class problems. Multi-class imbalance is more complicated, and there is little knowledge of or experience with it in Internet traffic classification. In this paper we study the challenges posed by machine-learning-based Internet traffic classification on multi-class imbalanced data, and the ability of several adjustment methods to address them: resampling (random under-sampling and random over-sampling) and cost-sensitive learning. We then empirically compare the effectiveness of these methods and determine which produces the better overall classifier, and under what circumstances. The main contributions are as follows. (1) Cost-sensitive learning is derived with MetaCost, which incorporates misclassification costs into the learning algorithm, to improve multi-class imbalance based on flow ratio. (2) A new resampling model, including under-sampling and over-sampling, is presented to make the multi-class training data more balanced. (3) A procedure is presented for comparing the three methods with one another and with the original case. Experimental results on sixteen datasets show that, when the three methods are compared with the original case, flow g-mean and byte g-mean increase statistically by 8.6% and 3.7%; 4.4% and 2.8%; and 11.1% and 8.2%. Cost-sensitive learning is the first choice when the sample size is sufficient; otherwise, resampling is more practical.
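The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation, and it rests on several assumptions: g-mean is taken to be the geometric mean of per-class recall (a common multi-class definition), byte g-mean is assumed to be the same measure with each flow weighted by its byte count, and the cost-sensitive step shows only the minimum-expected-cost decision rule that MetaCost uses for relabelling, not the full bagging procedure. The function names, the `target` parameter, and the layout of the cost table are hypothetical.

```python
import math
import random
from collections import defaultdict


def g_mean(y_true, y_pred, weights=None):
    """Geometric mean of per-class recall.

    With weights=None every flow counts once (flow g-mean). Passing
    per-flow byte counts as weights gives a byte-weighted variant
    (assumed here to correspond to byte g-mean).
    """
    if weights is None:
        weights = [1.0] * len(y_true)
    total = defaultdict(float)
    correct = defaultdict(float)
    for t, p, w in zip(y_true, y_pred, weights):
        total[t] += w
        if t == p:
            correct[t] += w
    recalls = [correct[c] / total[c] for c in total]
    return math.prod(recalls) ** (1.0 / len(recalls))


def random_resample(samples, labels, target, seed=0):
    """Bring every class to `target` examples.

    Classes larger than `target` are randomly under-sampled (without
    replacement); smaller classes are randomly over-sampled (with
    replacement). `target` is a hypothetical parameter for this sketch.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    out_x, out_y = [], []
    for y, xs in by_class.items():
        if len(xs) >= target:
            chosen = rng.sample(xs, target)
        else:
            chosen = xs + rng.choices(xs, k=target - len(xs))
        out_x.extend(chosen)
        out_y.extend([y] * target)
    return out_x, out_y


def min_expected_cost_class(probs, cost):
    """Pick the class with the lowest expected misclassification cost.

    probs maps class -> estimated probability; cost maps
    (predicted, true) -> cost, with missing pairs treated as 0.
    The cost values themselves are illustrative, not the paper's.
    """
    def expected_cost(pred):
        return sum(p * cost.get((pred, true), 0.0) for true, p in probs.items())
    return min(probs, key=expected_cost)
```

For example, with four "web" flows all classified correctly and one of two "p2p" flows misclassified, the per-class recalls are 1.0 and 0.5, so `g_mean` returns the square root of 0.5 even though plain accuracy is 5/6. This is why g-mean is a more informative measure than accuracy on imbalanced data.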
