Journal: 《数据采集与处理》 (Journal of Data Acquisition and Processing)

A Data Stream Ensemble Classification Algorithm Based on Tri-training

Abstract

Data stream classification is an important research task in data mining. Most existing data stream classification algorithms are trained on labeled data, yet real-world data streams contain very little of it. Labels can be obtained by manual annotation, but this is expensive and time-consuming. Because unlabeled data are plentiful and carry a great deal of information, this paper proposes a data stream ensemble classification algorithm based on Tri-training that exploits both labeled and unlabeled data while preserving accuracy. The algorithm divides the data stream into chunks with a sliding window and uses Tri-training to train base classifiers on the first k chunks, which contain both labeled and unlabeled data. The classifiers are updated iteratively by weighted voting until every unlabeled instance has been assigned a label. The ensemble of k Tri-training classifiers then predicts chunk k+1; a classifier with a high classification error is discarded, and a new classifier is built on the current chunk to update the model. Experiments on 10 UCI data sets show that, compared with classical algorithms, the proposed algorithm significantly improves classification accuracy on data streams in which 80% of the data are unlabeled.
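The chunk-by-chunk procedure in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under several assumptions, not the paper's implementation: the base learner (`NearestCentroid`) is a stand-in, since the abstract does not name one; pseudo-labels are assigned by plain two-out-of-three agreement into a shared training set rather than by the paper's weighted voting, whose weights the abstract does not give; and the error of each model is measured on the labeled part of the incoming chunk, with only the single worst model replaced. The names `TriModel`, `StreamEnsemble`, and `update` are hypothetical.

```python
import random
from collections import Counter

class NearestCentroid:
    """Tiny stand-in base learner (assumption: the abstract names none)."""
    def fit(self, X, y):
        groups = {}
        for xi, yi in zip(X, y):
            groups.setdefault(yi, []).append(xi)
        self.centroids = {c: [sum(col) / len(pts) for col in zip(*pts)]
                          for c, pts in groups.items()}
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(x, self.centroids[c])))

def majority(votes):
    """Most common label among the votes (unweighted)."""
    return Counter(votes).most_common(1)[0][0]

class TriModel:
    """Three bootstrap-trained learners refined with agreed pseudo-labels."""
    def __init__(self, X_lab, y_lab, X_unl, rng, max_rounds=10):
        X, y, pool = list(X_lab), list(y_lab), list(X_unl)
        self._fit(X, y, rng)
        for _ in range(max_rounds):
            if not pool:
                break                      # every unlabeled point is labeled
            rest = []
            for x in pool:
                votes = Counter(m.predict(x) for m in self.members)
                label, count = votes.most_common(1)[0]
                if count >= 2:             # at least two of three agree
                    X.append(x)
                    y.append(label)
                else:
                    rest.append(x)
            if len(rest) == len(pool):     # no progress, stop early
                break
            pool = rest
            self._fit(X, y, rng)           # retrain on the enlarged set

    def _fit(self, X, y, rng):
        n = len(X)
        self.members = []
        for _ in range(3):
            idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
            self.members.append(NearestCentroid().fit(
                [X[i] for i in idx], [y[i] for i in idx]))

    def predict(self, x):
        return majority(m.predict(x) for m in self.members)

class StreamEnsemble:
    """Keeps at most k TriModels, one built per data chunk."""
    def __init__(self, k, seed=0):
        self.k, self.models, self.rng = k, [], random.Random(seed)

    def predict(self, x):
        return majority(m.predict(x) for m in self.models)

    def update(self, X_lab, y_lab, X_unl):
        """Process one chunk; once k models exist, replace the worst one."""
        if len(self.models) >= self.k:
            def error(m):
                wrong = sum(m.predict(x) != t for x, t in zip(X_lab, y_lab))
                return wrong / len(X_lab)
            self.models.remove(max(self.models, key=error))
        self.models.append(TriModel(X_lab, y_lab, X_unl, self.rng))
```

In use, each chunk would be split into its labeled and unlabeled parts, `ens.predict(x)` would be called on the chunk's instances first, and `ens.update(X_lab, y_lab, X_unl)` would then fold the chunk into the model. Each chunk needs at least one labeled instance, because the sketch estimates model error from the labeled part.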
