Classifier design to improve pattern classification and knowledge discovery for imbalanced datasets.

Abstract

Mining imbalanced datasets is a nontrivial problem. It has extensive applications in a variety of fields, such as scientific research, medical diagnosis, business, and industry. Standard machine learning algorithms fail to produce satisfactory classifiers for such data: they tend to over-fit the larger class and ignore the smaller class.

Numerous algorithms have been developed to handle class imbalance, but only limited progress has been achieved in improving prediction accuracy for the smaller class. Moreover, real-world datasets may have hidden detrimental characteristics other than class imbalance. These characteristics are usually dataset specific, and they can defeat algorithms that are robust on other imbalanced datasets. Mining such datasets can only be improved by algorithms tailored to domain characteristics (Weiss, 2004); it is therefore important to perform exploratory data analysis before classifier design. In addition, unmet needs in knowledge discovery, such as lead optimization during drug discovery, demand novel algorithms.

In this study, we developed a framework for imbalanced dataset mining that is tailored to data characteristics and adapted to knowledge discovery in chemical datasets. We first explored the dataset and visualized its domain characteristics, and then designed different classifiers accordingly. For class imbalance, active learning (AL), cost-sensitive learning (CSL), and re-sampling methods were designed; for class overlap, Class Boundary Cleaning (CBC) and Class Boundary Mining (CBM) were developed. CBM was also designed for lead optimization: ideally, it would detect fine structural differences between different classes of compounds, and these differences could serve as options for lead optimization.

The methods developed were applied to two datasets, hERG and CPDB. The results from the imbalanced hERG liability dataset showed that CBC, CBM, and AL were effective in correcting class imbalance and class overlap and in improving the classifier's performance. Highly predictive models were built; discriminating patterns were discovered; and lead optimization options were proposed. The methodology developed and the knowledge discovered will benefit drug discovery and improve hazard test prioritization, risk assessment, and governmental regulatory work on human health and environmental protection.

Keywords: QSAR, applicability domain, outliers, data mining, data visualization, class imbalance, class overlap, sampling, cost-sensitive learning, class boundary cleaning, class boundary mining, active learning.
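For readers unfamiliar with the general techniques named in the abstract, the sketch below illustrates two generic ideas: inverse-frequency class weights (one common way to set costs in cost-sensitive learning) and random oversampling (one simple re-sampling method). It is not the dissertation's implementation, and it does not attempt CBC, CBM, or active learning, whose details are not given here. The function names, the toy data, and the "active"/"inactive" labels are assumptions made for illustration only.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-class weights n_samples / (n_classes * class_count).

    Rarer classes receive larger weights, so errors on them cost more.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, c in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - c):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


# Hypothetical toy data: 8 "inactive" compounds versus 2 "active" ones.
X = [[float(i)] for i in range(10)]
y = ["inactive"] * 8 + ["active"] * 2

weights = inverse_frequency_weights(y)   # larger weight for "active"
X_bal, y_bal = random_oversample(X, y)   # both classes now have 8 rows
```

Random oversampling only duplicates existing minority rows, so it changes the class ratio without adding new information; that is one reason the abstract argues that class imbalance alone is not the whole problem when classes also overlap.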

Bibliographic record

  • Author

    Wang, Kun

  • Author affiliation

    The University of North Carolina at Chapel Hill

  • Degree-granting institution: The University of North Carolina at Chapel Hill
  • Subject: Chemistry, Pharmaceutical; Health Sciences, Pharmacy
  • Degree: Ph.D.
  • Year: 2009
  • Pages: 204 p.
  • Total pages: 204
  • Original format: PDF
  • Language of text: English