Text classification on imbalanced data: Application to systematic reviews automation.

Abstract

Systematic review is the basic process of evidence-based medicine, so there is an urgent need for tools that assist with, and eventually automate, a large part of this process. In a traditional systematic review system, reviewers or domain experts manually classify articles as relevant or irrelevant through a series of systematic review levels. In our work with TrialStat, we apply text classification techniques to a systematic review system in order to minimize the human effort needed to identify relevant articles. In most cases, the relevant articles are a small portion of the Medline corpus. The first essential requirement for this task is high recall on those relevant articles. We also face two technical challenges: handling imbalanced data and reducing the size of the labeled training set.

To address these issues, we first study the feature selection and sample selection biases caused by skewed data. We then experiment with different feature selection, sample selection, and classification methods to find those that handle these problems properly. To minimize the size of the labeled training set, we also experiment with active learning techniques. Active learning selects the most informative instances for labeling, so that fewer training examples are required while performance is maintained. Using an active learning technique, we saved 86% of the effort required to label the training examples. The best test result was obtained by combining the Modified BNS feature selection method, clustering-based sample selection, and active learning, with Naive Bayes as the classifier. We achieved 100% recall on the minority class with an overall accuracy of 58.43%. With a work saved over sampling (WSS) of 53.4%, we saved about half of the reviewers' workload.
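The relationship between the three reported figures (recall, accuracy, and WSS) can be checked with a short calculation. The sketch below assumes the definition of WSS commonly used in the systematic review literature, WSS = (TN + FN) / N - (1 - recall), with recall measured on the relevant (minority) class; the abstract does not spell out the formula, so this is an assumption. The confusion-matrix counts are invented for illustration only and are not taken from the thesis; they were chosen so that the resulting percentages match those quoted above.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of relevant articles that the classifier retrieves."""
    if tp + fn == 0:
        raise ValueError("no relevant articles in the evaluation set")
    return tp / (tp + fn)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all articles that are classified correctly."""
    n = tp + tn + fp + fn
    if n == 0:
        raise ValueError("empty evaluation set")
    return (tp + tn) / n


def work_saved_over_sampling(tp: int, tn: int, fp: int, fn: int) -> float:
    """WSS = (TN + FN) / N - (1 - recall).

    (TN + FN) / N is the share of articles the reviewers no longer read;
    (1 - recall) is the share a random sample would save at the same recall.
    Assumed definition; the abstract does not state the formula.
    """
    n = tp + tn + fp + fn
    if n == 0:
        raise ValueError("empty evaluation set")
    return (tn + fn) / n - (1.0 - recall(tp, fn))


if __name__ == "__main__":
    # Hypothetical counts for 10,000 articles (NOT from the thesis).
    tp, tn, fp, fn = 503, 5340, 4157, 0
    print(f"recall   = {recall(tp, fn):.0%}")                              # 100%
    print(f"accuracy = {accuracy(tp, tn, fp, fn):.2%}")                    # 58.43%
    print(f"WSS      = {work_saved_over_sampling(tp, tn, fp, fn):.1%}")    # 53.4%
```

Under this definition, when recall is 100% the WSS equals the share of articles predicted irrelevant. Together with the reported accuracy, that implies the relevant class makes up roughly 5% of the test set, which is consistent with the imbalance the abstract describes.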

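Bibliographic record

  • Author

    Ma, Yimin

  • Author affiliation

    University of Ottawa (Canada)

  • Degree-granting institution University of Ottawa (Canada)
  • Subject Computer Science
  • Degree M.C.S.
  • Year 2007
  • Pages 97 p.
  • Total pages 97
  • Original format PDF
  • Language of text eng
  • CLC classification Automation technology, computer technology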
