A statistical approach for information extraction of biological relationships.

Abstract

Vast amounts of biomedical information are stored in the scientific literature and are easily accessed through publicly available databases. Relationships among biomedical terms constitute a major part of our biological knowledge. Acquiring such structured information from unstructured literature can be done through human annotation, but this is time- and resource-consuming. As this content continues to grow rapidly, the popularity and importance of text mining for obtaining information from unstructured text become increasingly evident. Text mining has four major components. First, relevant articles are identified through information retrieval (IR); next, important concepts and terms are flagged using entity recognition (ER); then, relationships between these entities are extracted from the literature in a process called information extraction (IE). Finally, text mining takes these elements and seeks to synthesize new information from the literature.

Our goal is information extraction from unstructured literature concerning biological entities. To do this, we use a triplet structure in which each triplet contains two biological entities and one interaction word. The biological entities may include terms such as protein names, disease names, genes, and small molecules. Interaction words describe the relationship between the biological terms. Under this framework, we aim to combine the strengths of three classifiers in an ensemble approach. The three classifiers we consider are Bayesian networks, support vector machines, and a mixture of logistic models defined by interaction word.

The three classifiers and the ensemble approach are evaluated on three benchmark corpora and one corpus that is introduced in this study. The evaluation includes cross-validation and cross-corpus validation to replicate an application scenario. The three classifiers are distinct, and we find that the performance of individual classifiers varies depending on the corpus. Therefore, an ensemble of classifiers removes the need to choose one classifier and provides optimal performance.
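The triplet structure and the ensemble idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the three classifiers are combined, so the unweighted averaging of scores used here is an assumption, not the thesis's method. The names `Triplet`, `ensemble_score`, and `ensemble_predict`, the example entities, and the constant stand-in scorers are all hypothetical; real scorers would be a trained Bayesian network, a support vector machine, and a mixture of logistic models.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class Triplet:
    """Two biological entities linked by one interaction word."""
    entity_a: str
    interaction_word: str
    entity_b: str


# A scorer maps a triplet to an estimated probability that the
# relationship it describes is genuine (hypothetical interface).
Scorer = Callable[[Triplet], float]


def ensemble_score(triplet: Triplet, scorers: Sequence[Scorer]) -> float:
    """Average the scores of all classifiers (assumed combination rule)."""
    return mean(scorer(triplet) for scorer in scorers)


def ensemble_predict(triplet: Triplet, scorers: Sequence[Scorer],
                     threshold: float = 0.5) -> bool:
    """Accept the triplet when the averaged score reaches the threshold."""
    return ensemble_score(triplet, scorers) >= threshold


# Illustrative triplet with placeholder entity names.
example = Triplet("ProteinA", "activates", "ProteinB")

# Constant stand-ins for the three trained classifiers; the values are
# invented for illustration and carry no experimental meaning.
stand_in_scorers = [
    lambda t: 0.75,  # stand-in for a Bayesian network
    lambda t: 0.5,   # stand-in for a support vector machine
    lambda t: 1.0,   # stand-in for a mixture of logistic models
]

if __name__ == "__main__":
    print(ensemble_score(example, stand_in_scorers))
    print(ensemble_predict(example, stand_in_scorers))
```

Averaging is only one possible combination rule; majority voting or learned weights would fit the same interface by changing `ensemble_score`.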

Bibliographic record

  • Author

    Bell, Lindsey R.

  • Author affiliation

    The Florida State University

  • Degree-granting institution: The Florida State University
  • Subject: Biology Biostatistics; Biology Bioinformatics; Statistics
  • Degree: Ph.D.
  • Year: 2011
  • Pages: 88 p.
  • Total pages: 88
  • Original format: PDF
  • Language of text: eng
