BMC Bioinformatics

Interactive knowledge discovery and data mining on genomic expression data with numeric formal concept analysis


Abstract

Background: Gene Expression Data (GED) analysis poses a great challenge to the scientific community, one that can be framed within the Knowledge Discovery in Databases (KDD) and Data Mining (DM) paradigm. Biclustering has emerged as the machine learning method of choice for this task, but its unsupervised nature makes result assessment problematic. This is often addressed by means of Gene Set Enrichment Analysis (GSEA).

Results: We put forward a framework in which GED analysis is understood as an Exploratory Data Analysis (EDA) process, and in which we support continuous human interaction with the data in order to improve the step of hypothesis abduction and assessment. We focus on adapting data interpretation and the visualization of EDA output to human cognition. First, we give a proper theoretical background to biclustering using Lattice Theory and provide a set of analysis tools revolving around \(\mathcal{K}\)-Formal Concept Analysis (\(\mathcal{K}\)-FCA), a lattice-theoretic unsupervised learning technique for real-valued matrices. By using different kinds of cost structures to quantify expression, we obtain different sequences of hierarchical biclusterings for gene under- and over-expression using thresholds. Consequently, we provide a method with interleaved analysis steps and visualization devices, so that the sequences of lattices for a particular experiment summarize the researcher's vision of the data. This also allows us to define measures of persistence and robustness with which to assess biclusters. Second, the resulting biclusters are used to index external omics databases, for instance Gene Ontology (GO), thus offering a new way of accessing publicly available resources. This provides different flavors of gene set enrichment against which to assess the biclusters, by obtaining their p-values according to the terminology of those resources. We illustrate the exploration procedure on a real data example, confirming previously published results.
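As a rough illustration of how thresholds turn a real-valued expression matrix into a sequence of lattices of biclusters, the sketch below binarizes a toy matrix at a threshold and enumerates the formal concepts of the resulting binary context. It is a simplified stand-in that uses ordinary binary FCA, not the \(\mathcal{K}\)-FCA construction of the paper. The names `binarize`, `extent`, `intent`, `concepts`, the threshold symbol `phi`, and the toy data are all assumptions made for this example.

```python
from itertools import combinations

def binarize(matrix, phi):
    """Build a binary context: gene g is related to condition c when
    its expression value is at least the threshold phi (over-expression)."""
    return [[value >= phi for value in row] for row in matrix]

def extent(context, conditions):
    """Genes related to every condition in `conditions`."""
    return frozenset(g for g, row in enumerate(context)
                     if all(row[c] for c in conditions))

def intent(context, genes):
    """Conditions shared by every gene in `genes`."""
    n_cond = len(context[0])
    return frozenset(c for c in range(n_cond)
                     if all(context[g][c] for g in genes))

def concepts(context):
    """Enumerate all formal concepts (extent, intent) naively by closing
    every subset of conditions. Exponential; fine only for tiny examples."""
    n_cond = len(context[0])
    found = set()
    for k in range(n_cond + 1):
        for subset in combinations(range(n_cond), k):
            genes = extent(context, subset)
            found.add((genes, intent(context, genes)))
    return found

# Hypothetical toy data: rows are genes, columns are conditions.
expression = [
    [2.1, 1.8, 0.1],
    [1.9, 0.2, 0.3],
    [0.0, 1.7, 2.2],
]

# One set of concepts (candidate biclusters) per threshold.
lattices = {phi: concepts(binarize(expression, phi)) for phi in (0.5, 2.0)}
```

Each concept pairs a maximal set of genes with the maximal set of conditions in which all of them exceed the threshold, which is what makes it usable as a bicluster. Comparing the concept sets obtained at successive thresholds is one simple way to see which biclusters persist as the threshold changes.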
Conclusions: The GED analysis problem is transformed into the exploration of a sequence of lattices, which enables visualization of the hierarchical structure of the biclusters at a chosen degree of granularity. The ability of FCA-based biclustering methods to index external databases such as GO lets us obtain a quality measure for the biclusters, follow a gene through the different biclusters in which it appears, and look for relevant biclusters by examining their genes and their persistence, for instance to infer hypotheses about their function.
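The abstract does not say which statistical test produces the enrichment p-values. A common choice for over-representation analysis is the upper tail of the hypergeometric distribution, and the sketch below assumes that test. The function name `enrichment_p_value`, its parameters, and the numbers in the usage line are hypothetical.

```python
from math import comb

def enrichment_p_value(n_universe, n_annotated, n_bicluster, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_universe  : number of genes in the experiment
    n_annotated : genes in the universe carrying the term (e.g. a GO term)
    n_bicluster : genes in the bicluster being assessed
    n_overlap   : bicluster genes that carry the term
    """
    upper = min(n_annotated, n_bicluster)
    total = comb(n_universe, n_bicluster)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_bicluster - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 20 genes, 5 annotated with a term,
# a bicluster of 4 genes, 3 of which carry the term.
p = enrichment_p_value(20, 5, 4, 3)
```

A small p-value suggests the bicluster contains more genes with the term than random sampling would explain. When many terms are tested per bicluster, a multiple-testing correction would normally be applied on top of this.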
