
Predictive models of gene regulation.


Abstract

The regulation of gene expression plays a central role in the development and function of a living cell. A complex network of interacting regulatory proteins binds specific sequence elements in the genome to control the amount and timing of gene expression. The abundance of genome-scale datasets from different organisms provides an opportunity to accelerate our understanding of the mechanisms of gene regulation. Developing computational tools to infer gene regulation programs from high-throughput genomic data is one of the central problems in computational biology.

In this thesis, we present a new predictive modeling framework for studying gene regulation. We formulate the problem of learning regulatory programs as a binary classification task: to accurately predict the condition-specific activation (up-regulation) and repression (down-regulation) of gene expression. The gene expression response is measured by microarray expression data. Genes are represented by various genomic regulatory sequence features. Experimental conditions are represented by the gene expression levels of various regulatory proteins. We use this combination of features to learn a prediction function for the regulatory response of genes under different experimental conditions. The core computational approach is based on boosting. Boosting algorithms allow us to learn high-accuracy, large-margin classifiers and avoid overfitting. We describe three applications of our framework to study gene regulation: (1) In the GeneClass algorithm, we use a compendium of known transcription factor binding sites and gene expression data to learn a global context-specific regulation program that accurately predicts differential expression. GeneClass learns a prediction function in the form of an alternating decision tree, a margin-based generalization of a decision tree. We introduce a novel robust variant of boosting that improves stability and biological interpretability in the presence of correlated features. We also show how to incorporate genome-wide protein-DNA binding data from ChIP-chip experiments into the framework. (2) In several organisms, the DNA binding sites of many transcription factors are unknown. Hence, automatic discovery of regulatory sequence motifs is required. In the MEDUSA algorithm, we integrate raw promoter sequence data and gene expression data to simultaneously discover cis-regulatory motifs ab initio and learn predictive regulatory programs. MEDUSA automatically learns probabilistic representations of motifs and their corresponding target genes. We show that we are able to accurately learn the binding sites of most known transcription factors in yeast. (3) We also design new techniques for extracting biologically and statistically significant information from the learned regulatory models. We use a margin-based score to extract global condition-specific regulomes as well as cluster-specific and gene-specific regulation programs. We develop a post-processing framework for interpreting and visualizing biological information encapsulated in our models.

We show the utility of our framework in analyzing several interesting biological contexts (environmental stress responses, DNA-damage response and hypoxia response) in the budding yeast Saccharomyces cerevisiae. We also show that our methods can learn regulatory programs and cis-regulatory motifs in higher eukaryotes such as worms and humans. Several hypotheses generated by our methods are validated by our collaborators using biochemical experiments. Experimental results demonstrate that our framework is quantitatively and qualitatively predictive. We are able to achieve high prediction accuracy on test data and also generate specific, testable hypotheses.
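The classification setup described in the abstract can be illustrated with a small sketch. It is not the GeneClass or MEDUSA implementation: it swaps in plain discrete AdaBoost over single-feature stumps in place of alternating decision trees, and the toy data, the function names (`pair_features`, `adaboost`, `margin_score`) and the labeling rule are all assumptions made for illustration. Each example stands for one (gene, condition) pair, each feature is "motif present in the promoter AND regulator active in the condition", and the label is +1 for up-regulation or -1 for down-regulation.

```python
import math
from itertools import product

def pair_features(motifs, regulators):
    """Binary features: motif present AND regulator active (hypothetical encoding)."""
    return [m * r for m in motifs for r in regulators]

def stump(x, j, polarity):
    """Predict `polarity` when feature j is on, the opposite sign otherwise."""
    return polarity if x[j] else -polarity

def adaboost(X, y, rounds=10):
    """Discrete AdaBoost over stumps; returns a list of (alpha, feature, polarity)."""
    n = len(X)
    w = [1.0 / n] * n                      # uniform example weights to start
    model = []
    for _ in range(rounds):
        best = None
        for j in range(len(X[0])):
            for polarity in (1, -1):
                # weighted error of this stump under the current weights
                err = sum(wi for wi, xi, yi in zip(w, X, y)
                          if stump(xi, j, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, j, polarity)
        err, j, polarity = best
        perfect = err == 0
        err = min(max(err, 1e-10), 1 - 1e-10)   # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, polarity))
        if perfect:
            break                          # nothing left to reweight
        # up-weight misclassified examples, down-weight correct ones
        w = [wi * math.exp(-alpha * yi * stump(xi, j, polarity))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def margin_score(model, x):
    """Signed, weighted vote; the sign is the predicted up/down response."""
    return sum(alpha * stump(x, j, polarity) for alpha, j, polarity in model)

# Toy data (assumed): 2 motifs x 2 regulators, every on/off combination.
# Assumed rule: a gene is up-regulated only when motif 0 is present
# and regulator 0 is active.
X, y = [], []
for m0, m1, r0, r1 in product((0, 1), repeat=4):
    X.append(pair_features([m0, m1], [r0, r1]))
    y.append(1 if (m0 and r0) else -1)

model = adaboost(X, y)
predictions = [1 if margin_score(model, x) > 0 else -1 for x in X]
```

On this toy rule a single stump on the motif 0 / regulator 0 pair already separates the data, so boosting stops after one round. Real expression data would need many rounds, and the magnitude of `margin_score` would serve as a confidence value in the spirit of the margin-based scores the abstract mentions.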
