Feature Extraction, Characterization, and Classification of Proteins Using Random Forest.

Abstract

Machine learning algorithms are widely used in bioinformatics to build computational tools, and their use continues to grow with the increasing volume of data, the availability of computational resources, and the invention of new machine learning algorithms. The central task is to fit models to experimentally pre-classified data and then use those models to predict the class of an unclassified instance. Since the advent of whole-genome sequencing, protein sequences have been increasingly deposited and classified in databases. The objectives of this dissertation are to develop a computational tool for protein feature extraction and to implement random forest based algorithms to solve various bioinformatics problems. The project is motivated by gaps in existing feature extraction tools and by the need to improve some current prediction methods. Four tools are developed. The first is the Feature Extraction from Protein Sequence tool (FEPS), an easy-to-use web-based tool that computes the most common protein features and provides them in different output file formats. The other three are RF-NR, RF-Phos, and RF-Hydroxysite. RF-NR predicts the subfamilies of nuclear receptor proteins, a large protein superfamily, while RF-Phos and RF-Hydroxysite predict the sites of post-translational phosphorylation and hydroxylation, respectively, in protein sequences. These methods were validated and tested rigorously with both cross-validation and independent samples. The new tools perform as well as or better than existing tools. They are available online at the Bioinformatics and Computational Biology Lab's website, bcb.ncat.edu.
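As a rough illustration of what sequence-based feature extraction can look like, the sketch below computes two simple descriptors in plain Python: the amino acid composition of a whole sequence, and a fixed-length residue window around a candidate modification site. This is not the FEPS, RF-Phos, or RF-Hydroxysite implementation. The function names, the window size, and the padding character are assumptions made for illustration, and the abstract does not state which features the tools actually compute.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature vectors line up.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-element list with the fraction of each standard residue.

    Hypothetical helper: non-standard characters (e.g. 'X') are ignored.
    """
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def site_window(sequence, position, flank=4, pad="-"):
    """Return the residues around a 0-based position, padded at the ends.

    Hypothetical helper: the result always has length 2 * flank + 1, so
    windows from different sites can be encoded as equal-length features.
    """
    if not 0 <= position < len(sequence):
        raise IndexError("position is outside the sequence")
    left = sequence[max(0, position - flank):position]
    right = sequence[position + 1:position + 1 + flank]
    return left.rjust(flank, pad) + sequence[position] + right.ljust(flank, pad)


if __name__ == "__main__":
    example = "MKTSAYIAK"  # made-up sequence, not taken from the dissertation
    print(amino_acid_composition(example))
    print(site_window(example, 3, flank=2))
```

Vectors like these could be passed, together with known class labels, to any random forest implementation for training. That step is omitted here because the abstract gives no details of the models or their parameters.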

Bibliographic record

  • Author

    Ismail, Hamid D.

  • Author affiliation

    North Carolina Agricultural and Technical State University

  • Degree-granting institution: North Carolina Agricultural and Technical State University
  • Subjects: Bioinformatics; Computer engineering; Computer science
  • Degree: Ph.D.
  • Year: 2016
  • Pages: 177 p.
  • Total pages: 177
  • Original format: PDF
  • Language of text: English
