BMC Bioinformatics

Multimodal deep representation learning for protein interaction identification and protein family classification


Abstract

Protein-protein interaction (PPI) networks are becoming increasingly important for analyzing biomedical functions, tracing species evolution, and studying the compounds that cause diseases. Understanding the intrinsic patterns behind PPI networks also helps explain cancer-related protein-protein interfaces and the topological structure of cancer networks. Research methods for analyzing PPI networks generally fall into two groups: computational biology methods and high-throughput experimental methods. Given a PPI network, computational biology methods calculate the distances between proteins using network theory metrics (e.g. betweenness, centrality, average degree) or machine learning algorithms. High-throughput techniques, by contrast, including yeast two-hybrid screens (Y2Hs), mass spectrometry protein complex identification (MS-PCI), and nuclear magnetic resonance (NMR), produce large amounts of data for constructing primary protein databases. These databases are a rich primary source for developing molecular and functional networks. Nevertheless, these genome-based techniques demand expensive wet-lab investment and exhaustive lab work. Because of equipment biases in the experimental environment, the results they generate are also inevitably subject to inaccuracy. Moreover, compared with the large amount of protein sequence data available, comparatively few functional units have been discovered.
Traditional machine learning algorithms such as decision trees (DT), naive Bayes (NB), and nearest neighbor (NN) have been used effectively in many data mining tasks. Yet these techniques lack the capacity to discover hidden associations and to extract discriminative features from complex input data.
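
As a rough illustration of the network-theory metrics mentioned above, the following minimal sketch computes the average degree of a toy PPI network and a hop-count distance between two proteins using breadth-first search. The protein IDs and edges are made up for illustration, and the helper names (`build_adjacency`, `average_degree`, `shortest_path_length`) are assumptions of this sketch, not taken from the paper.

```python
from collections import deque

# Hypothetical toy PPI network: undirected edges between made-up protein IDs.
EDGES = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P2", "P4"), ("P4", "P5")]


def build_adjacency(edges):
    """Build an undirected adjacency map {protein: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def average_degree(adj):
    """Mean number of interaction partners per protein."""
    return sum(len(neighbours) for neighbours in adj.values()) / len(adj)


def shortest_path_length(adj, source, target):
    """Hop-count distance via breadth-first search; None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in adj[node]:
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


adj = build_adjacency(EDGES)
print("average degree:", average_degree(adj))
print("distance P1 to P5:", shortest_path_length(adj, "P1", "P5"))
```

Hop count is only one possible notion of distance; betweenness and other centrality measures build on the same shortest-path computation.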
More recently, with the advancement of AI techniques, deep learning methods that extract non-linear, high-dimensional features from protein sequences have emerged as a new trend. These techniques and frameworks have been applied in many biomedical research fields, including biological network analysis and medical image examination. However, because real-world data distributions are highly complex and multimodal, it is essential to incorporate different modalities and patterns from the data to attain satisfactory performance. Discovering biological patterns in the graph topology of protein networks is likewise fundamental to understanding the functions of cells and their constituent proteins. When deep learning techniques are applied to biological network analysis, these modalities include topological similarities, such as first-order and second-order similarity, and homology features extracted from protein sequences.
Next-generation sequencing technologies also generate large amounts of DNA/RNA sequences, which are then translated into protein peptides in the form of chains of amino acid residues. These protein sequences make up the fundamental molecules that perform biological functions in various species. The functionality of a protein is thus encoded in its amino acid residues. To recognize protein functionality, researchers categorize proteins into families such that proteins within the same family share similar functions or take part in the same pathway.
In this paper, we propose an advanced multimodal deep representation learning framework that preserves different modalities in order to capture both protein sequence similarity and topological proximity. The framework leverages both relational and physicochemical information from proteins and integrates them using a late feature fusion technique. The concatenated features are then passed to the interaction identifier and the protein family classifier for training and testing.
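
The two topological similarities and the late-fusion step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: first-order similarity is taken as direct linkage, second-order similarity as the cosine similarity of two proteins' neighbourhood vectors, raw adjacency rows stand in for learned topology embeddings, and amino acid composition stands in for learned sequence features. All protein IDs, sequences, and function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Hypothetical toy data: made-up protein IDs, interactions, and sequences.
EDGES = [("P1", "P3"), ("P2", "P3"), ("P1", "P4"), ("P2", "P4")]
SEQUENCES = {"P1": "MKVLA", "P2": "MKTLA", "P3": "GGSW", "P4": "ACDE"}

NODES = sorted(SEQUENCES)
INDEX = {name: i for i, name in enumerate(NODES)}


def adjacency_rows(edges):
    """Return {protein: adjacency row} for an undirected, unweighted graph."""
    rows = {name: [0.0] * len(NODES) for name in NODES}
    for u, v in edges:
        rows[u][INDEX[v]] = 1.0
        rows[v][INDEX[u]] = 1.0
    return rows


def first_order(rows, u, v):
    """First-order similarity (assumed here): 1.0 if directly linked, else 0.0."""
    return rows[u][INDEX[v]]


def second_order(rows, u, v):
    """Second-order similarity (assumed here): cosine of neighbourhood vectors."""
    dot = sum(a * b for a, b in zip(rows[u], rows[v]))
    norm = math.sqrt(sum(a * a for a in rows[u])) * math.sqrt(
        sum(b * b for b in rows[v])
    )
    return dot / norm if norm else 0.0


def composition(sequence):
    """Amino acid composition: a simple stand-in for learned sequence features."""
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


def fuse(rows, name):
    """Late fusion by concatenating topology and sequence feature vectors."""
    return rows[name] + composition(SEQUENCES[name])


ROWS = adjacency_rows(EDGES)
print("first-order  P1,P2:", first_order(ROWS, "P1", "P2"))
print("second-order P1,P2:", round(second_order(ROWS, "P1", "P2"), 6))
print("fused vector length:", len(fuse(ROWS, "P1")))
```

In this toy graph P1 and P2 do not interact directly but share all of their neighbours, which is the situation second-order similarity is meant to capture. In a full pipeline, the fused vector would be the input to a downstream classifier.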
