A Learning-Based Method for LncRNA-Disease Association Identification Combing Similarity Information and Rotation Forest
Journal: iScience (U.S. National Institutes of Health literature)

Abstract

Introduction

Long non-coding RNAs (lncRNAs) are an important class of transcripts longer than 200 nt that participate in various physiological processes, such as immune surveillance, post-translational regulation, cell differentiation, proliferation, apoptosis, and epigenetic regulation. In particular, accumulating studies indicate that a large number of lncRNAs are involved in complex human diseases, including various cancers, blood diseases, and neurodegenerative diseases. Inferring potential associations between lncRNAs and diseases therefore helps to explain the pathogenesis of complex diseases at the molecular level and provides new insights into their diagnosis, treatment, and prognosis.

Thanks to the development of experimental techniques such as microarrays, Northern blots and qPCR, fluorescence in situ hybridization, RNA interference, and RNA immunoprecipitation, a large amount of data on lncRNA-disease associations has been determined and deposited in public databases such as lncRNAdb, NRED, and NONCODE. However, although experimentally validated lncRNA-disease associations drive research and development in medical molecular biology, they often suffer from high false-positive and false-negative rates. Moreover, many experimental methods are expensive and time-consuming. Consequently, it is essential to develop computational prediction approaches that use the accumulated biological data to find potential lncRNA-disease associations accurately and rapidly. Computational methods can quantitatively describe the associations between lncRNAs and diseases and efficiently screen out the most promising lncRNA-disease pairs for further experimental validation.

Existing computational methods for predicting lncRNA-disease associations can be roughly divided into three categories.
Methods in the first category uncover ncRNA-disease associations based on the idea of network or link prediction. The underlying assumption is that lncRNAs associated with the same or similar diseases are more likely to have similar functions. Liao et al. constructed a coding-non-coding gene co-expression network based on public microarray expression profiles to discover the potential functions of lncRNAs. Yang et al. applied a propagation algorithm to predict lncRNA-disease associations by constructing a coding-non-coding gene-disease bipartite network based on known associations between diseases and disease-causing genes. Chen et al. proposed a model called IRWRLDA to identify potential associations by integrating known lncRNA-disease associations, disease semantic similarity, and various lncRNA similarity measures. Huang et al. proposed a model called PBMDA to predict microRNA (miRNA)-disease associations by integrating known human miRNA-disease associations, miRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity.

Methods in the second category use matrix factorization to identify potential lncRNA-disease associations. The basic assumption is that unknown association information can be derived from other known association information. Fu et al. predicted lncRNA-disease associations by decomposing data matrices of heterogeneous data sources into low-rank matrices. Lu et al. developed a method called SIMCLDA for potential lncRNA-disease association prediction based on inductive matrix completion (Lu et al., 2018).

These two types of methods rest on specific assumptions that are not unanimously accepted. Relevant studies have shown that in many cases biomacromolecules with similar structures or ligands do not have the same functions.
Matrix factorization approaches suffer dramatic performance degradation when the known association information is insufficient. In addition, neither type of method can mine the similarity features of lncRNAs and diseases or consider the inherent logic of their association from a data-driven perspective.

Machine learning models are used in the third category to discover unknown lncRNA-disease associations. Lan et al. proposed a method called LDAP to identify latent associations between lncRNAs and diseases by using a bagging support vector machine (SVM) classifier based on lncRNA similarity and disease similarity (Lan et al., 2016). Because these methods mark the beginning of machine learning applications for lncRNA-disease association prediction, there is still much room for improvement: prediction accuracy can be raised substantially by increasing the number of training samples and by using more appropriate and advanced learning algorithms. Recently, the accumulation of lncRNA-disease association data and the development of machine learning technology have provided a better opportunity for predicting these associations with supervised learning models.

Instead of using network-based or matrix factorization-based methods to compute association scores directly, we extracted association features from lncRNA-disease pairs using multiple similarity matrices and trained machine learning models in a supervised manner to predict their association. In this study, we propose a novel supervised computational method named LDASR for large-scale lncRNA-disease association prediction based on collaborative filtering and machine learning technologies.
First, the feature vectors of the lncRNA-disease pairs were obtained by integrating lncRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity. Second, an autoencoder neural network was employed to reduce the feature dimension and obtain the optimal feature subspace from the original feature set. Finally, considering the size of the training set and the possible non-linear relationships in the input, we trained a Rotation Forest classifier to predict lncRNA-disease associations. The flow of LDASR is shown in Figure 1. Under leave-one-out cross-validation (LOOCV) and five-fold cross-validation, the proposed LDASR model achieved better results than some previous methods, with AUCs of 0.9502 and 0.9428, respectively. These results show that a supervised learning model can achieve better performance.

Figure 1. Flowchart of LDASR
Step 1: Building three similarity matrices for diseases by combining semantic information and Gaussian kernel information. Step 2: Building one similarity matrix for lncRNAs. Step 3: Extracting similarity feature vectors for diseases and lncRNAs from the disease similarity matrix and the lncRNA similarity matrix. Step 4: Extracting the same number of positive and negative samples from the adjacency matrix to construct the dataset used in this paper. Step 5: Selecting the most valuable features and reducing feature noise with an autoencoder. Step 6: Feeding the more discriminative feature vectors into the Rotation Forest ensemble classifier for training, verification, and prediction. For the construction of the disease semantic matrix, see also Figure S1.
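
The similarity and sampling steps of the flowchart (Steps 1 to 4) can be illustrated with a small sketch. It is not the authors' implementation. It assumes the commonly used form of the Gaussian interaction profile (GIP) kernel, exp(-gamma * ||IP(i) - IP(j)||^2), with gamma normalised by the mean squared profile norm. It also assumes that a pair's feature vector is the concatenation of the lncRNA's and the disease's similarity rows, and that negative samples are drawn at random from unlabelled pairs. The semantic and functional similarity matrices, the autoencoder, and the Rotation Forest are omitted, and all function names and the toy adjacency matrix are hypothetical.

```python
import math
import random


def gip_kernel(profiles, gamma_prime=1.0):
    """GIP kernel similarity between the rows of a 0/1 interaction matrix."""
    n = len(profiles)
    # Normalise the bandwidth by the mean squared norm of the profiles.
    mean_sq_norm = sum(sum(v * v for v in row) for row in profiles) / n
    gamma = gamma_prime / mean_sq_norm if mean_sq_norm > 0 else gamma_prime
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist_sq = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-gamma * dist_sq)
    return sim


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def balanced_pairs(adj, seed=0):
    """All known (lncRNA, disease) pairs plus as many random unlabelled pairs."""
    cells = [(i, j) for i, row in enumerate(adj) for j in range(len(row))]
    pos = [(i, j) for i, j in cells if adj[i][j] == 1]
    unlabelled = [(i, j) for i, j in cells if adj[i][j] == 0]
    neg = random.Random(seed).sample(unlabelled, len(pos))
    return pos, neg


def pair_features(pair, lnc_sim, dis_sim):
    """Assumed descriptor: lncRNA similarity row followed by disease similarity row."""
    i, j = pair
    return lnc_sim[i] + dis_sim[j]


# Hypothetical toy adjacency matrix: 3 lncRNAs (rows) x 4 diseases (columns).
adj = [
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
]
lnc_sim = gip_kernel(adj)             # 3 x 3 lncRNA similarity
dis_sim = gip_kernel(transpose(adj))  # 4 x 4 disease similarity
pos, neg = balanced_pairs(adj)
X = [pair_features(p, lnc_sim, dis_sim) for p in pos + neg]
y = [1] * len(pos) + [0] * len(neg)
```

In this sketch `X` and `y` play the role of the balanced dataset from Step 4. A dimensionality-reduction step and an ensemble classifier would then be trained on them and evaluated with cross-validation.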
