Journal of Bioinformatics and Computational Biology

TWO-PASS IMPUTATION ALGORITHM FOR MISSING VALUE ESTIMATION IN GENE EXPRESSION TIME SERIES

Abstract

Gene expression microarray experiments frequently generate datasets with multiple missing values. However, most analysis, mining, and classification methods for gene expression data require a complete matrix of gene array values. Accurate estimation of missing values in such datasets has therefore been recognized as an important issue, and several imputation algorithms have already been proposed to the biological community. Most of these approaches, however, are not particularly suitable for time series expression profiles. We therefore propose a novel imputation algorithm that is specifically suited to estimating missing values in gene expression time series data. The algorithm uses the Dynamic Time Warping (DTW) distance to measure the similarity between time expression profiles, and then selects, for each gene expression profile with missing values, a dedicated set of candidate profiles for estimation. Three DTW-based imputation (DTWimpute) algorithms have been considered: position-wise, neighborhood-wise, and two-pass imputation. These were initially prototyped in Perl, and their accuracy was evaluated on yeast expression time series data using several different parameter settings. The experiments showed that the two-pass algorithm consistently outperforms the neighborhood-wise and position-wise algorithms, in particular for datasets with a higher level of missing entries. The two-pass DTWimpute algorithm was further benchmarked against the weighted K-Nearest Neighbors algorithm, which is widely used in the biological community, and appeared superior to it. Motivated by these findings, which clearly indicate the added value of DTW techniques for missing value estimation in time series data, we have built an optimized C++ implementation of the two-pass DTWimpute algorithm. The software also provides a choice between three different initial rough imputation methods.
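
To make the idea in the abstract concrete, the following is a minimal Python sketch of the classic DTW distance together with a simplified position-wise estimate. It is not the authors' DTWimpute code. The function names, the absolute-difference cost, the row-mean rough imputation, the plain average over the `k` nearest complete profiles, and the parameter `k` itself are all assumptions made for illustration; the abstract does not specify how candidates are selected or weighted.

```python
import math


def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences.

    Uses absolute difference as the local cost (an assumption; the
    abstract does not state which local cost the paper uses).
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]


def impute_position_wise(profile, candidates, k=2):
    """Fill None entries of `profile` from the k most DTW-similar candidates.

    Hypothetical simplification: candidates are complete profiles of the
    same length, and each gap is filled with the unweighted mean of the
    candidates' values at the same time point.
    """
    observed = [v for v in profile if v is not None]
    # Rough initial imputation so that DTW can be computed at all.
    # Row mean is one plausible choice, assumed here.
    row_mean = sum(observed) / len(observed)
    rough = [row_mean if v is None else v for v in profile]

    # Select the k candidate profiles closest to the roughly imputed one.
    nearest = sorted(candidates, key=lambda c: dtw_distance(rough, c))[:k]

    return [
        sum(c[t] for c in nearest) / len(nearest) if v is None else v
        for t, v in enumerate(profile)
    ]


if __name__ == "__main__":
    complete = [[1.0, 2.0, 3.0], [1.0, 2.2, 3.0], [9.0, 9.0, 9.0]]
    print(impute_position_wise([1.0, None, 3.0], complete, k=2))
```

Because DTW allows one time point to align with several, a profile and a time-stretched copy of it have distance zero, which is the property that makes the measure attractive for expression time series with shifted or stretched dynamics.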
