...
首页> 外文期刊>Computational statistics & data analysis >Finding approximate solutions to combinatorial problems with very large data sets using BIRCH
【24h】

Finding approximate solutions to combinatorial problems with very large data sets using BIRCH

机译:使用BIRCH查找具有非常大的数据集的组合问题的近似解决方案

获取原文
获取原文并翻译 | 示例
           

摘要

Computing estimators with good robustness properties generally requires solving highly complex optimization problems. The current state-of-the-art algorithms to find approximate solutions to these problems need to access the data set a large number to times and become unfeasible when the data do not fit in memory. In this paper the BIRCH algorithm is adapted to calculate approximate solutions to problems in this class. For data sets that fit in memory, this approach is able to find approximate Least Trimmed Squares (LTS) and Minimum Covariance Determinant (MCD) estimators that compare very well with those returned by the fast-LTS and fast-MCD algorithms, and in some cases is able to find a better solution (in terms of value of the objective function) than those returned by the fast- algorithms. This methodology can also be applied to the Linear Grouping Algorithm and its robust variant for very large datasets. Finally, results from a simulation study indicate that this algorithm performs comparably well to fast-LTS in simple situations (large data sets with a small number of covariates and small proportion of outliers) and does much better than fast-LTS in more challenging situations without requiring extra computational time. These findings seem to confirm that this approach provides the first computationally feasible and reliable approximating algorithm in the literature to compute the LTS and MCD estimators for data sets that do not fit in memory.
机译:具有良好鲁棒性的计算估计器通常需要解决高度复杂的优化问题。寻找这些问题的近似解决方案的当前最新算法需要大量访问数据集,并且在数据不适合内存时变得不可行。在本文中,BIRCH算法适用于计算此类问题的近似解。对于适合内存的数据集,此方法能够找到近似的最小二乘平方(LTS)和最小协方差决定因素(MCD)估计量,这些估计值与快速LTS和快速MCD算法返回的估计值相比非常好。案例能够找到比快速算法返回的解决方案更好的解决方案(就目标函数的价值而言)。该方法还可以应用于线性分组算法及其针对大型数据集的鲁棒变体。最后,仿真研究的结果表明,该算法在简单情况下(大型数据集,协变量数量少,离群值比例小),其性能与快速L​​TS相当,并且在没有挑战的情况下,与快速LTS相比,性能要好得多需要额外的计算时间。这些发现似乎证实了这种方法在文献中提供了第一个计算上可行且可靠的近似算法,以计算不适合内存的数据集的LTS和MCD估计量。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号