Randomized Algorithms for Matrices and Data

Michael W. Mahoney

摘要

Randomized algorithms for very large matrix problems have received a great deal of attention in recent years. Much of this work was motivated by problems in large-scale data analysis, largely since matrices are popular structures with which to model data drawn from a wide range of application domains, and this work was performed by individuals from many different research communities. While the most obvious benefit of randomization is that it can lead to faster algorithms, either in worst-case asymptotic theory and/or numerical implementation, there are numerous other benefits that are at least as important. For example, the use of randomization can lead to simpler algorithms that are easier to analyze or reason about when applied in counterintuitive settings; it can lead to algorithms with more interpretable output, which is of interest in applications where analyst time rather than just computational time is of interest; it can lead implicitly to regularization and more robust output; and randomized algorithms can often be organized to exploit modern computational architectures better than classical numerical methods. This monograph will provide a detailed overview of recent work on the theory of randomized matrix algorithms as well as the application of those ideas to the solution of practical problems in large-scale data analysis. Throughout this review, an emphasis will be placed on a few simple core ideas that underlie not only recent theoretical advances but also the usefulness of these tools in large-scale data applications. Crucial in this context is the connection with the concept of statistical leverage. This concept has long been used in statistical regression diagnostics to identify outliers; and it has recently proved crucial in the development of improved worst-case matrix algorithms that are also amenable to high-quality numerical implementation and that are useful to domain scientists. This connection arises naturally when one explicitly decouples the effect of randomization in these matrix algorithms from the underlying linear algebraic structure. This decoupling also permits much finer control in the application of randomization, as well as the easier exploitation of domain knowledge. Most of the review will focus on random sampling algorithms and random projection algorithms for versions of the linear least-squares problem and the low-rank matrix approximation problem. These two problems are fundamental in theory and ubiquitous in practice. Randomized methods solve these problems by constructing and operating on a randomized sketch of the input matrix A — for random sampling methods, the sketch consists of a small number of carefully-sampled and rescaled columns/rows of A, while for random projection methods, the sketch consists of a small number of linear combinations of the columns/rows of A. Depending on the specifics of the situation, when compared with the best previously-existing deterministic algorithms, the resulting randomized algorithms have worst-case running time that is asymptotically faster; their numerical implementations are faster in terms of clock-time; or they can be implemented in parallel computing environments where existing numerical algorithms fail to run at all. Numerous examples illustrating these observations will be described in detail.

机译：近年来，针对非常大的矩阵问题的随机算法引起了极大的关注。这项工作的大部分是由于大规模数据分析中的问题而引起的，这在很大程度上是因为矩阵是流行的结构，可以用来对来自广泛应用领域的数据进行建模，并且这项工作是由来自许多不同研究社区的个人进行的。尽管在最坏情况下的渐近理论和/或数值实现中，随机化的最明显好处是可以导致更快的算法，但还有许多其他好处至少同样重要。例如，使用随机化可以导致更简单的算法，这些算法更易于分析或推断何时应用于违反直觉的设置；它可以导致算法具有更可解释的输出，这在关注分析时间而不只是计算时间的应用中很有用；它可以隐式导致正则化和更强大的输出；通常，与传统的数值方法相比，通常可以组织随机算法来更好地利用现代计算体系结构。该专着将详细介绍有关随机矩阵算法理论的最新工作，以及这些思想在大规模数据分析中解决实际问题的应用。在整个审查过程中，重点将放在一些简单的核心思想上，这些思想不仅是近期理论上的进步，而且是这些工具在大规模数据应用中的实用性的基础。在这种情况下，至关重要的是与统计杠杆概念的联系。长期以来，这一概念已用于统计回归诊断中以识别异常值。最近，它被证明对改进的最坏情况矩阵算法的开发至关重要，这些算法也适用于高质量的数值实现，并且对领域科学家很有用。当人们明确地将这些矩阵算法中的随机化作用与底层线性代数结构解耦时，这种联系自然就会产生。这种解耦还允许在随机化应用中进行更精细的控制，并且更容易利用领域知识。对于线性最小二乘问题和低秩矩阵近似问题的版本，大多数审查将集中于随机采样算法和随机投影算法。这两个问题在理论上是根本的，而在实践中是普遍存在的。随机方法通过构造和操作输入矩阵A的随机草图来解决这些问题-对于随机采样方法，该草图由少量经过仔细采样和重新缩放的A的列/行组成，而对于随机投影方法，该方法由草图由A的列/行的少量线性组合组成。根据情况的具体情况，当与以前最好的确定性算法相比时，所得的随机算法在最坏情况下的运行时间会渐近地加快;它们的数字实现在时钟时间方面更快。或者它们可以在现有数值算法根本无法运行的并行计算环境中实现。将详细描述示出这些观察的许多示例。

Randomized Algorithms for Matrices and Data

摘要

著录项

相似文献

相关主题

期刊订阅