Annual Conference on Neural Information Processing Systems (NeurIPS)

Large-Scale Sparse Principal Component Analysis with Application to Text Data


Abstract

Sparse PCA provides a linear combination of a small number of features that maximizes variance across the data. Although sparse PCA has apparent advantages over PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This follows from a rigorous feature-elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than existing first-order methods. We provide experimental results on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative to topic models.
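The two ideas in the abstract — variance-based safe feature elimination followed by a sparse leading-eigenvector computation — can be sketched in a few lines. This is a hedged illustration, not the paper's exact algorithm: `variance_screen` implements the idea that features whose variance falls below the sparsity penalty `rho` can be dropped up front, and `sparse_pc` uses a simple truncated power iteration as a stand-in for the paper's block coordinate ascent method.

```python
import numpy as np

def variance_screen(X, rho):
    """Drop features whose variance does not exceed rho.

    Sketch of the safe-elimination pre-processing idea: since
    real-life feature variances decay quickly, many columns are
    screened out before the main computation.
    """
    var = X.var(axis=0)
    keep = np.where(var > rho)[0]
    return X[:, keep], keep

def sparse_pc(X, k, n_iter=100, seed=0):
    """Leading sparse principal component with at most k nonzeros,
    via truncated power iteration (a simple stand-in for the
    block coordinate ascent algorithm described in the paper)."""
    S = X.T @ X  # Gram matrix (covariance up to scaling, X assumed centered)
    v = np.random.default_rng(seed).normal(size=S.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = S @ v
        # keep the k largest-magnitude entries, zero out the rest
        drop = np.argsort(np.abs(w))[:-k]
        w[drop] = 0.0
        v = w / np.linalg.norm(w)
    return v
```

For example, on data where only a few columns carry large variance, the screen retains exactly those columns and the sparse component loads only on surviving features.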
