Conference: IEEE International Congress on Big Data

Automated big security text pruning and classification

Abstract

Many security-related big data problems, including document, traffic, and system log analysis, require analysis of unstructured text. Consider the task of analyzing company documents for secure storage. Some may be too sensitive to put on a public cloud and require private storage with its associated backup overhead, some may be safe on the cloud in encrypted form, and some may be sufficiently non-sensitive to be stored on the cloud in plain text, without encryption and decryption overhead. Being able to make such categorizations autonomously can significantly strengthen data security, organization, and storage efficiency. In this paper, we analyze several baseline machine-learning-based security risk assessment algorithms and develop techniques to improve upon them. In particular, we examine document sensitivity labeling, assigning each paragraph in a document one of three levels of security risk. For evaluation, we use real sensitive texts from documents leaked by the WikiLeaks organization. We improve upon the base models by using probabilistic topic modeling via Latent Dirichlet Allocation to identify samples from impure subtopics in the training set prior to training a logistic regression classifier.
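
The abstract does not say how an "impure subtopic" is defined or how flagged samples are handled. As a rough sketch only, one plausible reading is: assign each training paragraph a dominant topic (for example from a topic model such as scikit-learn's LatentDirichletAllocation), measure how consistently each topic's paragraphs share one risk label, and drop paragraphs whose topic falls below a purity threshold before fitting the classifier. The function name `prune_impure_topics`, the purity definition, and the `min_purity` value below are all assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter, defaultdict


def prune_impure_topics(labels, topics, min_purity=0.8):
    """Return indices of samples whose dominant topic is label-pure enough.

    labels: one risk label per training paragraph (e.g. 0, 1, 2)
    topics: dominant topic id per paragraph (assumed to come from a topic model)
    min_purity: hypothetical threshold, not a value from the paper
    """
    by_topic = defaultdict(list)
    for label, topic in zip(labels, topics):
        by_topic[topic].append(label)

    # Purity of a topic = share of its samples that carry its majority label.
    purity = {
        topic: Counter(members).most_common(1)[0][1] / len(members)
        for topic, members in by_topic.items()
    }
    return [i for i, topic in enumerate(topics) if purity[topic] >= min_purity]


# Toy data: topics 0 and 2 are pure, topic 1 mixes all three risk levels.
labels = [2, 2, 2, 0, 1, 2, 0, 0]
topics = [0, 0, 0, 1, 1, 1, 2, 2]
kept = prune_impure_topics(labels, topics, min_purity=0.8)
print(kept)  # [0, 1, 2, 6, 7]
```

Under this reading, only the paragraphs at the returned indices would be passed on to train the logistic regression classifier; the three paragraphs in the mixed topic are left out of training.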
