KHyperLogLog: Estimating Reidentifiability and Joinability of Large Data at Scale

Abstract

Understanding the privacy relevant characteristics of data sets, such as reidentifiability and joinability, is crucial for data governance, yet can be difficult for large data sets. While computing the data characteristics by brute force is straightforward, the scale of systems and data collected by large organizations demands an efficient approach. We present KHyperLogLog (KHLL), an algorithm based on approximate counting techniques that can estimate the reidentifiability and joinability risks of very large databases using linear runtime and minimal memory. KHLL enables one to measure reidentifiability of data quantitatively, rather than based on expert judgement or manual reviews. Meanwhile, joinability analysis using KHLL helps ensure the separation of pseudonymous and identified data sets. We describe how organizations can use KHLL to improve protection of user privacy. The efficiency of KHLL allows one to schedule periodic analyses that detect any deviations from the expected risks over time as a regression test for privacy. We validate the performance and accuracy of KHLL through experiments using proprietary and publicly available data sets.
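
The brute-force baseline that the abstract calls straightforward can be sketched as follows. This is not KHLL itself, only an exact computation that an approximate algorithm would aim to estimate. It assumes that reidentifiability is measured by how many field values are tied to only a few users, and that joinability is measured by the fraction of one column's values that also appear in another column. The function names and the sample rows are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def uniqueness_distribution(rows):
    """Exact histogram {number_of_users: number_of_field_values}.

    rows is an iterable of (field_value, user_id) pairs. This keeps every
    distinct pair in memory, which is the cost a sketch-based approach avoids.
    """
    users_per_value = defaultdict(set)
    for value, user_id in rows:
        users_per_value[value].add(user_id)

    histogram = defaultdict(int)
    for users in users_per_value.values():
        histogram[len(users)] += 1
    return dict(histogram)


def containment(values_a, values_b):
    """Fraction of distinct values in A that also occur in B (assumed joinability measure)."""
    a, b = set(values_a), set(values_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


# Hypothetical sample: (zip_code, user_id) pairs.
rows = [
    ("94103", "u1"),
    ("94103", "u2"),
    ("10001", "u3"),
    ("60601", "u4"),
    ("60601", "u4"),  # repeated pair, counted once
]

hist = uniqueness_distribution(rows)
unique_fraction = hist.get(1, 0) / sum(hist.values())
print(hist, unique_fraction)
print(containment(["a", "b", "c", "d"], ["b", "d", "x"]))
```

A high share of values tied to a single user suggests the field is highly reidentifying, and a containment close to 1 between a pseudonymous column and an identified column suggests the two data sets could be joined. Memory here grows with the number of distinct pairs, which is why an approximate counting approach is attractive at scale.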


Bibliographic details

Source
IEEE Symposium on Security and Privacy | 2019 | pp. 350-364 | 15 pages
Authors
Pern Hui Chia; Damien Desfontaines; Irippuge Milinda Perera; Daniel Simmons-Marengo; Chao Li; Wei-Yen Day; Qiushi Wang; Miguel Guevara;
Original format: PDF
Keywords
Data privacy; Privacy; Organizations; Measurement; Indexes; Approximation algorithms; Runtime;


Similar literature


1. Complexity of estimating multi-way join result sizes for area skewed spatial data [J] . Ho-Hyun Park, Chin-Wan Chung Information Processing Letters . 2000, No. 3

2. Uncertainty estimates in point-velocity measurements due to exposure time by functions that assimilate ISO-1088 data & statistically based on the time scale of the very-large-scale flow motions [J] . Gonzalez-Castro Juan A., Lee Kyutae Flow Measurement and Instrumentation . 2020

3. Biases in Thorpe-Scale Estimates of Turbulence Dissipation. Part I: Assessments from Large-Scale Overturns in Oceanographic Data [J] . Mater Benjamin D., Venayagamoorthy Subhas K., St Laurent Louis Journal of Physical Oceanography . 2015, No. 10

4. KHyperLogLog: Estimating Reidentifiability and Joinability of Large Data at Scale [C] . Pern Hui Chia, Damien Desfontaines, Irippuge Milinda Perera, IEEE Symposium on Security and Privacy . 2019

5. Estimating adult equivalent scale for nutrition data using polynomial spline model. [D] . Lee, ShinDuk. 2012

6. MapReduce Based Personalized Locality Sensitive Hashing for Similarity Joins on Large Scale Data [O] . Jingjing Wang, Chen Lin 2015

7. Complexity of Estimating Multi-way Join Result Sizes for Area Skewed Spatial Data [O] . Ho-Hyun Park Chin-Wan, Ho-hyun Park, Chin-wan Chung 2000

8. Estimating Times to Early Failures Using Finite Data to Estimate the Weibull Scale Parameter. [R] . Neulieb, Robert L. 1977

