What do row and column marginals reveal about your dataset?


Abstract

Numerous datasets ranging from group memberships within social networks to purchase histories on e-commerce sites are represented by binary matrices. While this data is often either proprietary or sensitive, aggregated data, notably row and column marginals, is often viewed as much less sensitive, and may be furnished for analysis. Here, we investigate how these data can be exploited to make inferences about the underlying matrix H. Instead of assuming a generative model for H, we view the input marginals as constraints on the dataspace of possible realizations of H and compute the probability density function of particular entries H(i, j) of interest. We do this for all the cells of H simultaneously, without generating realizations, but rather via implicitly sampling the datasets that satisfy the input marginals. The end result is an efficient algorithm with asymptotic running time the same as that required by standard sampling techniques to generate a single dataset from the same dataspace. Our experimental evaluation demonstrates the efficiency and the efficacy of our framework in multiple settings.
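To make the setting concrete, the following is a minimal brute-force sketch of the question the abstract poses: given only row and column marginals, how likely is each cell H(i, j) to be 1? It is not the authors' algorithm, which avoids generating realizations; it enumerates every binary matrix, so it is only feasible for tiny inputs. The function names `marginals` and `entry_probabilities` are hypothetical, and weighting all consistent matrices uniformly is an assumption made here for illustration.

```python
from itertools import product


def marginals(H):
    """Return (row sums, column sums) of a binary matrix given as a sequence of rows."""
    rows = [sum(r) for r in H]
    cols = [sum(c) for c in zip(*H)]
    return rows, cols


def entry_probabilities(row_sums, col_sums):
    """Brute-force P(H[i][j] = 1) over all binary matrices matching the marginals.

    Assumption: every matrix consistent with the marginals is equally likely.
    Cost is exponential in the number of cells, so use only on tiny examples.
    """
    n, m = len(row_sums), len(col_sums)
    target = (list(row_sums), list(col_sums))
    counts = [[0] * m for _ in range(n)]
    total = 0
    # Enumerate all 2^(n*m) candidate matrices and keep the consistent ones.
    for bits in product((0, 1), repeat=n * m):
        H = [bits[i * m:(i + 1) * m] for i in range(n)]
        if marginals(H) == target:
            total += 1
            for i in range(n):
                for j in range(m):
                    counts[i][j] += H[i][j]
    if total == 0:
        raise ValueError("no binary matrix satisfies these marginals")
    probs = [[c / total for c in row] for row in counts]
    return probs, total


if __name__ == "__main__":
    probs, total = entry_probabilities([2, 1], [1, 1, 1])
    print(total)
    for row in probs:
        print(row)
```

With row sums (2, 1) and column sums (1, 1, 1), exactly three matrices are consistent, so each first-row cell is 1 with probability 2/3 and each second-row cell with probability 1/3. With row sums (2, 0) and column sums (1, 1), only one matrix is consistent, so the marginals reveal every entry exactly.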


Bibliographic details

Source
Annual conference on Neural Information Processing Systems | 2013 | 9 pages
Authors
Behzad Golshan; John W. Byers; Evimaria Terzi
Original format: PDF
Chinese Library Classification: Information processing

