Published in: IEEE Transactions on Computational Social Systems
Mitigating the Impact of Data Sampling on Social Media Analysis and Mining


Abstract

The last decade has witnessed explosive growth in the users and content of online social media. Owing to the unprecedented scale and cascading power of the underlying social networks, social media has created a new paradigm in which any user can share information, broadcast breaking news, and report real-time events from anywhere at any time. Many popular social media sites, including Twitter, provide streaming data services through standard APIs to the broad research and developer communities. Given the sheer volume, rapid velocity, and feature variety of online social media data, these sites often supply only a sampled set of the streaming data rather than the full data set, in order to reduce the resource cost of computation, storage, and network bandwidth. In light of the substantial impact of sampling on the Twitter data stream, this article explores a combination of spectral clustering, locality-sensitive hashing (LSH), latent Dirichlet allocation (LDA) topic modeling, and differential equation modeling to mitigate the impact of sampling on social media data analysis, in particular on detecting real-world events and predicting information diffusion. Our extensive experiments demonstrate that the proposed method can effectively detect emerging real-time events and accurately predict the cascading patterns of these events from the 1% sampled Twitter data stream. To the best of our knowledge, this article is the first effort to introduce a systematic methodology for studying and mitigating the impact of data sampling on social media analysis and mining.
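
The abstract names locality-sensitive hashing as one component of the pipeline but does not say which LSH family is used or how it is applied. As a rough illustration only, the sketch below shows one common variant, MinHash with banding, used to group near-duplicate short texts such as tweets. Every function name and parameter here (`shingles`, `minhash_signature`, `candidate_pairs`, `num_hashes`, `bands`) is hypothetical and chosen for this example; none of it is taken from the article.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of word k-grams in a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes=32):
    """MinHash signature: the minimum hash value of the set under each seed."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def candidate_pairs(texts, num_hashes=32, bands=16):
    """Return index pairs whose signatures agree on at least one band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for idx, text in enumerate(texts):
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            # Texts that share every row of a band land in the same bucket.
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(idx)

    pairs = set()
    for members in buckets.values():
        ordered = sorted(members)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs


if __name__ == "__main__":
    # Assumed toy inputs, not data from the article.
    sample = [
        "breaking news earthquake reported near the coast this morning",
        "breaking news earthquake reported near the coast this morning",
        "my cat refuses to eat anything except the expensive food",
    ]
    print(candidate_pairs(sample))
```

In a sketch like this, more bands with fewer rows each make the grouping more permissive, while fewer bands with more rows make it stricter. How the article tunes or combines LSH with spectral clustering and LDA is not stated in the abstract.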
