
Performance Evaluation of Probabilistic Latent Semantic Analysis for Unstructured Social Media Data.

Abstract

Big data analytics is being applied in many fields today to mine unstructured data such as social media blogs or medical records. This thesis focuses on two popular analysis techniques, Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Analysis (PLSA), both used for interpreting or extracting concepts and relationships from data. As a use case, we propose to compare their performance in identifying communities from Twitter data sets collected during natural disasters such as hurricanes. Latent semantic analysis uses statistical computations, typically singular value decomposition, to find semantic or contextual meaning in the data. It finds relationships between terms and concepts in an unstructured data set. Probabilistic latent semantic analysis, or indexing, is another method, based on Bayesian analysis, that is typically used for two-mode data.

The objective is to compare these two methods on a large set of social media documents related to Hurricane Sandy in order to form clusters of similar concepts. We then compare the two methods' relative performance in identifying communities and hidden topics, e.g. finding clusters of similar topics such as power outages, floods, and gas outages. We apply two clustering methods, K-Means and Affinity Propagation, to form clusters in the data. Finally, we present the results by applying external methods of evaluation after creating a test data set to compare the performance of these two methods. Metrics such as precision, recall, and the confusion matrix are used to evaluate the performance of our system. The evaluation showed that in almost all scenarios PLSA works better than LSA at finding hidden relationships and structures, whereas LSA is slightly faster than PLSA.
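To make the PLSA step named in the abstract more concrete, the following is a minimal sketch of PLSA fitted with expectation-maximization on a tiny document-term count matrix. It is not the thesis's implementation: the function name `plsa`, the toy vocabulary, the counts, and the number of topics are all assumptions chosen for illustration, and a real pipeline would work on a much larger, preprocessed tweet corpus.

```python
import math
import random


def plsa(counts, n_topics, n_iter=30, seed=0):
    """Fit PLSA by EM on a document-term count matrix (list of lists).

    Returns P(z|d), P(w|z) and the log-likelihood recorded at each iteration.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalize(row):
        total = sum(row) or 1.0  # guard against an all-zero row
        return [x / total for x in row]

    # Random positive initialisation; each row is a probability distribution.
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    log_likelihoods = []

    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        ll = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(joint)
                ll += n * math.log(total)
                for z in range(n_topics):
                    resp = n * joint[z] / total
                    new_z_d[d][z] += resp
                    new_w_z[z][w] += resp
        # M-step: renormalise the expected counts into distributions.
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
        log_likelihoods.append(ll)

    return p_z_d, p_w_z, log_likelihoods


# Hypothetical toy data: rows are documents, columns are word counts.
vocab = ["power", "outage", "dark", "flood", "water", "rain"]
counts = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 3, 2, 1],
    [0, 0, 1, 2, 3, 1],
]

p_z_d, p_w_z, lls = plsa(counts, n_topics=2)
for d, dist in enumerate(p_z_d):
    # Report the most probable latent topic for each document.
    print(d, max(range(len(dist)), key=dist.__getitem__))
```

The per-document topic distributions `p_z_d` could then serve as low-dimensional features for a clustering step such as K-Means; that hand-off is likewise an assumption about how the pieces might fit together, not a description of the thesis's pipeline.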

Bibliographic details

  • Author

    Prakash, Bharat

  • Author affiliation

    University of Maryland, Baltimore County

  • Degree-granting institution: University of Maryland, Baltimore County
  • Subject: Computer science
  • Degree: M.S.
  • Year: 2014
  • Pages: 63 p.
  • Total pages: 63
  • Original format: PDF
  • Language: English
