
Exploiting Tag and Word Correlations for Improved Webpage Clustering


Abstract

Automatic clustering of webpages helps a number of information retrieval tasks, such as improving user interfaces, clustering collections, and introducing diversity in search results. Typically, webpage clustering algorithms use only features extracted from the page-text. However, the advent of social-bookmarking websites, such as StumbleUpon and Delicious, has led to a huge amount of user-generated content, such as the tag information associated with webpages. In this paper, we present a subspace-based feature extraction approach that leverages tag information to complement the page-contents of a webpage and extract highly discriminative features, with the goal of improved clustering performance. In our approach, we consider page-text and tags as two separate views of the data, and learn a shared subspace that maximizes the correlation between the two views. Any clustering algorithm can then be applied in this subspace. We compare our subspace-based approach with a number of baselines that use tag information in various other ways, and show that the subspace-based approach leads to improved performance on the webpage clustering task. Although our results here are on the webpage clustering task, the same approach can be used for webpage classification as well. Finally, we suggest possible future work on leveraging tag information in webpage clustering, especially when tag information is available for only a small number of webpages rather than all of them.
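The two-view idea in the abstract can be illustrated with a small sketch. The abstract does not name the algorithm used to learn the shared subspace, so the code below assumes regularized canonical correlation analysis (CCA), a standard technique for finding directions that maximize the correlation between two views. The function names `cca_project` and `kmeans`, the ridge term `reg`, and the synthetic "text" and "tag" matrices are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np


def cca_project(X, Y, k, reg=1e-3):
    """Project view X onto its top-k canonical directions with respect to view Y.

    Returns the projected data (n x k) and the k canonical correlations.
    `reg` is a small ridge term (an assumption of this sketch) that keeps
    the covariance matrices invertible.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n

    # Whiten each view with its Cholesky factor, then take an SVD of the
    # whitened cross-covariance M = Lx^{-1} Cxy Ly^{-T}.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Ly, np.linalg.solve(Lx, Cxy).T).T
    U, s, _ = np.linalg.svd(M, full_matrices=False)

    Wx = np.linalg.solve(Lx.T, U[:, :k])  # directions in the original X features
    return Xc @ Wx, s[:k]


def kmeans(Z, n_clusters, n_iter=50):
    """Plain Lloyd's k-means with a deterministic farthest-first initialisation."""
    centers = [Z[0]]
    for _ in range(n_clusters - 1):
        d = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(n_clusters):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels


# Synthetic stand-ins for the two views (hypothetical data, not from the paper):
# both views carry the same hidden two-cluster signal plus independent noise.
rng = np.random.default_rng(0)
true_labels = np.repeat([0, 1], 150)
z = 2.0 * true_labels - 1.0
text_view = z[:, None] * np.ones(20) + 0.5 * rng.normal(size=(300, 20))
tag_view = z[:, None] * np.ones(10) + 0.5 * rng.normal(size=(300, 10))

# Learn the shared subspace, then cluster the projected page-text features.
Z, corrs = cca_project(text_view, tag_view, k=1)
pred = kmeans(Z, n_clusters=2)

# Cluster ids are arbitrary, so score agreement up to a label swap.
agreement = max(np.mean(pred == true_labels), np.mean(pred != true_labels))
print("top canonical correlation:", corrs[0])
print("agreement with hidden labels:", agreement)
```

In this sketch only the page-text view is projected before clustering; the tag view is used solely to choose the projection directions. Any other clustering algorithm could replace the `kmeans` helper, which matches the abstract's remark that the subspace is independent of the clustering method.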

Bibliographic record

  • Source
    《》 | 2010 | pp. 3-11 | 9 pages
  • Conference location: Toronto (CA)
  • Author affiliations

    School of Computing, University of Utah, Salt Lake City, Utah, USA;

    School of Computing, University of Utah, Salt Lake City, Utah, USA;

    VA SLC Healthcare System, University of Utah, Salt Lake City, Utah, USA;

    Dept. of Computer Science, University of Maryland, College Park, Maryland, USA;

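  • Conference organizer
  • Original format: PDF
  • Text language: eng
  • Chinese Library Classification: computing technology, computer technology;
  • Keywords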

    social tagging; webpage clustering;

