Conference: International Conference on Intelligent and Innovative Computing Applications
Enhanced Search for Arabic Language Using Latent Semantic Indexing (LSI)

Abstract

The Vector Space Model (VSM) is a document representation model widely used in data mining and information retrieval (IR) systems. However, it poses challenges such as a high-dimensional feature space and the loss of semantic information in the representation. Latent semantic indexing (LSI) was therefore proposed to reduce the feature dimensions and to generate semantically rich features that represent conceptual term-document associations. In particular, LSI has been successfully applied in search engines and text classification tasks. In this paper, we propose a novel approach to enhance the quality of the documents retrieved by search engines for the Arabic language: instead of the standard LSI technique, we propose a new extension of it. The proposed LSI method employs word co-occurrences to form a term-by-document matrix, and evaluates documents using cosine similarity measures computed on that matrix. We will empirically evaluate its performance on an Arabic data collection containing at least 500 documents and at least 30,000 unique words. A test set of keywords from a specific domain will be used to evaluate the quality of the top 20-30 retrieved documents for different numbers of retained singular values (i.e. different numbers of dimensions). The proposed method will be judged by comparing its performance against that of standard LSI.
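For orientation, the standard LSI baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration of textbook LSI (raw term counts, truncated SVD, query fold-in, cosine ranking), not the authors' proposed extension, whose details the abstract does not give. The function names, the whitespace tokenizer, the raw-count weighting, and the English placeholder corpus are all assumptions made for the example.

```python
import numpy as np

def build_term_document_matrix(docs):
    """Return (vocab, A), where A[i, j] counts term i in document j.
    Assumption: naive whitespace tokenization and raw counts."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    index = {term: i for i, term in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for tok in doc.split():
            A[index[tok], j] += 1.0
    return vocab, A

def lsi_fit(A, k):
    """Truncated SVD: keep the k largest singular values of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def lsi_rank(query, vocab, U_k, s_k, Vt_k, eps=1e-12):
    """Rank documents by cosine similarity to the query in latent space."""
    index = {term: i for i, term in enumerate(vocab)}
    q = np.zeros(len(vocab))
    for tok in query.split():
        if tok in index:          # out-of-vocabulary tokens are ignored
            q[index[tok]] += 1.0
    # Fold the query into the latent space: q_k = S_k^{-1} U_k^T q
    q_k = (U_k.T @ q) / np.maximum(s_k, eps)
    doc_vecs = Vt_k.T             # one row per document
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_k)
    scores = (doc_vecs @ q_k) / np.maximum(norms, eps)
    order = np.argsort(-scores)
    return [(int(j), float(scores[j])) for j in order]

if __name__ == "__main__":
    # Hypothetical placeholder corpus; a real run would use Arabic documents.
    docs = [
        "student school book",
        "school teacher student",
        "market price merchant",
        "merchant market trade",
    ]
    vocab, A = build_term_document_matrix(docs)
    U_k, s_k, Vt_k = lsi_fit(A, k=2)   # k = number of retained dimensions
    for doc_id, score in lsi_rank("student", vocab, U_k, s_k, Vt_k):
        print(doc_id, round(score, 3))
```

Varying `k` in `lsi_fit` corresponds to the "different singular values (i.e. different numbers of dimensions)" that the evaluation plans to compare.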
