
An Automatic Similarity Detection Engine Between Sacred Texts Using Text Mining and Similarity Measures



Abstract

Is there any similarity between the contexts of the Holy Bible and the Holy Quran, and can it be demonstrated mathematically? Using the Bible and the Quran as our corpus, this research explores the performance of various feature extraction and machine learning techniques. The unstructured nature of text data adds an extra layer of complexity to feature extraction, and the inherent sparsity of the corresponding data matrices makes text mining a distinctly difficult task. Among other things, we assess the difference between domain-based syntactic feature extraction and domain-free feature extraction, and then apply a variety of similarity measures (Euclidean, Hellinger, Manhattan, cosine, Bhattacharyya, symmetric Kullback-Leibler, Jensen-Shannon, probabilistic chi-square, and Clark) to identify similarities and differences between the sacred texts.

We began by comparing chapters of the two raw texts using these proximity measures, to visualize their behavior in a high-dimensional, sparse space. Some chapters were apparently similar, but the result was not conclusive, so the noise had to be cleaned using natural language processing (NLP). For example, to reduce the size of the two vectors, we built lists of vocabulary that is worded differently in the two texts but carries the same meaning. The program therefore recognizes "Lord" in the Bible and "Allah" in the Quran as "God", and "Jacob" in the Bible and "Yaqub" in the Quran as a prophet.

This process was repeated many times to give relative comparisons across a variety of words. After the raw texts had been compared, the comparison was repeated on the processed text.

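The vocabulary normalization described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the `SYNONYMS` table, the `tokenize` and `cosine_similarity` helpers, and the two sample sentences are all hypothetical, and only the Lord/Allah and Jacob/Yaqub mappings come from the abstract.

```python
import math
import re
from collections import Counter

# Hypothetical synonym table. The abstract names only these four mappings;
# the actual lists used in the thesis are not given.
SYNONYMS = {"lord": "god", "allah": "god", "jacob": "prophet", "yaqub": "prophet"}


def tokenize(text, normalize=True):
    """Lowercase the text, keep alphabetic tokens, optionally map synonyms."""
    words = re.findall(r"[a-z]+", text.lower())
    if normalize:
        return [SYNONYMS.get(w, w) for w in words]
    return words


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up sample sentences, used only to show the effect of normalization.
bible_text = "The Lord spoke to Jacob"
quran_text = "Allah spoke to Yaqub"

raw = cosine_similarity(
    Counter(tokenize(bible_text, normalize=False)),
    Counter(tokenize(quran_text, normalize=False)),
)
cleaned = cosine_similarity(
    Counter(tokenize(bible_text)),
    Counter(tokenize(quran_text)),
)
print(f"raw: {raw:.3f}, normalized: {cleaned:.3f}")
```

Mapping differently worded terms to one shared token shrinks the combined vocabulary and increases the overlap between the two term-frequency vectors, which is why the normalized score in this toy example is higher than the raw one.
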
The next comparison applied probabilistic topic modeling to the feature-extracted matrix, projecting the topic matrix into a low-dimensional space for a denser comparison.

Among the distance measures applied to the sacred corpora, probability-based measures such as Kullback-Leibler and Jensen-Shannon gave the best results. Hellinger distance computed on the CTM also discriminated well between documents.

This work started from the belief that if the Bible and the Quran intersect, the overlap would show clearly between the book of Deuteronomy and some Quranic chapters. It can now be said, not only historically but also mathematically, that the Biblical and Quranic contexts are more similar to each other than the texts within each holy book are among themselves. We further conclude that distances based on probabilistic measures, such as Jeffreys divergence and Hellinger distance, are the recommended methods for unstructured sacred texts.
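
For readers unfamiliar with the probabilistic measures named above, here is a minimal sketch of Hellinger distance, Jeffreys divergence (the symmetric Kullback-Leibler divergence), and Jensen-Shannon divergence on discrete distributions. The function names, the `smooth` helper, and the two topic-proportion vectors are illustrative assumptions and are not taken from the thesis.

```python
import math


def smooth(p, eps=1e-9):
    """Add a small constant and renormalize so no probability is exactly zero."""
    total = sum(x + eps for x in p)
    return [(x + eps) / total for x in p]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q); terms with p_i = 0 contribute 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def hellinger(p, q):
    """Hellinger distance, bounded between 0 and 1."""
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )


def jeffreys(p, q):
    """Jeffreys divergence: KL(p || q) + KL(q || p), smoothed to avoid log(0)."""
    p, q = smooth(p), smooth(q)
    return kl_divergence(p, q) + kl_divergence(q, p)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (natural log), bounded above by ln 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Made-up topic proportions for two hypothetical chapters.
chapter_a = [0.70, 0.20, 0.10]
chapter_b = [0.60, 0.25, 0.15]

print("Hellinger:     ", hellinger(chapter_a, chapter_b))
print("Jeffreys:      ", jeffreys(chapter_a, chapter_b))
print("Jensen-Shannon:", jensen_shannon(chapter_a, chapter_b))
```

All three measures compare probability vectors rather than raw counts, so they apply naturally to the per-document topic proportions produced by a topic model. The smoothing step is one common way to keep Jeffreys divergence finite when a topic has zero weight in one document.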

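Bibliographic record

  • Author affiliation: Rochester Institute of Technology
  • Degree-granting institution: Rochester Institute of Technology
  • Subject: Statistics; Computer science; Mathematics
  • Degree: M.S.
  • Year: 2014
  • Pages: 104 p.
  • Total pages: 104
  • Original format: PDF
  • Language: English
  • Chinese Library Classification: Public buildings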
