
Methods for obtaining improved text similarity measures which replace similar characters with a string pattern representation by using a semantic data tree


Abstract

Embodiments of the invention provide methods for obtaining improved text similarity measures. More specifically, a method of measuring similarity between at least two electronic documents begins by identifying similar terms between the documents. Similarity between terms is based on patterns, which can include word patterns, letter patterns, numeric patterns, and/or alphanumeric patterns, and identifying the similar terms includes identifying multiple pattern types between the documents. Pattern-based similarity also identifies terms within the documents that fall within a category of a hierarchy. Specifically, the identification consults a hierarchical data tree whose nodes represent terms within the electronic documents: lower nodes of the tree hold specific terms, and higher nodes hold general terms.
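The abstract does not specify an implementation, so the following is only a minimal Python sketch of the general idea under stated assumptions. The `PARENT` hierarchy, the letter-to-`A` and digit-to-`9` pattern encoding, the helper names, and the use of Jaccard overlap as the similarity measure are all hypothetical choices made for illustration and are not taken from the patent.

```python
import re

# Hypothetical hierarchy (assumption): each specific term points to a more
# general parent, so lower nodes are specific and higher nodes are general.
PARENT = {
    "poodle": "dog",
    "beagle": "dog",
    "dog": "animal",
    "cat": "animal",
}

def letter_digit_pattern(token):
    """Replace each letter with 'A' and each digit with '9'; keep other characters."""
    out = []
    for ch in token:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def generalize(token, levels=1):
    """Climb at most `levels` steps up the hierarchy toward more general terms."""
    for _ in range(levels):
        if token not in PARENT:
            break
        token = PARENT[token]
    return token

def normalize(text, levels=1):
    """Turn a document into a set of pattern strings and generalized terms."""
    tokens = re.findall(r"[a-z0-9-]+", text.lower())
    result = set()
    for tok in tokens:
        if any(ch.isdigit() for ch in tok):
            # Numeric or alphanumeric token: compare by shape, not by value.
            result.add(letter_digit_pattern(tok))
        else:
            # Plain word: compare by its category in the hierarchy.
            result.add(generalize(tok, levels))
    return result

def jaccard(a, b):
    """Jaccard overlap of two sets (assumed measure; the source names none)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def similarity(doc1, doc2, levels=1):
    return jaccard(normalize(doc1, levels), normalize(doc2, levels))
```

With this sketch, `similarity("poodle order ab-1234", "beagle order xy-9876")` treats the two part numbers as the same alphanumeric pattern and the two breeds as the same category, whereas passing `levels=0` keeps the breeds distinct and lowers the score.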
