JACCARD SIMILARITY ESTIMATION OF WEIGHTED SAMPLES: SCALING AND RANDOMIZED ROUNDING SAMPLE SELECTION WITH CIRCULAR SMEARING

Abstract

The disclosed systems and methods include pre-calculation, per object, of object feature bin values, for identifying close matches between objects, such as text documents, that have numerous weighted features, such as specific-length word sequences. Predetermined feature weights get scaled with two or more selected adjacent scaling factors, and randomly rounded. The expanded set of weighted features of an object gets min-hashed into a predetermined number of feature bins. For each feature that qualifies to be inserted by min-hashing into a particular feature bin, and across successive feature bins, the expanded set of weighted features gets min-hashed and circularly smeared into the predetermined number of feature bins. Completed pre-calculated sets of feature bin values for each scaling of the object, together with the scaling factor, are stored for use in comparing sampled features of the object with sampled features of other objects by calculating an estimated Jaccard similarity index.
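The pipeline described in the abstract (scale, randomly round, min-hash into bins, circularly smear, compare) can be illustrated with a minimal Python sketch. This is not the patented method. The names `h64`, `expand`, `sketch` and `estimate_jaccard`, the use of BLAKE2b as the hash, the choice to derive the rounding "coin" from the feature itself, and the particular smearing rule (a replica may fill later bins, wrapping around, with smaller offsets winning) are all assumptions made for illustration.

```python
import hashlib
import math


def h64(*parts):
    """Deterministic 64-bit hash of the given parts (assumed hash choice)."""
    data = "\x1f".join(str(p) for p in parts).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def expand(weights, scale):
    """Scale each weight, round it randomly but reproducibly, emit replicas."""
    replicas = []
    for feature, weight in weights.items():
        scaled = weight * scale
        base = math.floor(scaled)
        # Round up with probability equal to the fractional part. The coin is
        # hashed from the feature so every object rounds that feature alike.
        coin = h64("round", feature) / 2**64
        count = base + (1 if coin < scaled - base else 0)
        replicas.extend((feature, i) for i in range(count))
    return replicas


def sketch(weights, scale, num_bins=64):
    """Min-hash the expanded features into num_bins bins with circular smearing."""
    bins = [None] * num_bins
    for replica in expand(weights, scale):
        value = h64("value", *replica)
        home = h64("bin", *replica) % num_bins
        # Assumed smearing rule: the replica competes for its home bin and for
        # every later bin (wrapping around); smaller offsets beat larger ones,
        # and ties on offset are broken by the smaller hash value.
        for offset in range(num_bins):
            b = (home + offset) % num_bins
            candidate = (offset, value, replica)
            if bins[b] is None or candidate < bins[b]:
                bins[b] = candidate
    return bins


def estimate_jaccard(bins_a, bins_b):
    """Fraction of bins whose stored entries agree between two sketches."""
    matches = sum(1 for x, y in zip(bins_a, bins_b) if x is not None and x == y)
    return matches / len(bins_a)


# Hypothetical documents: three-word sequences with made-up weights.
doc_a = {"the quick brown": 2.0, "quick brown fox": 1.5, "brown fox jumps": 0.7}
doc_b = {"the quick brown": 2.0, "quick brown fox": 1.0, "fox jumps over": 0.9}
similarity = estimate_jaccard(sketch(doc_a, 4.0), sketch(doc_b, 4.0))
```

In this sketch both objects must be sketched with the same scaling factor and bin count before their bins are compared, which is why the abstract's step of storing the scaling factor alongside the bin values matters. The inner loop costs time proportional to the number of bins per replica; it favors clarity over speed.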

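Bibliographic data

  • Publication number: US2020019814A1

    Patent type

  • Publication date: 2020-01-16

    Original format: PDF

  • Applicant/Assignee: SALESFORCE.COM, INC.

    Application number: US201916579706

  • Inventor: MARK MANASSE

    Filing date: 2019-09-23

  • Classification: G06K9/62

  • Country: US

  • Date added to database: 2022-08-21 11:23:19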
