MINING NEW WORDS FROM A QUERY LOG FOR INPUT METHOD EDITORS

Abstract

Described is a technology in which new words (including phrases or sets of Chinese characters) are mined from a query log. The new words may be added to (or otherwise supplement) an IME dictionary. A set of candidate queries may be selected from the log based upon market (e.g., the Chinese market) and/or by language. From this set, various filtering steps are performed to locate only new words that are in frequent use. For example, only frequent queries are kept for further processing, which may include filtering out queries based on length (e.g., fewer than two or more than eight Chinese characters), and/or filtering out queries that contain too many stop-words. Processing may also include filtering out a query that is a substring of a larger query, or vice-versa. Also described is Pinyin-based clustering and filtering, and filtering out queries already handled in the dictionary.
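The filtering steps named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented method: the function name `mine_new_words`, the thresholds `min_freq` and `max_stop_ratio`, the per-character stop-word test, and the policy of dropping the shorter query when it is a substring of a longer one are all assumptions made for the example. Only the two-to-eight character range comes from the abstract. Pinyin-based clustering is left out because it needs a character-to-Pinyin mapping that the abstract does not specify.

```python
from collections import Counter


def mine_new_words(queries, dictionary, stop_words,
                   min_freq=2, min_len=2, max_len=8, max_stop_ratio=0.5):
    """Return candidate new words from an iterable of query strings.

    All thresholds are illustrative assumptions except the 2..8 length
    range, which the abstract gives as an example.
    """
    # Count how often each (whitespace-trimmed, non-empty) query occurs.
    counts = Counter(q.strip() for q in queries if q.strip())

    # Step 1: keep only frequent queries.
    frequent = {q: n for q, n in counts.items() if n >= min_freq}

    # Step 2: length filter. len() counts code points, which matches
    # the character count for common Chinese characters.
    sized = {q: n for q, n in frequent.items()
             if min_len <= len(q) <= max_len}

    # Step 3: drop queries made up mostly of stop-words.
    # Assumption: stop-words are single characters and "too many"
    # means a ratio above max_stop_ratio.
    def stop_ratio(q):
        return sum(ch in stop_words for ch in q) / len(q)

    content = {q: n for q, n in sized.items()
               if stop_ratio(q) <= max_stop_ratio}

    # Step 4: substring filter. Assumed policy: drop a query that is
    # contained in a longer surviving query.
    survivors = [q for q in content
                 if not any(q != other and q in other for other in content)]

    # Step 5: drop queries the IME dictionary already handles, then
    # order by descending frequency (ties broken by the string itself).
    return sorted((q for q in survivors if q not in dictionary),
                  key=lambda q: (-content[q], q))
```

Calling `mine_new_words(log, dictionary, stop_words)` with a list of raw query strings, a set of existing dictionary entries, and a set of stop-word characters returns the surviving candidates, most frequent first. The steps run in the order the abstract lists them; a real system might reorder them or apply the substring rule in the opposite direction, which the abstract also allows.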