Language modeling based on spoken and unspeakable corpuses

Abstract

A computer system for language modeling may collect training data from one or more information sources, generate a spoken corpus containing text of transcribed speech, and generate a typed corpus containing typed text. The computer system may derive feature vectors from the spoken corpus, analyze the typed corpus to determine feature vectors representing items of typed text, and generate an unspeakable corpus by filtering the typed corpus to remove each item of typed text represented by a feature vector that is within a similarity threshold of a feature vector derived from the spoken corpus. The computer system may derive feature vectors from the unspeakable corpus and train a classifier to perform discriminative data selection for language modeling based on the feature vectors derived from the spoken corpus and the feature vectors derived from the unspeakable corpus.
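The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the abstract does not specify the feature set, the similarity measure, the threshold value, or the classifier type, so the bag-of-tokens features, cosine similarity, the 0.5 threshold, and the smoothed log-odds scorer below are all stand-ins chosen for brevity. The function names (`featurize`, `build_unspeakable`, `train_classifier`) and the sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def featurize(text):
    """Bag-of-tokens feature vector (assumed feature set, not from the source)."""
    return Counter(re.findall(r"\w+|[^\w\s]", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_unspeakable(typed, spoken, threshold=0.5):
    """Keep only typed items that are NOT similar to any spoken item.

    Assumption: "within a similarity threshold" is read here as
    cosine similarity >= threshold.
    """
    spoken_vecs = [featurize(s) for s in spoken]
    kept = []
    for item in typed:
        vec = featurize(item)
        if all(cosine(vec, sv) < threshold for sv in spoken_vecs):
            kept.append(item)
    return kept


def train_classifier(spoken, unspeakable):
    """Return a scorer: positive means spoken-like, negative unspeakable-like.

    A stand-in classifier: add-one-smoothed token log-odds with no class
    prior. The source only says "train a classifier".
    """
    spoken_counts, unspeakable_counts = Counter(), Counter()
    for s in spoken:
        spoken_counts.update(featurize(s))
    for u in unspeakable:
        unspeakable_counts.update(featurize(u))
    vocab_size = len(set(spoken_counts) | set(unspeakable_counts)) + 1
    spoken_total = sum(spoken_counts.values()) + vocab_size
    unspeakable_total = sum(unspeakable_counts.values()) + vocab_size

    def score(text):
        total = 0.0
        for tok, n in featurize(text).items():
            p_spoken = (spoken_counts[tok] + 1) / spoken_total
            p_unspeakable = (unspeakable_counts[tok] + 1) / unspeakable_total
            total += n * math.log(p_spoken / p_unspeakable)
        return total

    return score


# Hypothetical toy corpora for illustration only.
spoken = [
    "what is the weather like today",
    "call my mom on her cell phone",
    "set an alarm for seven",
]
typed = [
    "what is the weather today",
    "<div class=nav> || >> click",
    "http://example.com/a?b=1&c=2",
    "set an alarm for seven thirty",
]

unspeakable = build_unspeakable(typed, spoken)
score = train_classifier(spoken, unspeakable)

# Discriminative data selection: keep candidate text the scorer rates spoken-like.
candidates = ["what is the time", "http://foo.com/x?y=1"]
selected = [c for c in candidates if score(c) > 0]
```

In this toy run, the two typed items that closely resemble spoken sentences are filtered out, leaving the markup fragment and the URL as the unspeakable corpus. The scorer is then used to select spoken-like candidates as language-model training data.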
