Journal: Consumer Electronics, IEEE Transactions on

Domain corpus independent vocabulary generation for embedded continuous speech recognition


Abstract

This paper proposes a domain-corpus-independent vocabulary generation algorithm to improve vocabulary coverage for embedded Continuous Speech Recognition (CSR). A CSR vocabulary is normally derived from a word frequency list, so its coverage depends on the domain corpus. We present an improved method of vocabulary generation that uses a Part-Of-Speech (POS) tagged corpus and a knowledge base. We investigate the 152 POS tags defined in a POS-tagged corpus, together with its word-POS tag pairs. We analyze all words paired with 101 of the 152 POS tags and decide on a set of words that must be included in vocabularies of any size. The other 51 POS tags fall mainly into noun-related, Named Entity (NE)-related, and verb-related categories. For these, we introduce a domain-corpus-independent word inclusion method that uses the knowledge base. For noun-related POS tags, we generate synonym groups and analyze their relative importance using Google search. We categorize verbs by lemma and analyze the relative importance of each lemma from pre-analyzed verb statistics. We determine the inclusion order of NEs through Google search. The proposed method shows at least a 28.6% relative improvement in coverage on an SMS text corpus for vocabulary sizes of 5K, 10K, 15K, and 20K. In particular, the coverage of the 15K vocabulary generated by the proposed method reaches 97.8%, a relative improvement of 44.2%.
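
The frequency-list baseline that the abstract contrasts against can be illustrated with a minimal sketch. It assumes that coverage means the share of running tokens in a test corpus whose word form appears in the vocabulary; the abstract does not spell out its exact definition, so this may differ from the paper's metric. The function names and the toy corpora are hypothetical and are not taken from the paper.

```python
from collections import Counter


def frequency_vocabulary(tokens, size):
    """Baseline: keep the `size` most frequent word types of a domain corpus."""
    counts = Counter(tokens)
    return {word for word, _ in counts.most_common(size)}


def coverage(vocabulary, tokens):
    """Fraction of running tokens whose word type is in the vocabulary.

    Assumed definition (token-level); returns 0.0 for an empty corpus.
    """
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in vocabulary)
    return hits / len(tokens)


# Toy data, invented for illustration only.
domain_corpus = "the meeting is at noon the meeting room is booked".split()
sms_corpus = "see you at the station at noon".split()

# A vocabulary built from one domain is then scored on text from another.
vocab = frequency_vocabulary(domain_corpus, size=4)
print(f"coverage on SMS text: {coverage(vocab, sms_corpus):.3f}")
```

Because the baseline vocabulary is drawn entirely from `domain_corpus`, words that are common in SMS text but rare in the domain corpus are missed. This is the dependence on the domain corpus that the proposed method aims to remove.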
