Domain corpus independent vocabulary generation for embedded continuous speech recognition

Minkyu Lim; Kwang-Ho Kim; Ji-Hwan Kim

Abstract

This paper proposes a domain corpus independent vocabulary generation algorithm to improve vocabulary coverage for embedded Continuous Speech Recognition (CSR). A vocabulary in CSR is normally derived from a word frequency list, so its coverage depends on the domain corpus. We present an improved way of generating vocabularies using a Part-Of-Speech (POS) tagged corpus and a knowledge base. We investigate the 152 POS tags defined in a POS tagged corpus, together with its word-POS tag pairs. We analyze all words paired with 101 of the 152 POS tags and decide on a set of words that must be included in vocabularies of any size. The other 51 POS tags are mainly categorized as noun-related, Named Entity (NE)-related, and verb-related. We introduce a domain corpus independent word inclusion method for noun-, verb-, and NE-related POS tags using the knowledge base. For noun-related POS tags, we generate synonym groups and analyze their relative importance using Google search. We categorize verbs by lemma and analyze the relative importance of each lemma from pre-analyzed verb statistics. We determine the inclusion order of NEs through Google search. The proposed method shows at least 28.6% relative improvement in coverage on an SMS text corpus for vocabulary sizes of 5K, 10K, 15K, and 20K. In particular, the coverage of the 15K vocabulary generated by the proposed method reaches 97.8%, a relative improvement of 44.2%.
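The abstract's starting point, that a vocabulary cut from a word frequency list covers text well only when that text resembles the corpus the list came from, can be illustrated with a minimal sketch. This is not the authors' algorithm or data: the function names (`frequency_vocabulary`, `coverage`) and the two toy token lists are assumptions made up for illustration, and coverage is taken here to mean the fraction of running words found in the vocabulary.

```python
from collections import Counter


def frequency_vocabulary(corpus_tokens, size):
    """Return the `size` most frequent words of a tokenized corpus (hypothetical helper)."""
    counts = Counter(corpus_tokens)
    return {word for word, _ in counts.most_common(size)}


def coverage(vocabulary, test_tokens):
    """Fraction of running words in `test_tokens` that appear in `vocabulary`."""
    if not test_tokens:
        return 0.0
    in_vocab = sum(1 for token in test_tokens if token in vocabulary)
    return in_vocab / len(test_tokens)


# Toy corpora, invented for illustration only.
news_tokens = "the market rose as the bank cut the rate and the market cheered".split()
sms_tokens = "see you at the cafe at noon".split()

# Vocabulary derived from the news-like corpus only.
vocab = frequency_vocabulary(news_tokens, size=5)

in_domain = coverage(vocab, news_tokens)      # same domain as the vocabulary source
out_of_domain = coverage(vocab, sms_tokens)   # different domain (SMS-like text)
```

With these toy lists the vocabulary covers far more of the news-like tokens than of the SMS-like tokens. That gap is the kind of domain dependence the proposed method is designed to reduce.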

Bibliographic details

Source
Consumer Electronics, IEEE Transactions on | 2009, Issue 3 | pp. 1631-1636 | 6 pages
Authors
Minkyu Lim; Kwang-Ho Kim; Ji-Hwan Kim
Author affiliation

Dept. of Computer Science and Engineering, Sogang University, Seoul, Korea (e-mail: {lmkhi, kimkwangho, kimjihwan}@sogang.ac.kr)

Original format: PDF
Language: English
Keywords
Coverage; Domain corpus independent; Embedded speech recognition; Vocabulary
