Audio, Speech, and Language Processing, IEEE/ACM Transactions on

Maximal Figure-of-Merit Framework to Detect Multi-Label Phonetic Features for Spoken Language Recognition



Abstract

Bottleneck features (BNFs) generated with a deep neural network (DNN) have proven to significantly boost spoken language recognition accuracy over basic spectral features. However, BNFs are commonly extracted using language-dependent tied-context phone states as learning targets. Moreover, BNFs are less phonetically expressive than the output layer in a DNN, which is usually not used as a speech feature because its very high dimensionality hinders further post-processing. In this article, we put forth a novel deep learning framework to overcome all of the above issues and evaluate it on the 2017 NIST Language Recognition Evaluation (LRE) challenge. We use manner and place of articulation as speech attributes, which lead to low-dimensional “universal” phonetic features that can be defined across all spoken languages. To model the asynchronous nature of the speech attributes while capturing their intrinsic relationships in a given speech segment, we introduce a new training scheme for deep architectures based on a Maximal Figure of Merit (MFoM) objective. MFoM introduces non-differentiable metrics into the backpropagation-based approach, an issue that is elegantly solved in the proposed framework. The experimental evidence collected on the recent NIST LRE 2017 challenge demonstrates the effectiveness of our solution. In fact, the performance of spoken language recognition (SLR) systems based on spectral features is improved by more than 5% absolute in Cavg. Finally, the F1 metric can be brought from 77.6% up to 78.1% by combining the conventional baseline phonetic BNFs with the proposed articulatory attribute features.
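To make the MFoM idea concrete, the sketch below (PyTorch, illustrative only, not the authors' implementation) shows how a non-differentiable figure of merit such as F1 can be replaced by a smooth surrogate so that a multi-label attribute detector, with manner/place-of-articulation style targets, can be trained by ordinary backpropagation. All layer sizes, attribute counts, and function names are assumptions made for illustration.

```python
# Minimal sketch of MFoM-style training: optimize a smooth surrogate of a
# non-differentiable metric (here F1) for a multi-label attribute detector.
# Sizes, names, and the exact surrogate are illustrative assumptions.

import torch
import torch.nn as nn

N_ATTRIBUTES = 14   # assumed number of manner + place-of-articulation labels
FEAT_DIM = 40       # assumed spectral feature dimension (e.g. log-mel)

class AttributeDetector(nn.Module):
    """Small DNN emitting per-frame multi-label attribute scores."""
    def __init__(self, feat_dim=FEAT_DIM, n_attr=N_ATTRIBUTES,
                 hidden=256, bottleneck=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, bottleneck), nn.ReLU(),  # low-dimensional layer
        )
        self.head = nn.Linear(bottleneck, n_attr)

    def forward(self, x):
        return self.head(self.body(x))  # raw scores, one per attribute

def soft_f1_loss(scores, targets, alpha=5.0):
    """Smooth surrogate of (1 - micro F1).

    scores:  (batch, n_attr) raw network outputs
    targets: (batch, n_attr) multi-hot attribute labels in {0, 1}
    alpha:   sharpness of the sigmoid that smooths the hard 0/1 decisions
    """
    probs = torch.sigmoid(alpha * scores)      # smoothed decisions
    tp = (probs * targets).sum()
    fp = (probs * (1.0 - targets)).sum()
    fn = ((1.0 - probs) * targets).sum()
    f1 = 2.0 * tp / (2.0 * tp + fp + fn + 1e-8)
    return 1.0 - f1                            # minimizing maximizes soft F1

# Toy usage with random frames and random multi-hot attribute labels.
model = AttributeDetector()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, FEAT_DIM)
y = (torch.rand(32, N_ATTRIBUTES) > 0.7).float()
loss = soft_f1_loss(model(x), y)
loss.backward()
opt.step()
```

In the paper's setting, the low-dimensional hidden layer of such a detector would play the role of the “universal” phonetic features; the surrogate above is only one simple way to smooth F1 for backpropagation, not the exact MFoM formulation evaluated in the article.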

