Conference: International Conference on Language Resources and Evaluation

LibriVoxDeEn: A Corpus for German-to-English Speech Translation and German Speech Recognition

Abstract

We present a corpus of sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The speech translation data consist of 110 hours of audio material aligned to over 50k parallel sentences. An even larger dataset comprising 547 hours of German speech aligned to German text is available for speech recognition. The audio data are read speech and thus contain few disfluencies. The quality of the audio and sentence alignments has been checked by a manual evaluation, which shows that speech alignment quality is in general very high. The sentence alignment quality is comparable to that of widely used parallel translation data and can be adjusted by cutoffs on the automatic alignment score. To our knowledge, this corpus is to date the largest resource for German speech recognition and for end-to-end German-to-English speech translation.
