Source: Proceedings of 2012 IEEE 3rd international conference on emergency management and management sciences

The Construction of English-Chinese Parallel Corpus of Medical Works Based on Self-Coded Python Programs


Abstract

To provide sufficient training data for statistical machine(-aided) translation in the medical field, a large-scale English-Chinese parallel corpus of medical works is constructed. Eighteen printed English medical books with Chinese translations are selected as raw material. With the help of an OCR scanner, all texts are recognized, manually proofread, and stored in electronic form. Following a rigid corpus-construction scheme and using a self-coded Python program, the English and Chinese texts are separated, sentence-aligned, and marked up in XML. After careful manual proofreading, an Internet-based corpus retrieval platform is constructed. The present parallel corpus contains 54,522 sentence pairs and more than 2,500,000 English words / Chinese characters, and can be preliminarily applied to the training and testing of statistical machine(-aided) translation research in the medical field.
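The abstract does not reproduce the authors' program, so the following is only a minimal sketch of the three processing steps it names: separating English from Chinese text, aligning sentences, and adding XML markup. It assumes that the proofread OCR output interleaves the two languages line by line, that sentences correspond one-to-one in order, and that the markup uses the element names `corpus`, `pair`, `en`, and `zh`. All function names and element names here are hypothetical and are not taken from the paper.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a line is Chinese if it contains any CJK unified ideograph.
CJK = re.compile(r"[\u4e00-\u9fff]")


def separate(lines):
    """Split mixed input lines into (english, chinese) lists by script."""
    english, chinese = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines left over from OCR
        (chinese if CJK.search(line) else english).append(line)
    return english, chinese


def split_sentences(text, lang):
    """Naive sentence splitter based on terminal punctuation only."""
    if lang == "zh":
        # Full-width Chinese terminators: 。 ! ?
        parts = re.findall(r"[^。!?]+[。!?]?", text)
    else:
        # Split after . ! ? when followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def align(en_sents, zh_sents):
    """Pair sentences by position; refuse to guess when counts differ."""
    if len(en_sents) != len(zh_sents):
        raise ValueError("sentence counts differ; manual alignment needed")
    return list(zip(en_sents, zh_sents))


def to_xml(pairs):
    """Serialize aligned pairs with hypothetical <pair>/<en>/<zh> tags."""
    root = ET.Element("corpus")
    for i, (en, zh) in enumerate(pairs, start=1):
        pair = ET.SubElement(root, "pair", id=str(i))
        ET.SubElement(pair, "en").text = en
        ET.SubElement(pair, "zh").text = zh
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    raw = [
        "The heart pumps blood. It has four chambers.",
        "",
        "心脏泵血。它有四个腔室。",
    ]
    en_lines, zh_lines = separate(raw)
    pairs = align(
        split_sentences(" ".join(en_lines), "en"),
        split_sentences("".join(zh_lines), "zh"),
    )
    print(to_xml(pairs))
```

Raising an error on mismatched sentence counts, rather than guessing, mirrors the manual proofreading pass the abstract describes: real medical text contains abbreviations, decimals, and one-to-many translations that a punctuation-based splitter and positional pairing will get wrong.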
