Method for automatically identifying sentence boundaries in noisy conversational data

Abstract

Sentence boundaries in noisy conversational transcription data are automatically identified. Noise and transcription symbols are removed, and a training set is formed with sentence boundaries marked based on long silences or on manual markings in the transcribed data. Frequencies of head and tail n-grams that occur at the beginning and ending of sentences are determined from the training set. N-grams that occur a significant number of times in the middle of sentences in relation to their occurrences at the beginning or ending of sentences are filtered out. A boundary is marked before every head n-gram and after every tail n-gram occurring in the conversational data and remaining after filtering. Turns are identified. A boundary is marked after each turn, unless the turn ends with an impermissible tail word or is an incomplete turn. The marked boundaries in the conversational data identify sentence boundaries.
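The steps in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names `train_ngrams` and `segment_turns`, the `max_mid_ratio` threshold, and the use of word unigrams (`n=1`) are all assumptions made for illustration. Noise removal is assumed to have been done already, and detection of incomplete turns is left out because the abstract does not say how it is performed.

```python
from collections import Counter


def train_ngrams(sentences, n=1, max_mid_ratio=0.5):
    """Collect head/tail n-grams from pre-segmented training sentences.

    An n-gram is dropped when its mid-sentence count exceeds
    max_mid_ratio times its count at the sentence start (or end).
    The threshold form is an assumption, not taken from the source.
    """
    head, tail, mid = Counter(), Counter(), Counter()
    for s in sentences:
        if len(s) < n:
            continue
        head[tuple(s[:n])] += 1
        tail[tuple(s[-n:])] += 1
        # n-grams that start strictly inside the sentence
        for i in range(1, len(s) - n):
            mid[tuple(s[i:i + n])] += 1
    heads = {g for g, c in head.items() if mid[g] <= max_mid_ratio * c}
    tails = {g for g, c in tail.items() if mid[g] <= max_mid_ratio * c}
    return heads, tails


def segment_turns(turns, heads, tails, bad_tail_words=frozenset(), n=1):
    """Mark boundaries before heads, after tails, and after each turn
    unless the turn ends with an impermissible tail word."""
    tokens, cuts = [], set()
    for turn in turns:
        tokens.extend(turn)
        if turn and turn[-1] not in bad_tail_words:
            cuts.add(len(tokens))          # boundary after the turn
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in heads:
            cuts.add(i)                    # boundary before a head n-gram
        if gram in tails:
            cuts.add(i + n)                # boundary after a tail n-gram
    cuts -= {0, len(tokens)}               # stream edges are implicit
    bounds = [0] + sorted(cuts) + [len(tokens)]
    return [tokens[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


# Hypothetical toy data, for illustration only.
training = [
    "so i went home okay".split(),
    "so that is fine okay".split(),
    "well i think so".split(),
    "and so it goes right".split(),
    "we said so then okay".split(),
]
heads, tails = train_ngrams(training)
turns = [
    "well i am fine okay and you".split(),
    "we went to".split(),
    "the store right".split(),
]
for sentence in segment_turns(turns, heads, tails, bad_tail_words={"to"}):
    print(" ".join(sentence))
```

In this toy run, "so" is filtered out of both the head and tail sets because it appears mid-sentence as often as at the edges, and the second turn is joined to the third because it ends with the word "to", which is treated here as an impermissible tail word.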
