Journal: The Electronic Library

Semi-automatic extraction of multiword terms from domain-specific corpora

Abstract

Purpose - A hybrid approach is presented that combines linguistic and statistical information to semi-automatically extract multiword term candidates from texts.

Design/methodology/approach - The method is designed to be domain and language independent, focusing on languages with rich morphology. As a use case, it is applied here to extract multiword terms from Serbian texts in the agricultural engineering domain. Predefined syntactic structures were used for multiword terms. For each structure, a finite state transducer was developed that recognizes text sequences having that structure and outputs each sequence in a normalized form, so that different inflectional forms of the same multiword term are counted together. Term candidates were further filtered by their frequencies and evaluated by two domain experts.

Findings - By using language resources such as electronic dictionaries and grammars, 1,523 multiword term candidates were recognized in a corpus having 42,260 different simple word forms, and 928 of them were accepted as multiword terms. Of these, 870 were new, that is, not already contained in the existing electronic dictionary of compounds for Serbian, and they were used to enrich the dictionary.

Originality/value - The paper presents a methodology that can significantly contribute to the development of terminology lexicons in different areas. In this particular use case, some important agricultural engineering concepts were extracted from the text, but the approach could be used for other domains and languages as well.
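
The recognize, normalize, count, and filter steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper uses finite state transducers backed by electronic dictionaries, whereas this sketch assumes the text is already tagged and simply matches part-of-speech sequences. The names `Token`, `PATTERNS`, `extract_candidates`, and `min_freq`, the tag set, and the English toy sentences are all assumptions made for illustration. Joining lemmas stands in for the transducer's normalized output; for a morphologically rich language such as Serbian, a real normalized form would also have to preserve agreement between the words of the term.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # word form as it appears in the text
    lemma: str  # dictionary form, assumed to come from a lexicon or tagger
    pos: str    # part-of-speech tag (hypothetical tag set: N, A, V, DET)


# Hypothetical syntactic structures for multiword terms.
PATTERNS = [("A", "N"), ("N", "N"), ("A", "A", "N")]


def extract_candidates(sentences, patterns=PATTERNS, min_freq=2):
    """Count pattern matches keyed by a lemma-based normalized form,
    then keep only candidates that occur at least min_freq times."""
    counts = Counter()
    for sentence in sentences:
        for start in range(len(sentence)):
            for pattern in patterns:
                window = sentence[start:start + len(pattern)]
                if len(window) != len(pattern):
                    continue  # pattern would run past the end of the sentence
                if all(tok.pos == tag for tok, tag in zip(window, pattern)):
                    # Different inflected forms map to the same key.
                    counts[" ".join(tok.lemma for tok in window)] += 1
    return {term: n for term, n in counts.items() if n >= min_freq}


# Toy data: two inflectional variants of one term, plus a term seen once.
SENTENCES = [
    [Token("Combine", "combine", "N"), Token("harvesters", "harvester", "N"),
     Token("cut", "cut", "V"), Token("grain", "grain", "N")],
    [Token("A", "a", "DET"), Token("combine", "combine", "N"),
     Token("harvester", "harvester", "N"), Token("works", "work", "V")],
    [Token("The", "the", "DET"), Token("drip", "drip", "N"),
     Token("irrigation", "irrigation", "N"), Token("fails", "fail", "V")],
]

if __name__ == "__main__":
    print(extract_candidates(SENTENCES))
```

In this sketch, "Combine harvesters" and "combine harvester" are counted under one key, so the term passes a frequency threshold of 2 while the single occurrence of "drip irrigation" is filtered out. The surviving candidates would then go to domain experts for evaluation, as the abstract describes.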
