Workshop on Advances in Discourse Analysis and its Computational Aspects

Explicit and implicit discourse relations from a cross-lingual perspective - from experience in working on Chinese discourse annotation

Abstract

In computational linguistics and natural language processing, progress in discourse analysis has been relatively slow compared with syntactic parsing or semantic analysis (e.g., word sense disambiguation, semantic role labeling). In an age when statistical, data-driven approaches dominate the field, a common linguistic resource that is widely accepted by the community is key to advancing the state of the art in this area. Creating consistently annotated data for discourse analysis is particularly challenging because one has to deal with larger linguistic structures and there are few linguistic rules to follow. The key to successful discourse annotation is to identify a well-grounded linguistic theory that can be easily operationalized. In the Penn Discourse Treebank (PDTB; Prasad et al. 2008, Webber and Joshi 1998), the field may have found such a theory. In the PDTB conception, discourse relations revolve around discourse connectives, where each discourse connective is a predicate that takes two arguments. In this way, discourse annotations are anchored by discourse connectives and are thus lexicalized. In our view, lexicalization has been crucial to the success of the PDTB as an annotation project, a large-scale effort characterized by high inter-annotator agreement, a standard metric for annotation consistency. Lexicalization grounds highly abstract discourse relations in specific lexical items. In doing so, it localizes the ambiguity in discourse relations to discourse connectives: a lexical item can have either a discourse connective use or a non-discourse connective use (e.g., "when"), and one discourse connective can be ambiguous between different discourse relations (e.g., "since"). As a result, it reduces the cognitive load of the annotation task, because each annotator can focus on only one discourse connective at a time instead of scores of discourse relations. This in turn enlarges the annotator pool, since more annotators can perform the task without extensive training. The long list of annotators who worked on the PDTB annotation attests to this observation. A larger annotator pool and a shorter learning curve translate into the scalability of such an approach.
If lexicalization is so important to discourse annotation, what about discourse relations that are not anchored by an explicit discourse connective? The PDTB addresses this by assuming there is an implicit discourse connective that connects the two arguments, which are typically (parts of) adjacent sentences. This is operationalized by treating the punctuation marks (e.g., periods) that serve as boundaries between two adjacent sentences as anchors of implicit discourse relations. The specific discourse relation is determined by testing which discourse connective can be plausibly inserted between the two adjacent sentences. In doing so, the PDTB assumes that (1) the range of possible discourse relations anchored by implicit discourse connectives is basically the same as that anchored by explicit discourse connectives, and (2) discourse relations anchored by implicit discourse connectives are mostly local. The first assumption is largely borne out in the PDTB: either a discourse connective can be inserted between two adjacent sentences, or they are related by the fact that they talk about the same entities, or there is no relation between them. The last possibility has a direct bearing on the second assumption: if there is no relation between two adjacent sentences, does that mean that these sentences have no discourse relations at all with the rest of the text, or that they are related to other discourse segments that are non-local? It is reasonable to assume that all discourse segments are related in a coherent piece of text, and a large number of such "no-relations" would call for a significant expansion of the PDTB approach.
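The lexicalized scheme described above can be illustrated with a minimal Python sketch. It is not the PDTB's actual file format or tooling: the `DiscourseRelation` class, its field names, the naive sentence splitter, and the helper functions are all hypothetical, chosen only to show a connective acting as a predicate over two arguments and sentence-final punctuation acting as the anchor for implicit-relation candidates.

```python
import re
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class DiscourseRelation:
    """Hypothetical record: a connective viewed as a predicate over two arguments."""
    arg1: str
    arg2: str
    anchor: str                       # the connective itself, or a punctuation mark
    connective: Optional[str] = None  # None until a connective is given or inserted
    explicit: bool = False


def split_sentences(text: str) -> List[str]:
    # Naive splitter (an assumption for illustration): break after ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def implicit_candidates(text: str) -> List[DiscourseRelation]:
    """One candidate per pair of adjacent sentences, anchored at the
    punctuation mark that closes the first sentence."""
    sents = split_sentences(text)
    return [
        DiscourseRelation(arg1=a, arg2=b, anchor=a[-1])
        for a, b in zip(sents, sents[1:])
    ]


def insert_connective(rel: DiscourseRelation, connective: str) -> DiscourseRelation:
    """Record the connective an annotator judged plausible between the two arguments."""
    return replace(rel, connective=connective)


# An explicit relation is anchored by the connective that appears in the text.
explicit_rel = DiscourseRelation(
    arg1="The game was cancelled",
    arg2="it rained",
    anchor="because",
    connective="because",
    explicit=True,
)

candidates = implicit_candidates("It rained. The game was cancelled. We went home.")
annotated = insert_connective(candidates[0], "so")
```

In this sketch every adjacent sentence pair yields a candidate, which mirrors the locality assumption discussed above; a pair for which no connective fits would simply keep `connective=None`.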
While it might not be too much to expect that the same high-level discourse relations