In computational linguistics and natural language processing, progress in discourse analysis has been relatively slow compared with syntactic parsing or semantic analysis (e.g., word sense disambiguation, semantic role labeling). In an age when statistical, data-driven approaches dominate the field, a common linguistic resource that is widely accepted by the community is key to advancing the state of the art in this area. Creating consistently annotated data for discourse analysis is particularly challenging because one has to deal with larger linguistic structures and there are few linguistic rules to follow. The key to successful discourse annotation is to identify a well-grounded linguistic theory that can be easily operationalized. In the Penn Discourse Treebank (PDTB; Prasad et al. 2008, Webber and Joshi 1998) the field may have found such a theory. In the PDTB conception, discourse relations revolve around discourse connectives, where each discourse connective is a predicate that takes two arguments. Discourse annotations are thus anchored by discourse connectives, which is to say they are lexicalized. In our view, lexicalization has been crucial to the success of the PDTB as an annotation project, a large-scale effort characterized by high inter-annotator agreement, a standard metric for annotation consistency. Lexicalization grounds highly abstract discourse relations in specific lexical items. In doing so, it localizes the ambiguity in discourse relations to discourse connectives: a lexical item can have either a discourse connective use or a non-discourse connective use (e.g., "when"), and one discourse connective can be ambiguous between different discourse relations (e.g., "since"). As a result, it reduces the cognitive load of the annotation task, because each annotator can focus on one discourse connective at a time instead of scores of discourse relations. 
This in turn enlarges the annotator pool, since more annotators can perform the task without extensive training. The long list of annotators who worked on the PDTB annotation attests to this observation. A larger annotator pool and a shorter learning curve translate into scalability for such an approach. If lexicalization is so important to discourse annotation, what about discourse relations that are not anchored by an explicit discourse connective? The PDTB addresses this by assuming there is an implicit discourse connective that connects its two arguments, which are typically (parts of) adjacent sentences. This is operationalized by treating the punctuation marks (e.g., periods) that serve as boundaries between two adjacent sentences as anchors of implicit discourse relations. The specific discourse relation is determined by testing which discourse connective can be plausibly inserted between these two adjacent sentences. In doing so, the PDTB assumes that (1) the range of possible discourse relations anchored by implicit discourse connectives is basically the same as the range anchored by explicit discourse connectives, and (2) discourse relations anchored by implicit discourse connectives are mostly local. The first assumption is largely borne out in the PDTB. Either a discourse connective can be inserted between two adjacent sentences, or they are related by the fact that they talk about the same entities, or there is no relation between them. The last possibility has a direct bearing on the second assumption: if there is no relation between two adjacent sentences, does that mean that these sentences have no discourse relations at all with the rest of the text, or that they are related to other discourse segments that are non-local? It is reasonable to assume that all discourse segments are related in a coherent piece of text, and a large number of such "no-relations" would call for a significant expansion of the PDTB approach. 
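The anchoring scheme described above can be sketched as a small data structure. This is only an illustrative sketch, not the PDTB file format or annotation tooling: the class `DiscourseRelation`, the function `implicit_candidates`, the field names, and the label strings are all hypothetical names chosen for this example. It shows a connective treated as a predicate over two arguments, and one implicit-relation slot created at each boundary between adjacent sentences, to be filled in by an annotator using the insertion test.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DiscourseRelation:
    """A connective viewed as a predicate over two arguments (hypothetical schema)."""
    arg1: str
    arg2: str
    connective: Optional[str]    # explicit connective, or the one an annotator inserts
    explicit: bool
    label: str = "unannotated"   # illustrative: a relation sense, "entity", or "none"


def implicit_candidates(sentences: List[str]) -> List[DiscourseRelation]:
    """Create one implicit-relation slot per boundary between adjacent sentences."""
    return [
        DiscourseRelation(arg1=a, arg2=b, connective=None, explicit=False)
        for a, b in zip(sentences, sentences[1:])
    ]


# Toy text: three sentences give two sentence boundaries, hence two slots.
sentences = [
    "It rained all night.",
    "The match was cancelled.",
    "Ticket sales had been strong this season.",
]
slots = implicit_candidates(sentences)

# An annotator applies the insertion test at the first boundary:
# "so" can plausibly be inserted, which suggests a causal relation.
slots[0].connective = "so"
slots[0].label = "cause"

# No connective fits at the second boundary; the annotator records that instead.
slots[1].label = "none"
```

Under this representation, a slot labelled "none" is exactly the case discussed above: it records only that no local relation was found, and says nothing about whether the segment is related to non-adjacent text.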
While it might not be too much to expect that the same high-level discourse relations hold across languages, it is almost certainly too much to expect that discourse relations are lexicalized in the same way across languages. The question is whether a lexicalized approach to discourse analysis can still be maintained in languages where discourse relations are lexicalized in ways that are significantly different from English. Our experience in a pilot PDTB-style Chinese discourse annotation project shows that the lexicalized approach can be effectively adopted, although significant adaptations have to be made. Chinese has the same types of discourse connectives (subordinate conjunctions, coordinate conjunctions, and discourse adverbials) as English, but they occur much less frequently because they can often be dropped. The ratio of implicit to explicit connectives is about 80/20 (Zhou and Xue, 2012), rather than the roughly 50/50 split reported for the PDTB (Prasad et al. 2008). Nevertheless, by identifying punctuation marks as boundaries of discourse segments and testing whether lexicalized discourse relations hold between adjacent comma-separated discourse segments, we are able to show that Chinese discourse annotation can be performed with very good consistency. More evidence has to be gathered from other languages to test the feasibility of lexicalized approaches to discourse annotation in a multilingual setting, and such evidence should come soon, now that the approach has been adopted in discourse annotation projects for a variety of languages.