The Winograd Schema Challenge (WSC) is a pronoun resolution task for which deep semantic knowledge is required to achieve high performance. Until now it has been assumed that human performance on the WSC is nearly at ceiling, but evidence for this has been mainly anecdotal. Here we present the results of a large online experiment that both establishes a baseline for human performance on the WSC and demonstrates the importance of human testing, not only as a means of validating a particular corpus, but more fundamentally as a guide in defining desirable characteristics for Winograd Schemas (WS).