The scarcity of large corpora of reading-disambiguated words is a major limitation for linguistic analysis and for developing statistical approaches to word reading disambiguation. Like the meanings of words, their readings are usually not written out in documents, so human annotation is necessary but expensive. This study proposes a method for constructing a reading-disambiguated dataset for word reading disambiguation. The method builds a dataset of sentences in which words with ambiguous readings (pronunciations), called heteronyms, are tagged with the correct reading. In this method, a word with a unique reading is associated with a heteronym, and this unambiguous word is used as a query word to collect sentences that contain it. The query word in the collected sentences is then replaced by the original ambiguous word, and the reading corresponding to the query word is tagged as the pronunciation of the heteronym. Experiments confirmed that the method collected data effectively and that the collected data were numerically balanced among all readings of the heteronym.
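The substitution procedure described above can be illustrated with a minimal sketch. It is not the implementation used in the study: the English heteronym "tear", the query words "rip" and "teardrop", the informal reading labels "tair" and "teer", the toy corpus, and the function name `build_dataset` are all hypothetical and chosen only for illustration. The sketch also assumes that sentences have already been retrieved and that simple whole-word matching is enough to locate the query word.

```python
import re
from collections import Counter

# Hypothetical mapping: heteronym -> {reading label: query word with a unique reading}.
# The words and reading labels are illustrative assumptions, not data from the study.
QUERY_WORDS = {
    "tear": {"tair": "rip", "teer": "teardrop"},
}


def build_dataset(heteronym, sentences, query_words=QUERY_WORDS):
    """Return (sentence, heteronym, reading) triples built by substitution."""
    dataset = []
    for reading, query in query_words[heteronym].items():
        # Match the query word only as a whole word.
        pattern = re.compile(r"\b" + re.escape(query) + r"\b")
        for sentence in sentences:
            if pattern.search(sentence):
                # Replace the unambiguous query word with the heteronym and
                # tag the sentence with the reading tied to the query word.
                replaced = pattern.sub(heteronym, sentence)
                dataset.append((replaced, heteronym, reading))
    return dataset


# Toy corpus standing in for sentences collected with the query words.
corpus = [
    "Please do not rip the page.",
    "A single teardrop rolled down her cheek.",
    "The weather was fine.",
]

dataset = build_dataset("tear", corpus)

# Count tagged sentences per reading to inspect how balanced the data is.
reading_counts = Counter(reading for _, _, reading in dataset)
```

Because each reading has its own query word, the number of tagged sentences per reading can be inspected through `reading_counts` and adjusted by collecting more sentences for whichever query word is underrepresented.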