Finding the best way to utilize external/domain knowledge to enhance traditional text mining has been a challenging task. The difficulty centers on the lack of means in representing a document with external/domain knowledge integrated. Graphs are powerful and versatile tools, useful in various subfields of science and engineering for their simple illustration of complicated problems. However, the graph-based approach on knowledge representation and discovery remains relatively unexplored. In this thesis, I propose a graph-based text mining system to incorporate semantic knowledge, document section knowledge, document linkage knowledge, and document category knowledge into the tasks of text clustering and topic analysis. I design a novel term-level graph knowledge representation and a graph-based clustering algorithm to incorporate semantic and document section knowledge for biomedical literature clustering and topic analysis. I present a Markov Random Field (MRF) with a Relaxation Labeling (RL) algorithm to incorporate document linkage knowledge. I evaluate different types of linkage among documents, including explicit linkage such as hyperlink and citation link, implicit linkage such as coauthor link and co-citation link, and pseudo linkage such as similarity link. I develop a novel semantic-based method to integrate Wikipedia concepts and categories as external knowledge into traditional document clustering. In order to support these new approaches, I develop two automated algorithms to extract multiword phrases and ontological concepts, respectively. The evaluations of news collection, web dataset, and biomedical literature prove the effectiveness of the proposed methods.;In the experiment of document clustering, the proposed term-level graph-based method not only outperforms the baseline k-means algorithm in all configurations but also is superior in terms of efficiency. The MRF-based algorithm significantly improves spherical k-means and model-based k-means clustering on the datasets containing explicit or implicit linkage; the Wikipedia knowledge-based clustering also improves the document-content-only-based clustering. On the task of topic analysis, the proposed graph presentation, sub graph detection, and graph ranking algorithm can effectively identify corpus-level topic terms and cluster-level topic terms.
展开▼