The growing influence of the internet has led to a huge number of born-digital news articles being published online. It is becoming increasingly difficult to search this vast collection of documents in order to prevent duplication. Keywords are the most salient words in a textual document. We introduce a graph-based approach to keyword extraction that uses term co-occurrence in textual news articles and integrates weighted closeness centrality (CC) with the weighted clustering coefficient (WC). We also propose a metric, the co-occurrence index (CI), which is based on the extracted keywords and measures the similarity between any two textual news articles. The proposed method is independent of the 'bag-of-words' model and shows a significant performance improvement over existing methods.
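The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how CC and WC are integrated or how CI is defined. The sketch therefore assumes a sliding co-occurrence window of two tokens, edge distance equal to the inverse of the co-occurrence count, Barrat's weighted clustering coefficient, a plain sum of the two scores, and a Jaccard overlap of keyword sets as a stand-in for CI. All function names are hypothetical, and no stop-word filtering is applied.

```python
import heapq
import re
from collections import defaultdict
from itertools import combinations


def build_graph(text, window=2):
    """Undirected co-occurrence graph: graph[u][v] = co-occurrence count.
    The window size is an assumption, not taken from the abstract."""
    tokens = re.findall(r"[a-z]+", text.lower())
    graph = defaultdict(lambda: defaultdict(int))
    for i, u in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            if u != v:
                graph[u][v] += 1
                graph[v][u] += 1
    return graph


def closeness(graph, source):
    """Weighted closeness: reachable nodes / sum of shortest-path distances.
    Assumes distance = 1 / weight, so frequent co-occurrence means 'closer'."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:  # Dijkstra's algorithm
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in graph[u].items():
            nd = d + 1.0 / w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0


def clustering(graph, node):
    """Barrat's weighted clustering coefficient (one common definition)."""
    nbrs = graph[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    strength = sum(nbrs.values())
    total = sum(
        nbrs[j] + nbrs[h] for j, h in combinations(nbrs, 2) if h in graph[j]
    )
    return total / (strength * (k - 1))


def extract_keywords(text, top_k=5, window=2):
    """Rank terms by CC + WC. The plain sum is an assumed combination rule."""
    graph = build_graph(text, window)
    scores = {n: closeness(graph, n) + clustering(graph, n) for n in list(graph)}
    return sorted(scores, key=lambda n: (-scores[n], n))[:top_k]


def keyword_overlap(keywords_a, keywords_b):
    """Jaccard overlap of two keyword sets; a stand-in, not the paper's CI."""
    a, b = set(keywords_a), set(keywords_b)
    return len(a & b) / len(a | b) if a | b else 0.0
```

Two articles could then be compared with `keyword_overlap(extract_keywords(text_a), extract_keywords(text_b))`, where a value near 1 suggests likely duplicate content under these assumptions.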