Published in: Proceedings 2015 Resilience Week

Optimal stop word selection for text mining in critical infrastructure domain


Abstract

Eliminating all stop words from the feature space is standard preprocessing practice in text mining, regardless of the domain to which it is applied. However, this may discard important information and thereby reduce the accuracy of the text mining algorithm. This paper therefore proposes a novel methodology for selecting the optimal set of domain-specific stop words to improve text mining accuracy. First, the presented methodology retains all stop words during the text preprocessing phase. Then, an evolutionary technique is used to extract the optimal set of stop words, that is, the set that results in the best classification accuracy. The methodology was implemented on a corpus of open-source news articles related to critical infrastructure hazards. The first step in mining geo-dependencies among critical infrastructures from text is text classification. To this end, article content was classified into two classes: 1) text content with geo-location information, and 2) text content without geo-location information. The classification accuracy of the presented methodology was compared to the accuracies of four other test cases. Experimental results with 10-fold cross-validation showed that the presented method yielded an increase of 1.76% or higher in the True Positive (TP) rate and an increase of 2.27% or higher in the True Negative (TN) rate compared to the other techniques.
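The selection step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not name the evolutionary operators, the classifier, or the candidate stop word list. The genetic algorithm below (one-point crossover, bit-flip mutation, elitism), the 1-nearest-neighbor classifier with Jaccard similarity, the leave-one-out evaluation standing in for 10-fold cross-validation, and the toy corpus are all assumptions made for illustration.

```python
import random

def loo_accuracy(docs, labels, removed):
    """Leave-one-out accuracy of a 1-NN Jaccard classifier (assumed stand-in
    for the paper's classifier and its 10-fold cross-validation)."""
    bags = [set(w for w in d.lower().split() if w not in removed) for d in docs]
    correct = 0
    for i, bag in enumerate(bags):
        # Nearest other document by Jaccard similarity of token sets.
        best = max((j for j in range(len(bags)) if j != i),
                   key=lambda j: len(bag & bags[j]) / (len(bag | bags[j]) or 1))
        correct += labels[best] == labels[i]
    return correct / len(docs)

def evolve_stop_words(candidates, fitness, pop_size=20, generations=30,
                      p_mut=0.1, seed=0):
    """Search for the subset of candidate stop words whose removal maximizes
    fitness. Requires at least 2 candidates and pop_size >= 4."""
    rng = random.Random(seed)
    n = len(candidates)

    def decode(mask):
        # A True bit means "treat this word as a stop word and remove it".
        return {w for w, bit in zip(candidates, mask) if bit}

    # Random masks plus the two extremes: remove nothing, remove everything.
    pop = [tuple(rng.random() < 0.5 for _ in range(n)) for _ in range(pop_size - 2)]
    pop += [tuple([False] * n), tuple([True] * n)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda m: fitness(decode(m)), reverse=True)
        elite = ranked[: pop_size // 2]           # elitism: best half survives
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)             # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit with probability p_mut.
            children.append(tuple(bit != (rng.random() < p_mut) for bit in child))
        pop = elite + children
    best = max(pop, key=lambda m: fitness(decode(m)))
    return decode(best), fitness(decode(best))

# Hypothetical toy corpus: label 1 = contains geo-location information.
docs = [
    "flooding in boise near the dam",
    "outage at the substation in idaho falls",
    "pipeline leak near the river in ohio",
    "the grid operator reported an outage",
    "officials discuss the pipeline safety budget",
    "the report reviews dam maintenance costs",
]
labels = [1, 1, 1, 0, 0, 0]
candidates = ["the", "in", "near", "at", "an"]  # assumed candidate stop words

def fitness(removed):
    return loo_accuracy(docs, labels, removed)

best_removed, best_score = evolve_stop_words(candidates, fitness)
print(sorted(best_removed), best_score)
```

Because the initial population contains both extremes (remove no stop words, remove all of them) and the best half always survives, the returned score is never worse than either baseline. On a real corpus the fitness function would wrap the actual classifier and cross-validation scheme.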
