Published in: International Journal of Applied Engineering Research

An Efficient Informal Data Processing Method by Removing Duplicated Data

Abstract

Useful information can be extracted from many social networking services (SNS). Data uploaded by users include preferences or comments on specific topics, which can be used for sentiment analysis. This sentiment information may relate to various kinds of social information, personal services, and so on. However, SNS data contain a large amount of duplicated data and spam, which slows processing and reduces the accuracy of sentiment analysis. This paper presents an effective method for processing informal big data by filtering out duplicated data or spam. The Hadoop Distributed File System (HDFS) and MapReduce are used to extract sentiment information through machine learning. Experimental results show improvements in both processing performance and the accuracy of sentiment analysis. When duplicates and spam are filtered out, the upload time is reduced by about 4 seconds for 133,500 data items. The analysis time after removing duplicated data and spam is reduced by 36 percent and 41.26 percent, respectively, for 133,500 data items. The incorrect spam detection rate is 20.53 percent for 35,320 data items.
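
The abstract does not specify how duplicates are detected, so the following is only a minimal sketch of one plausible approach: exact-match deduplication expressed in a map/shuffle/reduce style, simulated in plain Python rather than run on Hadoop. The function names (`normalize`, `map_phase`, `shuffle`, `reduce_phase`, `deduplicate`), the normalization rule, and the sample posts are all assumptions made for illustration, not the authors' implementation. Spam filtering is not covered.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match.

    This normalization rule is an assumption, not taken from the paper.
    """
    return re.sub(r"\s+", " ", text.strip().lower())


def map_phase(posts):
    """Emit (key, post) pairs, where the key is the normalized text."""
    for post in posts:
        yield normalize(post), post


def shuffle(pairs):
    """Group values by key, as a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Keep the first post seen for each key and drop the duplicates."""
    return [values[0] for values in groups.values()]


def deduplicate(posts):
    """Run the three simulated phases end to end."""
    return reduce_phase(shuffle(map_phase(posts)))


# Hypothetical sample data: the first, second, and fourth posts are duplicates
# after normalization.
posts = [
    "Great phone, love the camera!",
    "great phone,  love the camera!",
    "Battery life is disappointing.",
    "Great phone, love the camera!",
]
unique_posts = deduplicate(posts)
```

Keying on the normalized text means all copies of a post reach the same reducer, which is why this kind of filtering fits the MapReduce model; near-duplicate detection would need a different key, such as a hash of shingles.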
