Training and Prediction Data Discrepancies: Challenges of Text Classification with Noisy, Historical Data

Abstract

Industry datasets used for text classification are rarely created for that purpose. In most cases, the data and target predictions are a byproduct of accumulated historical data, typically fraught with noise that is present in both the text-based documents and the target labels. In this work, we address the question of how well performance metrics computed on noisy, historical data reflect the performance on the intended future machine learning model input. The results demonstrate the utility of dirty training datasets used to build prediction models for cleaner (and different) prediction inputs.

Bibliographic record

Source: Conference on Empirical Methods in Natural Language Processing | 2018 | xv 216 p. | 6 pages
Authors: Emilia Apostolova; R. Andrew Kreek
Original format: PDF
Chinese Library Classification: Programming, software engineering

Similar documents

1. Using the revised EM algorithm to remove noisy data for improving the one-against-the-rest method in binary text classification [J]. Hyoungdong Han, Youngjoong Ko, Jungyun Seo. Information Processing & Management. 2007, No. 5.
2. Authorship Attribution of Short Historical Arabic Texts using Stylometric Features and a KNN Classifier with Limited Training Data [J]. Fatma Howedi, Masnizah Mohd, Zahra Aborawi Aborawi. Journal of computer sciences. 2020, No. 10.
3. Training and Prediction Data Discrepancies: Challenges of Text Classification with Noisy, Historical Data [C]. Emilia Apostolova, R. Andrew Kreek. Fourth workshop on noisy user-generated text. 2018.
4. Synthesizing additional training data to increase the classification accuracy of visual data using feed-forward neural networks on small datasets [D]. Qumsieh, Rafi. 2017.
5. Event-Dataset: Temporal information retrieval and text classification dataset [O]. Shafiq Ur Rehman Khan, Muhammad Arshad Islam. 2019.
6. Training and Prediction Data Discrepancies: Challenges of Text Classification with Noisy, Historical Data [O]. R. Andrew Kreek, Emilia Apostolova. 2018.
