Venue: Conference on Empirical Methods in Natural Language Processing

Boosting Text Classification Performance on Sexist Tweets by Text Augmentation and Text Generation Using a Combination of Knowledge Graphs

Abstract

Text classification models have been heavily utilized for a wide range of natural language processing problems. Like any other machine learning model, these classifiers depend heavily on the size and quality of the training dataset: insufficient and unbalanced datasets lead to poor performance. One solution to poor datasets is to take advantage of world knowledge, in the form of knowledge graphs, to improve the training data. In this paper, we use ConceptNet and Wikidata to improve sexist tweet classification by two methods: (1) text augmentation and (2) text generation. In our text generation approach, we generate new tweets by replacing words with data acquired from ConceptNet relations in order to increase the size of our training set. This method is very helpful with frustratingly small datasets; it preserves the label and increases diversity. In our text augmentation approach, the words of each tweet are augmented (by concatenation) with words extracted from their ConceptNet relations and with their descriptions extracted from Wikidata, so the number of tweets in each class remains the same while the range of each tweet increases. Our experiments show that our approach significantly improves sexist tweet classification across all of our machine learning models. Our approach can be readily applied to other small datasets, such as hate speech or abusive language, and to other text classification problems using any machine learning model.
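The two methods described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lookup tables `TOY_RELATED` and `TOY_DESCRIPTIONS` are hypothetical stand-ins for data that would come from ConceptNet relations and Wikidata descriptions, and the function names `generate_tweets` and `augment_tweet` are assumptions made for illustration. Whitespace tokenization is also a simplification.

```python
from typing import Dict, List

# Hypothetical stand-ins for knowledge-graph lookups. A real pipeline would
# fill these from ConceptNet relations and Wikidata descriptions.
TOY_RELATED: Dict[str, List[str]] = {
    "driver": ["person", "car"],
    "kitchen": ["room", "cooking"],
}
TOY_DESCRIPTIONS: Dict[str, str] = {
    "kitchen": "room used for food preparation",
}


def generate_tweets(tweet: str, related: Dict[str, List[str]]) -> List[str]:
    """Text generation: build one new tweet per (word, related word) swap.

    Each variant is meant to inherit the label of the original tweet,
    so the training set grows without new annotation.
    """
    tokens = tweet.lower().split()
    variants = []
    for i, token in enumerate(tokens):
        for substitute in related.get(token, []):
            variants.append(" ".join(tokens[:i] + [substitute] + tokens[i + 1:]))
    return variants


def augment_tweet(
    tweet: str,
    related: Dict[str, List[str]],
    descriptions: Dict[str, str],
) -> str:
    """Text augmentation: concatenate related words and descriptions.

    The number of tweets is unchanged; each tweet simply gets longer.
    """
    tokens = tweet.lower().split()
    extra: List[str] = []
    for token in tokens:
        extra.extend(related.get(token, []))
        if token in descriptions:
            extra.append(descriptions[token])
    return " ".join(tokens + extra)
```

Under these assumptions, `generate_tweets` turns one labeled tweet into several labeled variants, while `augment_tweet` returns exactly one enriched string per input tweet. Either output could then be fed to any downstream classifier.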