Home > Foreign Conference Papers > International Computer Conference, Computer Society of Iran > The Effect of Using Masked Language Models in Random Textual Data Augmentation

The Effect of Using Masked Language Models in Random Textual Data Augmentation

Abstract

Powerful yet simple augmentation techniques have significantly helped modern deep learning-based text classifiers become more robust in recent years. Although these augmentation methods have proven effective, they often rely on random or non-contextualized operations to generate new data. In this work, we modify a specific augmentation method, Easy Data Augmentation (EDA), with more sophisticated text editing operations powered by masked language models such as BERT and RoBERTa, to analyze the benefits or setbacks of creating more linguistically meaningful and, ideally, higher-quality augmentations. Our analysis demonstrates that using a masked language model for word insertion almost always achieves better results than the original method, but at the cost of additional time and compute, which can be partially offset by deploying a lighter and smaller language model such as DistilBERT.
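The contrast the abstract draws, random insertion (as in EDA) versus masked-language-model-guided insertion, can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the function names, the synonym dictionary, and the `fill_mask` callable (which would wrap an MLM such as DistilBERT in practice) are all hypothetical.

```python
import random

def random_insert(tokens, synonyms, n=1, seed=0):
    """EDA-style random insertion: pick a random word that has synonyms
    and insert one of its synonyms at a random position (no context used)."""
    rng = random.Random(seed)
    out = list(tokens)
    for _ in range(n):
        candidates = [w for w in out if w in synonyms]
        if not candidates:
            break
        word = rng.choice(candidates)
        out.insert(rng.randrange(len(out) + 1), rng.choice(synonyms[word]))
    return out

def mlm_insert(tokens, fill_mask, n=1, seed=0):
    """Contextual insertion in the spirit of the paper's modification:
    place a [MASK] at a random position and let a masked language model
    (passed in as the `fill_mask` callable) choose the filler word."""
    rng = random.Random(seed)
    out = list(tokens)
    for _ in range(n):
        pos = rng.randrange(len(out) + 1)
        masked_sentence = " ".join(out[:pos] + ["[MASK]"] + out[pos:])
        out = out[:pos] + [fill_mask(masked_sentence)] + out[pos:]
    return out
```

With the Hugging Face `transformers` library, `fill_mask` could wrap `pipeline("fill-mask", model="distilbert-base-uncased")` and return the top predicted token; the random variant needs only a static synonym table, which is why it is cheaper but blind to context.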

