
Optimizing Deeper Transformers on Small Datasets

Abstract

It is a common belief that training deep transformers from scratch requires large datasets. Consequently, for small datasets, people usually use shallow and simple additional layers on top of pre-trained models during fine-tuning. This work shows that this does not always need to be the case: with proper initialization and optimization, the benefits of very deep transformers can carry over to challenging tasks with small datasets, including Text-to-SQL semantic parsing and logical reading comprehension. In particular, we successfully train 48 layers of transformers, comprising 24 fine-tuned layers from pre-trained RoBERTa and 24 relation-aware layers trained from scratch. With fewer training steps and no task-specific pre-training, we obtain state-of-the-art performance on the challenging cross-domain Text-to-SQL parsing benchmark Spider. We achieve this by deriving a novel Data-dependent Transformer Fixed-update initialization scheme (DT-Fixup), inspired by the prior T-Fixup work (Huang et al., 2020). Further error analysis shows that increasing depth can help improve generalization on small datasets for hard cases that require reasoning and structural understanding.
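
To make the described setup concrete, below is a minimal PyTorch sketch of the overall idea in the abstract: a pre-trained RoBERTa encoder with additional transformer layers stacked on top, whose freshly initialized weights are shrunk by a data-dependent factor computed from a sample batch before training. The class name DeepEncoder, the method data_dependent_init, the use of plain nn.TransformerEncoderLayer in place of the paper's relation-aware layers, and the specific depth- and input-norm-based scaling formula are illustrative assumptions, not the exact DT-Fixup rule derived in the paper.

import torch
import torch.nn as nn
from transformers import RobertaModel

class DeepEncoder(nn.Module):
    """Pre-trained RoBERTa encoder followed by extra layers trained from scratch."""

    def __init__(self, n_new_layers: int = 24):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained("roberta-large")  # 24 pre-trained layers
        cfg = self.roberta.config
        layer = nn.TransformerEncoderLayer(
            d_model=cfg.hidden_size,
            nhead=cfg.num_attention_heads,
            dim_feedforward=4 * cfg.hidden_size,
            batch_first=True,
        )
        # Plain encoder layers stand in for the paper's relation-aware layers.
        self.extra_layers = nn.TransformerEncoder(layer, num_layers=n_new_layers)

    @torch.no_grad()
    def data_dependent_init(self, input_ids, attention_mask):
        """Shrink newly added weight matrices using statistics of real inputs
        (an illustrative stand-in for the DT-Fixup scaling rule)."""
        hidden = self.roberta(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        input_norm = hidden.norm(dim=-1).mean().item()
        n = len(self.extra_layers.layers)
        scale = (n * max(input_norm, 1.0)) ** -0.5  # hypothetical depth- and data-dependent factor
        for block in self.extra_layers.layers:
            for p in block.parameters():
                if p.dim() > 1:  # scale weight matrices only; leave biases and LayerNorm alone
                    p.mul_(scale)

    def forward(self, input_ids, attention_mask):
        hidden = self.roberta(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.extra_layers(hidden, src_key_padding_mask=~attention_mask.bool())

In such a setup, data_dependent_init would be called once on a representative batch before the first optimizer step, so that early updates to the deep, randomly initialized stack stay bounded.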
