Conference: Annual Meeting of the Association for Computational Linguistics

Bayes Test of Precision, Recall, and F_1 Measure for Comparison of Two Natural Language Processing Models

Abstract

Directly comparing point estimates of the precision (P), recall (R), and F_1 measure of two natural language processing (NLP) models on a common test corpus is unreasonable and yields less replicable conclusions because no statistical test is involved. However, the existing t-tests in cross-validation (CV) for model comparison are inappropriate because the distributions of P, R, and F_1 are skewed and an interval estimate of P, R, or F_1 based on a t-test may extend beyond [0,1]. In this study, we propose using a block-regularized 3 × 2 CV (3 × 2 BCV) in model comparison because it can regularize the difference in certain frequency distributions over linguistic units between training and validation sets and yield stable estimators of P, R, and F_1. On the basis of the 3 × 2 BCV, we calibrate the posterior distributions of P, R, and F_1 and derive accurate interval estimates of them. Furthermore, we formulate the comparison as a hypothesis testing problem and propose a novel Bayes test. The test directly computes the probabilities of the hypotheses from the posterior distributions and provides more informative decisions than the existing significance t-tests. Three experiments on NLP chunking tasks are conducted, and the results illustrate the validity of the Bayes test.
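To make the idea of computing hypothesis probabilities from posterior distributions concrete, the following is a minimal Python sketch. It is not the 3 × 2 BCV-calibrated posterior proposed in the paper, whose details the abstract does not give. It assumes a simpler stand-in: the true-positive, false-positive, and false-negative counts of each model follow a multinomial with a symmetric Dirichlet prior, and the two models' posteriors are treated as independent. The function names (`sample_prf`, `credible_interval`, `prob_a_better`) and all counts are hypothetical.

```python
import random


def sample_prf(tp, fp, fn, n_draws=20000, prior=1.0, seed=0):
    """Draw joint posterior samples of (P, R, F1) from TP/FP/FN counts.

    Assumption (not from the paper): the counts are multinomial with a
    symmetric Dirichlet(prior) prior, sampled here via Gamma variates.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        g_tp = rng.gammavariate(tp + prior, 1.0)
        g_fp = rng.gammavariate(fp + prior, 1.0)
        g_fn = rng.gammavariate(fn + prior, 1.0)
        p = g_tp / (g_tp + g_fp)                   # precision
        r = g_tp / (g_tp + g_fn)                   # recall
        f1 = 2 * g_tp / (2 * g_tp + g_fp + g_fn)   # F1 measure
        draws.append((p, r, f1))
    return draws


def credible_interval(values, level=0.95):
    """Equal-tailed interval from samples; it always lies inside [0, 1]."""
    ordered = sorted(values)
    lo = ordered[int((1 - level) / 2 * len(ordered))]
    hi = ordered[int((1 + level) / 2 * len(ordered)) - 1]
    return lo, hi


def prob_a_better(metric_a, metric_b):
    """Monte Carlo estimate of Pr(metric of A > metric of B | data)."""
    wins = sum(a > b for a, b in zip(metric_a, metric_b))
    return wins / len(metric_a)


# Hypothetical counts for two models on the same test corpus.
draws_a = sample_prf(tp=850, fp=120, fn=150, seed=1)
draws_b = sample_prf(tp=820, fp=140, fn=180, seed=2)
f1_a = [d[2] for d in draws_a]
f1_b = [d[2] for d in draws_b]

print("95% interval for F1 of model A:", credible_interval(f1_a))
print("Pr(F1_A > F1_B | data):", prob_a_better(f1_a, f1_b))
```

Because every sample is a ratio of positive quantities, the resulting interval cannot leave [0,1], unlike an interval built from a t-statistic. The final probability is read directly as the posterior support for the hypothesis that model A has the higher F_1, under the stated assumptions.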
