Bayes Test of Precision, Recall, and F_1 Measure for Comparison of Two Natural Language Processing Models


Abstract

Directly comparing point estimates of the precision (P), recall (R), and F_1 measure of two natural language processing (NLP) models on a common test corpus is unreasonable and yields less replicable conclusions, because no statistical test is involved. However, the existing t-tests in cross-validation (CV) for model comparison are inappropriate because the distributions of P, R, and F_1 are skewed and an interval estimate of P, R, or F_1 based on a t-test may exceed [0,1]. In this study, we propose to use a block-regularized 3 × 2 CV (3 × 2 BCV) in model comparison because it can regularize the difference in certain frequency distributions over linguistic units between training and validation sets and yield stable estimators of P, R, and F_1. On the basis of the 3 × 2 BCV, we calibrate the posterior distributions of P, R, and F_1 and derive accurate interval estimates of P, R, and F_1. Furthermore, we formulate the comparison as a hypothesis testing problem and propose a novel Bayes test. The test directly computes the probabilities of the hypotheses from the posterior distributions and provides more informative decisions than the existing significance t-tests. Three experiments on NLP chunking tasks are conducted, and the results illustrate the validity of the Bayes test.
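The general idea of comparing two models through posterior probabilities rather than a t-test can be illustrated with a minimal sketch. It is not the calibrated 3 × 2 BCV posterior proposed in the paper: it assumes a single test split, a uniform Dirichlet prior over the true-positive, false-positive, and false-negative proportions, and independent posteriors for the two models. The function names `sample_prf` and `prob_a_better` and the counts in the usage lines are hypothetical.

```python
import random


def sample_prf(tp, fp, fn, rng, prior=1.0):
    """Draw one (precision, recall, F1) sample from a Dirichlet posterior
    over the TP / FP / FN proportions (assumed prior: symmetric Dirichlet)."""
    # A Dirichlet draw is a set of independent Gamma draws, normalised;
    # the normalising constant cancels in the three ratios below.
    g_tp = rng.gammavariate(tp + prior, 1.0)
    g_fp = rng.gammavariate(fp + prior, 1.0)
    g_fn = rng.gammavariate(fn + prior, 1.0)
    precision = g_tp / (g_tp + g_fp)
    recall = g_tp / (g_tp + g_fn)
    f1 = 2 * g_tp / (2 * g_tp + g_fp + g_fn)
    return precision, recall, f1


def prob_a_better(counts_a, counts_b, n_samples=20000, seed=0):
    """Monte Carlo estimate of the posterior probability that model A beats
    model B on precision, recall, and F1 (returned in that order).

    counts_a and counts_b are (tp, fp, fn) tuples. The two posteriors are
    treated as independent, which is a simplifying assumption here.
    """
    rng = random.Random(seed)
    wins = [0, 0, 0]
    for _ in range(n_samples):
        sample_a = sample_prf(*counts_a, rng)
        sample_b = sample_prf(*counts_b, rng)
        for i in range(3):
            if sample_a[i] > sample_b[i]:
                wins[i] += 1
    return tuple(w / n_samples for w in wins)


# Hypothetical chunking counts (tp, fp, fn) for two models.
p_prec, p_rec, p_f1 = prob_a_better((900, 100, 150), (850, 150, 150))
print(f"P(A > B): precision={p_prec:.3f}, recall={p_rec:.3f}, F1={p_f1:.3f}")
```

Because every sample is a ratio of positive quantities, the sampled P, R, and F_1 values always lie in [0,1], which is the property a t-based interval cannot guarantee. A decision rule would then compare the estimated probabilities against a chosen threshold.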


Bibliographic details

Source
Annual Meeting of the Association for Computational Linguistics | 2019 | pp. 4135-4145 | 11 pages
Authors
Ruibo Wang; Jihong Li
Original format
PDF

