Conference on Uncertainty in Artificial Intelligence
Adaptive Stratified Sampling for Precision-Recall Estimation

Abstract

We propose a new algorithm for computing a constant-factor approximation of precision-recall (PR) curves for massive noisy datasets produced by generative models. Assessing the validity of items in such datasets requires human annotation, which is costly and must be minimized. Our algorithm, ADASTRAT, is the first data-aware method for this task. It chooses the next point to query on the PR curve adaptively, based on previous observations. It then selects specific items to annotate using stratified sampling. Under a mild monotonicity assumption, ADASTRAT outputs a guaranteed approximation of the underlying precision function, while using a number of annotations that scales very slowly with N, the dataset size. For example, when the minimum precision is bounded by a constant, it issues only log log N precision queries. In general, it has a regret of no more than log log N with respect to an oracle that issues queries at data-dependent (unknown) optimal points. On a scaled-up NLP dataset of 3.5M items, ADASTRAT achieves a remarkably close approximation of the true precision function using only 18 precision queries, 13× fewer than the best previous approaches.
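
To make the idea of a single "precision query" concrete, the following is a minimal sketch of estimating the precision of the top-k ranked items with plain stratified sampling. It is not the ADASTRAT algorithm, and it does not choose query points adaptively. The function name `stratified_precision`, the `annotate` callback (standing in for a human annotator), the equal-width strata over ranks, and the fixed per-stratum sample size are all assumptions made for illustration.

```python
import random
from typing import Callable, Optional


def stratified_precision(
    k: int,
    annotate: Callable[[int], bool],
    num_strata: int = 4,
    samples_per_stratum: int = 25,
    rng: Optional[random.Random] = None,
) -> float:
    """Estimate the fraction of valid items among ranks 0..k-1.

    `annotate(rank)` is a hypothetical oracle returning True when the
    item at that rank is valid (in practice, a human judgment).
    """
    rng = rng or random.Random(0)
    # Split ranks [0, k) into contiguous strata of near-equal size.
    bounds = [round(i * k / num_strata) for i in range(num_strata + 1)]
    estimate = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        size = hi - lo
        if size == 0:
            continue
        # Sample without replacement; annotate the whole stratum if it is small.
        n = min(samples_per_stratum, size)
        sample = rng.sample(range(lo, hi), n)
        frac_valid = sum(annotate(r) for r in sample) / n
        # Weight each stratum's estimate by its share of the top-k items.
        estimate += (size / k) * frac_valid
    return estimate


if __name__ == "__main__":
    # Toy oracle (assumption): only the first 500 ranks are valid.
    print(stratified_precision(1000, lambda rank: rank < 500))
```

With the default settings this sketch spends at most 100 annotations per query regardless of k. The abstract's contribution concerns how few such queries are needed to approximate the whole precision function, which this sketch does not address.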