Microchemical Journal: Devoted to the Application of Microtechniques in all Branches of Science

Validity of the best practice in splitting data for hold-out validation strategy as performed on the ink strokes in the context of forensic science



Abstract

External testing (ET), also known as hold-out validation, is currently considered one of the most reliable ways to estimate the predictive ability of a statistical model. One safeguard against impermissible peeking in ET is to ensure that all replicates of a particular sample are included in either the test set or the training set, but not both. Suppose a sample X1 consists of two replicates (i.e. X1a and X1b). The model is said to benefit from impermissible peeking if X1a and X1b are split between the training and the test sets, respectively. As a result, the prediction model is expected to predict the test set easily and to present over-optimistic performance. In forensic document examination, an individual pen (IP) can be used to produce multiple ink strokes. In real-world practice, pens are manufactured in bulk, such that one big tank of ink is used to produce a large number of IPs. In other words, ink strokes produced by different IPs of the same pen model in fact originate from a single source (i.e. the same tank of ink). With respect to the aforementioned safeguard, then, how should one treat the ink strokes? Are they replicates or independent samples? In this context, the aim of the work is to investigate the validity of the safeguard in splitting a dataset for the hold-out validation strategy (i.e. ET) in the domain of forensic pen ink analysis. Infrared (IR) spectra of blue gel pen inks were used to demonstrate the practical aspect. The IR spectral data were collected from 1361 ink strokes that originated from 273 IPs of 23 pen models and 10 pen brands. Iterative stratified random sampling was employed to prepare 1000 pairs of training and test sets, split at a ratio of 7:3 using two different principles: (a) set IP - selection was conducted at the IP level to ensure that all the ink strokes originating from a particular IP are included in either the training set or the test set only; and (b) set NIP - ink strokes of a particular IP were a
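The two splitting principles can be illustrated with a minimal Python sketch on toy data. This is not the authors' code. The record layout, function names, and toy sizes are hypothetical, and because the description of set NIP is cut off in the abstract, the sketch assumes that NIP means sampling individual ink strokes without regard to the pen that produced them. Stratification here is by pen model, which is also an assumption.

```python
import random
from collections import defaultdict


def make_strokes(n_models=3, pens_per_model=10, strokes_per_pen=5):
    """Build toy (stroke_id, pen_id, model_id) records; all sizes are made up."""
    strokes = []
    for m in range(n_models):
        for p in range(pens_per_model):
            pen_id = f"M{m}-P{p}"
            for s in range(strokes_per_pen):
                strokes.append((f"{pen_id}-S{s}", pen_id, m))
    return strokes


def split_ip(strokes, test_frac=0.3, seed=0):
    """Set IP: sample whole pens within each pen model, so every
    stroke of a given pen lands on one side of the split only."""
    rng = random.Random(seed)
    pens_by_model = defaultdict(set)
    for _, pen, model in strokes:
        pens_by_model[model].add(pen)
    test_pens = set()
    for model in sorted(pens_by_model):
        pens = sorted(pens_by_model[model])
        rng.shuffle(pens)
        test_pens.update(pens[:round(len(pens) * test_frac)])
    train = [s for s in strokes if s[1] not in test_pens]
    test = [s for s in strokes if s[1] in test_pens]
    return train, test


def split_nip(strokes, test_frac=0.3, seed=0):
    """Set NIP (assumed meaning): sample individual strokes within
    each pen model, ignoring which pen produced them."""
    rng = random.Random(seed)
    strokes_by_model = defaultdict(list)
    for s in strokes:
        strokes_by_model[s[2]].append(s)
    train, test = [], []
    for model in sorted(strokes_by_model):
        items = list(strokes_by_model[model])
        rng.shuffle(items)
        k = round(len(items) * test_frac)
        test.extend(items[:k])
        train.extend(items[k:])
    return train, test


def shared_pens(train, test):
    """Pens that contribute strokes to both the training and the test set."""
    return {s[1] for s in train} & {s[1] for s in test}


strokes = make_strokes()
for name, splitter in (("IP", split_ip), ("NIP", split_nip)):
    train, test = splitter(strokes, seed=1)
    print(name, len(train), len(test), len(shared_pens(train, test)))
```

Repeating either splitter with different seeds gives the repeated pairs of training and test sets that the abstract describes. Under the IP principle no pen appears on both sides, whereas under the assumed NIP principle strokes of one pen are generally spread across both sets, which is the situation the safeguard is meant to exclude.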
