The aim of this paper is to investigate suitable evaluation strategies for word-level quality estimation of machine translation. We suggest various metrics to replace the F_1-score for the "BAD" class, which is currently used as the main evaluation metric. We compare the metrics' performance on real system outputs and on synthetically generated datasets, and suggest a reliable alternative to the F_1-BAD score: the multiplication of the F_1-scores for the different classes. Other metrics have lower discriminative power and are biased by unfair labellings.
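The proposed metric, the product of per-class F_1-scores, can be sketched as follows (a minimal illustration; the function name and the binary OK/BAD label set are assumptions for this example, not code from the paper):

```python
def f1_mult(y_true, y_pred, labels=("OK", "BAD")):
    """Multiply per-class F1-scores over the given label set.

    A per-class F1 of zero (e.g. a degenerate all-"OK" prediction)
    drives the whole product to zero, which is what makes the metric
    robust to such trivial labellings.
    """
    product = 1.0
    for label in labels:
        tp = sum(t == p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        n_true = sum(t == label for t in y_true)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true if n_true else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        product *= f1
    return product


# Example: one "OK" token mislabelled as "BAD".
gold = ["OK", "OK", "BAD", "BAD"]
pred = ["OK", "BAD", "BAD", "BAD"]
print(f1_mult(gold, pred))  # F1_OK * F1_BAD = (2/3) * (4/5) = 8/15
```

Note that, unlike F_1-BAD alone, this product cannot be inflated by over-predicting a single class: a classifier that labels every token "BAD" gets F_1-OK = 0 and therefore a score of 0.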