Fluency, adequacy, or HTER?: exploring different human judgments with a tunable MT metric

Title: Fluency, adequacy, or HTER?: exploring different human judgments with a tunable MT metric
Publication Type: Conference Papers
Year of Publication: 2009
Authors: Snover M, Madnani N, Dorr BJ, Schwartz R
Conference Name: Proceedings of the Fourth Workshop on Statistical Machine Translation
Date Published: 2009
Publisher: Association for Computational Linguistics
Conference Location: Stroudsburg, PA, USA
Abstract:

Automatic Machine Translation (MT) evaluation metrics have traditionally been evaluated by the correlation of the scores they assign to MT output with human judgments of translation performance. Different types of human judgments, such as Fluency, Adequacy, and HTER, measure varying aspects of MT performance that can be captured by automatic MT metrics. We explore these differences through the use of a new tunable MT metric: TER-Plus, which extends the Translation Edit Rate evaluation metric with tunable parameters and the incorporation of morphology, synonymy and paraphrases. TER-Plus was shown to be one of the top metrics in NIST's Metrics MATR 2008 Challenge, having the highest average rank in terms of Pearson and Spearman correlation. Optimizing TER-Plus to different types of human judgments yields significantly improved correlations and meaningful changes in the weight of different types of edits, demonstrating significant differences between the types of human judgments.
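
The abstract describes judging a metric by how well its scores correlate with human judgments, using Pearson and Spearman correlation. As a rough illustration only, the sketch below computes both coefficients for a small set of scores. The function names and all numbers are made up for this example; this is not the TER-Plus implementation or the evaluation code used in the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: one value per translated segment.
# An edit-rate-style metric is lower for better output, so a useful
# metric shows a strong negative correlation with adequacy ratings.
metric_scores = [0.42, 0.35, 0.58, 0.21, 0.49]
human_adequacy = [3.0, 3.5, 2.0, 4.5, 2.5]

print("Pearson: ", pearson(metric_scores, human_adequacy))
print("Spearman:", spearman(metric_scores, human_adequacy))
```

Pearson measures linear agreement between the raw scores, while Spearman depends only on their rank order, so the two can rank the same set of metrics differently.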

URL: http://dl.acm.org/citation.cfm?id=1626431.1626480