Computer Science > Computation and Language

Title: An Overview on Machine Translation Evaluation

Abstract: Since the 1950s, machine translation (MT) has been one of the important tasks of AI, and its development has passed through several periods and stages, including rule-based methods, statistical methods, and the recently proposed neural network-based learning methods. Accompanying these staged leaps is the research and development of MT evaluation, which has played an especially important role in statistical translation and neural translation research. The evaluation task of MT is not only to assess the quality of machine translation, but also to give timely feedback to machine translation researchers on the problems existing in machine translation itself and on how it can be improved and optimised. In some practical application fields, such as when reference translations are absent, the quality estimation of machine translation plays an important role as an indicator of the credibility of automatically translated target-language text. This report covers the following contents: a brief history of machine translation evaluation (MTE), the classification of research methods on MTE, and the cutting-edge progress, including human evaluation, automatic evaluation, and evaluation of evaluation methods (meta-evaluation). Human evaluation and automatic evaluation each include reference-translation-based and reference-translation-independent settings; automatic evaluation methods include traditional n-gram string matching, models applying syntax and semantics, and deep learning models; evaluation of evaluation methods includes estimating the credibility of human evaluations, the reliability of automatic evaluations, the reliability of the test set, etc. Advances in cutting-edge evaluation methods include task-based evaluation, the use of pre-trained language models based on big data, and lightweight optimised models using distillation techniques.


Open access | Published: 01 September 2020

Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals

Martin Popel, Marketa Tomkova, Jakub Tomek, Łukasz Kaiser, Jakob Uszkoreit, Ondřej Bojar & Zdeněk Žabokrtský

Nature Communications, volume 11, Article number: 4381 (2020)


Subjects: Communication, Computer science

The quality of human translation was long thought to be unattainable for computer translation systems. In this study, we present a deep-learning system, CUBBITT, which challenges this view. In a context-aware blind evaluation by human judges, CUBBITT significantly outperformed professional-agency English-to-Czech news translation in preserving text meaning (translation adequacy). While human translation is still rated as more fluent, CUBBITT is shown to be substantially more fluent than previous state-of-the-art systems. Moreover, most participants of a Translation Turing test struggled to distinguish CUBBITT translations from human translations. This work approaches the quality of human translation and even surpasses it in adequacy in certain circumstances. This suggests that deep learning may have the potential to replace humans in applications where conservation of meaning is the primary aim.


Introduction

The idea of using computers for translation of natural languages is as old as computers themselves 1 . However, achieving major success has remained elusive, in spite of the unwavering efforts of machine translation (MT) research over the last 70 years. The main challenges faced by MT systems are correctly resolving the inherent ambiguity of language in the source text and adequately expressing its intended meaning in the target language (translation adequacy) in a well-formed and fluent way (translation fluency). Among the key complications is the rich morphology in the source and especially in the target language 2 . For these reasons, the level of human translation has been thought to be the upper bound of achievable performance 3 . There are also other challenges in recent MT research, such as gender bias 4 or unsupervised MT 5 , which are mostly orthogonal to the present work.

Deep learning has transformed multiple fields in recent years, ranging from computer vision 6 to artificial intelligence in games 7 . In line with these advances, the field of MT has shifted to deep-learning neural-based methods 8 , 9 , 10 , 11 , which replaced previous approaches such as rule-based systems 12 or statistical phrase-based methods 13 , 14 . Relying on vast amounts of training data and unprecedented computing power, neural MT (NMT) models can now afford to access the complete information available anywhere in the source sentence and automatically learn which piece is useful at which stage of producing the output text. The removal of these past independence assumptions is the key reason behind the dramatic improvement of translation quality. As a result, neural translation has even managed to considerably narrow the gap to human-translation quality on isolated sentences 15 , 16 .

In this work, we present a neural-based translation system, CUBBITT (Charles University Block-Backtranslation-Improved Transformer Translation), which significantly outperformed professional translators on isolated sentences in the prestigious WMT 2018 competition, namely the English–Czech News Translation Task 17 . We perform a new study with conditions that are more representative and far more challenging for MT, showing that CUBBITT conveys the meaning of news articles significantly better than human translators even when the cross-sentence context is taken into account. In addition, we validate the methodological improvements using an automatic metric on English↔French and English↔Polish news articles. Finally, we provide insights into the principles underlying CUBBITT's key technological advancement and how it improves the translation quality.

Deep-learning framework transformer

Our CUBBITT system (Methods 1) follows the basic Transformer encoder-decoder architecture introduced by Vaswani et al. 18 . The encoder represents subwords 19 in the source-language sentence by a list of vectors, automatically extracting features describing relevant aspects and relationships in the sentence, creating a deep representation of the original sentence. Subsequently, the decoder converts the deep representation to a new sentence in the target language (Fig.  1a , Supplementary Fig.  1 ).

Figure 1

a The input sentence is converted to a numerical representation and encoded into a deep representation by a six-layer encoder, which is subsequently decoded by a six-layer decoder into the translation in the target language. Layers of the encoder and decoder consist of self-attention and feed-forward layers; the decoder also contains an encoder-decoder attention layer, whose input is the deep representation created by the last layer of the encoder. b Visualization of encoder self-attention between the first two layers (one attention head shown, focusing on “magazine” and “her”). The strong attention link between ‘magazine’ and ‘gun’ suggests why CUBBITT ultimately correctly translates “magazine” as “zásobník” (gun magazine) rather than “časopis” (e.g., news magazine). The attention link between ‘woman’ and ‘her’ illustrates how the system internally learns coreference. c Encoder-decoder attention on the second layer of the decoder. Two heads are shown in different colors, each focusing on a different translation aspect, described in italics. We note that the attention weights were learned spontaneously by the network, not specified a priori.

A critical feature of the encoder and decoder is self-attention, which allows identification and representation of relationships between sentence elements. While the encoder attention captures the relationship between the elements in the input sentence (Fig.  1b ), the encoder-decoder attention learns the relationship between elements in the deep representation of the input sentence and elements in the translation (Fig.  1c ). In particular, our system utilizes the so-called multi-head attention, where several independent attention functions are trained at once, allowing representation of multiple linguistic phenomena. These functions may facilitate, for example, the translation of ambiguous words or coreference resolution.

Utilizing monolingual data via backtranslation

The success of NMT depends heavily on the quantity and quality of the training parallel sentences (i.e., pairs of sentences in the source and target language). Thanks to the long-term efforts of researchers, large parallel corpora have been created for several language pairs, e.g., the Czech-English corpus CzEng 20 or the multi-lingual corpus Opus 21 . Although millions of parallel sentences became freely available in this way, this is still not sufficient. However, the parallel data can be complemented by monolingual target-language data, which are usually available in much larger amounts than the parallel data. CUBBITT leverages the monolingual data using a technique termed backtranslation, where the monolingual target-language data are machine translated into the source language, and the resulting sentence pairs are used as additional (synthetic) parallel training data 19 . Since the target side in backtranslation consists of authentic sentences originally written in the target language, backtranslation can improve the fluency (and sometimes even the adequacy) of the final translations by naturally learning the language model of the target language.
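
To make the idea concrete, the following is a minimal sketch of backtranslation; the `reverse_model.translate` interface is hypothetical (the actual pipeline uses Tensor2Tensor models, see Methods 9).

```python
# Minimal sketch of backtranslation (illustrative only).
def backtranslate(monolingual_target_sentences, reverse_model):
    """Create synthetic parallel data from target-language monolingual text.

    `reverse_model.translate` is a hypothetical target->source MT interface.
    The authentic target sentence becomes the reference side of the pair.
    """
    synthetic_pairs = []
    for tgt in monolingual_target_sentences:
        src = reverse_model.translate(tgt)   # machine-translated (synthetic) source
        synthetic_pairs.append((src, tgt))   # target side stays authentic
    return synthetic_pairs
```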

CUBBITT is trained with backtranslation data in a novel block regime (block-BT), where the training data are presented to the neural network in blocks of authentic parallel data alternating with blocks of synthetic data. We compared our block regime to backtranslation using the traditional mixed regime (mix-BT), where all synthetic and authentic sentences are mixed together in random order, and evaluated the learning curves using BLEU, an automatic measure that compares the similarity of an MT output to human reference translations (Methods 2–13). While training with mix-BT led to a gradually increasing learning curve, block-BT showed further improved performance in the authentic training phases, alternating with reduced performance in the synthetic ones (Fig.  2a , thin lines). In the authentic training phases, block-BT was better than mix-BT, suggesting that a model extracted in an authentic-data phase might perform better than a mix-BT-trained model.

Figure 2

a The effect of averaging the eight last checkpoints with block-BT and mix-BT on the translation quality as measured by BLEU on the development set WMT13 newstest. The callouts (pointing to the initial and final peaks of the block-BT + avg8 curve) illustrate the 8 averaged checkpoints (synth-trained ones as brown circles, auth-trained ones as violet circles). b Diagram of iterated backtranslation: the system MT1, trained only on authentic parallel data, is used to translate monolingual Czech data into English, which are used to train system MT2; this step can be iterated one or more times to obtain MT3, MT4, etc. The block-BT + avg8 model shown in (a) is the MT2 model in (b) and in Supplementary Fig.  2 . c BLEU results on the WMT17 test-set relative to the WMT17 winner UEdin2017. All five systems use checkpoint averaging.

CUBBITT combines block-BT with checkpoint averaging, where the networks in the eight last checkpoints are merged together using an element-wise arithmetic average, a very efficient approach to gain better stability and thereby improve model performance 18 . Importantly, we observed that checkpoint averaging works in synergy with block-BT. The BLEU improvement when using this combination is clearly higher than the sum of the BLEU improvements of the two methods in separation (Fig.  2a ). The best performance was gained when averaging authentic-trained and synthetic-trained models in a ratio of 6:2; interestingly, the same ratio turned out to be optimal on several occasions during training. This also points to an advantage of block-BT combined with checkpoint averaging: the method automatically finds the optimal ratio of the two types of synthetic/authentic-trained models, as it evaluates all the ratios during training (Fig.  2a ).

The final CUBBITT system was trained using iterated block-BT (Fig.  2b , Supplementary Fig.  2 ). This was accompanied by other steps, such as data filtering, translationese tuning, and simple regex postprocessing (Methods 11). Evaluating the individual components of CUBBITT automatically on a previously unseen test-set from WMT17 showed a significant improvement in BLEU over UEdin2017, the state-of-the-art system from 2017 (Fig.  2c ).

Evaluation: CUBBITT versus a professional agency translation

In 2018, CUBBITT won the English→Czech and Czech→English news translation tasks in WMT18 17 . It not only surpassed its machine competitors, but was also the only MT system that significantly outperformed the reference human translation by a professional agency in the WMT18 English→Czech news translation task (other language pairs were not evaluated in a way that allows comparison with the human reference) (Fig.  3a ). Since this result is highly surprising, we decided to investigate it in greater detail, evaluating potential confounding factors and focusing on how it can be explained and interpreted. We first confirmed that the results are not due to the original language of the reference sentences being English in half of the evaluated sentences and Czech in the other half of the test dataset (Supplementary Fig.  4 ; Methods 13), which was proposed as a potential confounding factor by the WMT organizers 17 and others 22 , 23 .

Figure 3

a Results from the context-unaware evaluation in WMT18, showing distributions of source-based direct assessment (SrcDA) of five MT systems and the human reference translation, sorted by average score. CUBBITT was submitted under the name CUNI-Transformer. Online G, A, and B are three anonymized online MT systems. b Translations by CUBBITT and the human reference were scored by six non-professionals in terms of adequacy, fluency, and overall quality in a context-aware evaluation. The evaluation was blind, i.e., no information was provided on whether the translations were human or machine translated. The scores (0–10) are shown as violin plots with boxplots (median + interquartile range), while the boxes below represent the percentage of sentences scored better in reference (orange), CUBBITT (blue), or the same (gray); the star symbol marks the ratio of orange vs. blue, ignoring gray. A sign test was used to evaluate the difference between human and machine translation. c As in b , but with evaluation by six professional translators. *** P  < 0.001; ** P  < 0.01; * P  < 0.05.

An important drawback of the WMT18 evaluation was the lack of cross-sentence context, as sentences were evaluated in random order and without document context. While the participating MT systems translated individual sentences independently, the human reference was created as a translation of entire documents (news articles). The absence of cross-sentence context in the evaluation was recently shown to cause an overestimation of the quality of MT translations compared to the human reference 22 , 23 . For example, evaluators will miss MT errors that would be evident only from the cross-sentence context, such as gender mismatch or incorrect translation of an ambiguous expression. On the other hand, independent evaluation of sentences translated with cross-sentence context in mind might unfairly penalize reference translations for moving pieces of information across sentence boundaries, as this will appear as an omission of meaning in one sentence and an addition in another.

We therefore conducted a new evaluation, using the same English→Czech test dataset of source documents, CUBBITT translations, and human reference translations, but presenting the evaluators not only with the evaluated sentences but also with the document context (Methods 14–18; Supplementary Fig.  5 ). In order to gain further insight into the results, we asked the evaluators to assess the translations in terms of adequacy (the degree to which the meaning of the source sentence is preserved in the translation), fluency (how fluent the sentence sounds in the target language), as well as the overall quality of the translations. Inspired by a recent discussion of the translation proficiency of evaluators 22 , we recruited two groups of evaluators: six professional translators (native in the target language) and seven non-professionals (with excellent command of the source language and native in the target language). An additional exploratory group of three translation theoreticians was recruited. In total, 15 out of the 16 evaluators passed a quality control check, giving 7824 sentence-level scores on 53 documents in total. See Methods 13–18 for further technical details of the study.

Focusing first on evaluations by non-professionals, as in WMT18, but in our context-aware assessment, CUBBITT was evaluated as significantly better than the human reference in adequacy ( P  = 4.6e-8, sign test), with 52% of sentences scored better and only 26% of sentences scored worse (Fig.  3b ). On the other hand, the evaluators found the human reference to be more fluent ( P  = 2.1e-6, sign test), rating CUBBITT better in 26% of sentences and worse in 48% (Fig.  3b ). In overall quality, CUBBITT non-significantly outperformed the human reference ( P  = 0.6, sign test, 41% better than reference, 38% worse; Fig.  3b ).

In the evaluation by professional translators, CUBBITT remained significantly better in adequacy than the human reference ( P  = 7.1e-4, sign test, 49% better, 33% worse; Fig.  3c ), although it scored worse in both fluency ( P  = 3.3e-19, sign test, 23% better, 64% worse) and overall quality ( P  = 3.0e-7, sign test, 32% better, 56% worse; Fig.  3c ). Fitting a linear model weighting adequacy and fluency in the overall quality suggests that professional translators value fluency more than non-professionals; this pattern was also observed in the exploratory group of translation theoreticians (Supplementary Fig.  6 ). Finally, when scores from all 15 evaluators were pooled together, the previous results were confirmed: CUBBITT outperformed the human reference in adequacy, whereas the reference was scored better in fluency and overall quality (Supplementary Fig.  7 ). Surprisingly, we observed a weak but significant effect of sentence length, showing that CUBBITT's performance relative to the human reference is more favorable in longer sentences with regard to adequacy, fluency, and overall quality (Supplementary Fig.  8 , including an example of a well-translated complex sentence).

We next decided to perform an additional evaluation that would allow us to better understand where and why our machine translations are better or worse than the human translations. We asked three professional translators and three non-professionals to annotate the types of errors in the two translations (Methods 19). In addition, the evaluators were asked to indicate whether the translation was wrong because of cross-sentence context.

CUBBITT made significantly fewer errors in addition of meaning, omission of meaning, shift of meaning, other adequacy errors, grammar, and spelling (Fig.  4a , example in Fig.  5a–c , Supplementary Data  1 ). On the other hand, the reference performed better in the error classes of other fluency errors and ambiguous words (Fig.  4a , Supplementary Fig.  9 , examples in Fig.  5d, e , Supplementary Data  1 ). As expected, CUBBITT made significantly more errors due to cross-sentence context (11.7% compared to 5.2% in the reference, P  = 1.2e-10, sign test, Fig.  4a ), confirming the importance of context-aware evaluation of translation quality. Interestingly, when only sentences without context errors are taken into account, not only adequacy but also overall quality is significantly better in CUBBITT compared to the reference in ratings by non-professionals ( P  = 0.001, sign test, 49% better, 29% worse; Supplementary Fig.  10 ), in line with the context-unaware evaluation in WMT18.

Figure 4

a Percentages of sentences with various types of errors are shown for translations by the human reference and CUBBITT. Errors in 405 sentences were evaluated by six evaluators (three professional translators and three non-professionals). A sign test was used to evaluate the difference between human and machine translation. b Translations by five machine translation systems were scored by five professional translators in terms of adequacy and fluency in a blind context-aware evaluation. The systems are sorted according to mean performance, and the scores (0–10) for individual systems are shown as violin plots with boxplots (median + interquartile range). For each pair of neighboring systems, the box in between them represents the percentage of sentences scored better in one, the other, or the same in both (gray). The star symbol marks the ratio when ties are ignored. A sign test was used to evaluate the difference between the pairs of MT systems. *** P  < 0.001; ** P  < 0.01; * P  < 0.05.

Figure 5

The Czech translations by the human reference and CUBBITT, as well as the values of the manual evaluation for the individual sentences, are shown in Supplementary Data  1 .

We observed that the type of document, e.g., business vs. sports articles, can also affect the quality of machine translation relative to human translation (Methods 18). The number of evaluated documents (53) does not allow for strong and significant conclusions at the level of whole documents, but the document-level evaluations nevertheless suggest that CUBBITT performs best on news articles about business and politics (Supplementary Fig.  11A-B ). Conversely, it performed worst on entertainment/art (both in adequacy and fluency) and on news articles about sport (in fluency). Similar results can also be observed in sentence-level evaluations across document types (Supplementary Fig.  11C–D ).

The fact that translation adequacy is the main strength of CUBBITT is surprising, as NMT was shown to improve primarily fluency over previous approaches 24 . We were therefore interested in comparing the fluency of translations made by CUBBITT with that of previous state-of-the-art MT systems (Methods 20). We performed an evaluation of CUBBITT in a side-by-side direct comparison with Google Translate 15 (an established benchmark for MT) and UEdin 25 (the winning system in WMT 2017 and a runner-up in WMT 2018). Moreover, we included a version of the basic Transformer with one iteration of mix-BT, and another version of the basic Transformer with block-BT (but without iterated block-BT), providing a human rating of different approaches to backtranslation. The evaluators were asked to evaluate the adequacy and fluency of the five presented translations (again in a blind setting and taking cross-sentence context into account).

In the context-aware evaluation of the five MT systems, CUBBITT significantly outperformed Google Translate and UEdin both in adequacy (mean increase of 2.4 and 1.2 points, respectively) and fluency (mean increase of 2.1 and 1.2 points, respectively) (Fig.  4b ). The evaluation also shows that this increase in performance stems from the inclusion of several components of CUBBITT: the Transformer model with basic (mix-BT) backtranslation, the replacement of mix-BT with block-BT (adequacy: mean increase of 0.4, P  = 3.9e-5; fluency: mean increase of 0.3, P  = 1.4e-4, sign test), and to a lesser extent also other features of the final CUBBITT system, such as iterated backtranslation and data filtering (adequacy: mean increase of 0.2, P  = 0.054; fluency: mean increase of 0.1, P  = 0.233, sign test).

Finally, we were interested to see whether CUBBITT translations are distinguishable from human translations. We therefore conducted a sentence-level Translation Turing test, in which participants were asked to judge, for 100 independent sentences, whether a translation of a sentence was produced by a machine or a human (the source sentence and a single translation were shown; Methods 21). A group of 16 participants was given machine translations by Google Translate mixed in a 1:1 ratio with reference translations. In this group, only one participant (with an accuracy of 61%) failed to significantly distinguish between machine and human translations, while the other 15 participants recognized the human translations in the test (with accuracy reaching as high as 88%; Fig.  6 ). In a different group of 15 participants, who were presented with machine translations by CUBBITT mixed (again in a 1:1 ratio) with reference translations, nine participants did not reach the significance threshold of the test (with the lowest accuracy being 56%; Fig.  6 ). Interestingly, CUBBITT was not significantly distinguished from human translations by three professional translators, three MT researchers, and three other participants. One potential contributor to the human-likeness of CUBBITT could be the fact that it is capable of restructuring translated sentences where the English structure would sound unnatural in Czech (see an example in Fig.  5f , Supplementary Data  1 ).

Figure 6

a Accuracy of individual participants in distinguishing machine from human translations is shown as a bar graph. A Fisher test was used to assess whether each participant significantly distinguished human and machine translations, and the Benjamini–Hochberg method was used to correct for multiple testing. Participants with a Q value below 0.05 were considered to have significantly distinguished between human and machine translations. b Percentage of participants who significantly distinguished human and machine translations for CUBBITT (top, blue) and for Google Translate (bottom, green).

Generality of block backtranslation

Block-BT with checkpoint averaging clearly improves English→Czech news translation quality. To demonstrate that the benefits of our approach are not limited to this language pair, we trained English→French, French→English, English→Polish, and Polish→English versions of CUBBITT (Methods 4, 5, 12) and evaluated them using BLEU as in Fig.  2a . The results are consistent with the behavior on the English→Czech language pair, showing a synergistic benefit of block-BT with checkpoint averaging (Fig.  2a , Supplementary Figs.  3 , 14 ).

How block backtranslation improves translation

Subsequently, we sought to investigate the synergy between block-BT and checkpoint averaging, trying to gain insight into the mechanism by which it improves translation on the English→Czech language pair. We first tested a simple hypothesis that the only benefit of the block regime and checkpoint averaging is automatic detection of the optimal ratio of authentic and synthetic data, given that in block-BT the averaging window explores various ratios of networks trained on authentic and synthetic data. Throughout our experiments, the optimal ratio of authentic and synthetic blocks was ca. 3:1, so we hypothesized that mix-BT would benefit from authentic and synthetic data mixed in the same ratio. However, this hypothesis was not supported by additional explorations (Supplementary Fig.  15 ), which suggests that a more profound mechanism underlies the synergy.

We next hypothesized that training the system in the block regime, compared to the mix regime, might help the network focus on the two types of blocks (authentic and synthetic) one at a time. This would allow the networks to more thoroughly learn the properties and benefits of the two blocks, leading to a better exploration of the space of networks and ultimately yielding greater translation diversity during training. We measured the translation diversity of a single sentence as the number of all unique translations produced by the MT system at hourly checkpoints during training. Comparing translation diversity between block-BT and mix-BT on the WMT13 newstest, we observed block-BT to have greater translation diversity in 78% of sentences, smaller in 18%, and equal in the remaining 4% (Methods 22–23), supporting the hypothesis of greater translation diversity of block-BT compared to mix-BT.
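
A minimal sketch of this diversity measure, assuming the per-checkpoint translations have already been collected into a hypothetical nested list `checkpoint_translations`:

```python
# Sketch of the translation-diversity measure described above: for each test
# sentence, count the number of distinct translations produced across hourly
# checkpoints. checkpoint_translations[c][i] is the translation of sentence i
# at checkpoint c (hypothetical data structure).
def translation_diversity(checkpoint_translations):
    n_sentences = len(checkpoint_translations[0])
    return [
        len({ckpt[i] for ckpt in checkpoint_translations})
        for i in range(n_sentences)
    ]

# Comparing two training regimes sentence by sentence (illustrative):
# div_block = translation_diversity(block_bt_outputs)
# div_mix   = translation_diversity(mix_bt_outputs)
# share_greater = sum(b > m for b, m in zip(div_block, div_mix)) / len(div_block)
```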

The increased diversity could be leveraged by checkpoint averaging in multiple ways. In theory, this can be as simple as selecting the most frequent sentence translation among the eight averaged checkpoints. At the same time, checkpoint averaging can generate sentences that were not present as the preferred translation in any of the eight averaged checkpoints (termed a novel Avg8 translation), potentially combining the checkpoints' best translation properties. This may involve producing a combination of phrase translations seen in the averaged checkpoints (Fig.  7a , Supplementary Fig.  17 ), or creating a sentence with phrases not seen in any of the averaged checkpoints (Fig.  7b ). The fact that even phrase translations with low frequency in the eight averaged checkpoints can be chosen by checkpoint averaging stems from the way the confidence of the networks in their translations is taken into account (Supplementary Fig.  18 ).

Figure 7

a A case where the translation resulting from checkpoint averaging is a crossover of translations present in AUTH and SYNTH blocks. All the mentioned translations are shown in Supplementary Fig.  17 . b A case where the translation resulting from checkpoint averaging contains a phrase that is not the preferred translation in any of the averaged checkpoints.

Comparing the translations produced by models with and without averaging, we observed that averaging generated at least one translation never seen without averaging (termed novel Avg∞) in 60% of sentences in block-BT and in 31.6% of sentences in mix-BT (Methods 23). Moreover, averaging generated more novel Avg∞ translations in block-BT than in mix-BT in 55% of sentences, fewer in only 6%, and an equal number in 39%.

We next sought to explore the mechanism behind the greater translation diversity and the larger number of novel Avg translations in block-BT compared to mix-BT. We therefore computed how translation diversity and novel Avg8 translations develop over time during training and how they relate temporally to the blocks of authentic and synthetic data (Methods 24). In order to track these features over time, we computed diversity and novel Avg8 translations using the last eight checkpoints (the width of the averaging window) for each checkpoint during training. While mix-BT gradually and smoothly decreased in both metrics over time, block-BT showed a striking difference between the alternating blocks of authentic and synthetic data (Fig.  8a , Supplementary Fig.  16 ). The novel Avg8 translations in block-BT were most frequent in checkpoints where the eight averaged checkpoints spanned both authentic- and synthetic-trained blocks (Fig.  8a ). Interestingly, the translation diversity of the octuples of checkpoints in block-BT (without averaging) was also highest at the borders of the blocks (Supplementary Fig.  16 ). This suggests that it is the alternation of the blocks that increases the diversity of the translations and the generation of novel translations by averaging in block-BT.

Figure 8

a Percentage of WMT13 newstest sentences with novel Avg8 translation (not seen in the previous eight checkpoints without averaging) over time, shown separately for block-BT (red) and mix-BT (blue). The checkpoints trained in AUTH blocks are denoted by magenta background and letter A, while the SYNTH blocks are shown in yellow background and letter S. b Evaluation of translation quality by BLEU on WMT13 newstest set for four different versions of block-BT (left) and mix-BT (right), exploring the importance of novel Avg8 sentences created by checkpoint averaging. The general approach is to take the best system using checkpoint averaging (Avg), and substitute translations of novel Avg8 and not-novel Avg8 sentences with translations produced by the best system without checkpoint averaging (noAvg), observing the effect on BLEU. In blue is the BLEU achieved by the model with checkpoint averaging, while in purple is the BLEU achieved by the model without checkpoint averaging. In red is the BLEU of a system, which used checkpoint averaging, but where the translations that are not novel Avg8 were replaced by the translations produced by the system without checkpoint averaging. Conversely, yellow bars show BLEU of a system, which uses checkpoint averaging, but where the novel Avg8 translations were replaced by the version without checkpoint averaging.

Finally, we tested whether the generation of novel translations by averaging contributes to the synergy between the block regime and checkpoint averaging as measured by BLEU (Methods 25). We took the best model in block-BT with checkpoint averaging (block-BT-Avg; BLEU 28.24) and in block-BT without averaging (block-BT-NoAvg; BLEU 27.54). We next identified 988 sentences where the averaging in block-BT-Avg generated a novel Avg8 translation, unseen in the eight previous checkpoints without averaging. As we wanted to know what role the novel Avg8 sentences play in the improved BLEU of block-BT-Avg compared to block-BT-NoAvg (Fig.  2a ), we computed the BLEU of block-BT-Avg translations in which the translations of the 988 novel Avg8 sentences were replaced with the block-BT-NoAvg translations. This replacement led to a decrease of BLEU almost to the level of block-BT-NoAvg (BLEU 27.65, Fig.  8b ). Conversely, replacement of the 2012 not-novel Avg8 sentences resulted in only a small decrease (BLEU 28.13, Fig.  8b ), supporting the importance of novel translations in the success of block-BT with checkpoint averaging. For comparison, we repeated the same analysis with mix-BT and observed that replacement of novel Avg8 sentences in mix-BT had a negligible effect on the improvement of mix-BT-Avg over mix-BT-NoAvg (Fig.  8b ).
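
A sketch of this substitution analysis, assuming parallel lists of sentence-level outputs and a set of indices of the novel Avg8 sentences (all variable names are illustrative); corpus BLEU is recomputed with SacreBLEU:

```python
import sacrebleu

# Sketch of the substitution analysis: start from the averaged system's output
# and replace the translations of sentences flagged as "novel Avg8" with the
# corresponding non-averaged translations, then recompute corpus BLEU.
# avg_hyps, noavg_hyps, refs are parallel lists of sentence strings; novel_idx
# is a set of sentence indices (all hypothetical variable names).
def bleu_with_substitution(avg_hyps, noavg_hyps, refs, novel_idx):
    mixed = [noavg_hyps[i] if i in novel_idx else h
             for i, h in enumerate(avg_hyps)]
    return sacrebleu.corpus_bleu(mixed, [refs]).score
```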

Altogether, our analysis shows that the generation of novel sentences is an important mechanism by which checkpoint averaging combined with block-BT leads to synergistically improved performance. Specifically, averaging at the interface between authentic and synthetic blocks leads to the highest diversity and generation of novel translations, allowing the best features of the diverse translations in the two block types to be combined (examples in Fig.  7 , Supplementary Fig.  17 ).

In this work, we have shown that the deep-learning framework CUBBITT outperforms a professional human-translation agency in the adequacy of English→Czech news translations. In particular, this is achieved by making fewer errors in adding, omitting, or shifting the meaning of the translated sentences. At the same time, CUBBITT considerably narrowed the gap in translation fluency to humans, markedly outperforming previous state-of-the-art translation systems. The fact that the main advantage of CUBBITT is improved adequacy could be viewed as surprising, as it was thought that the main strength of NMT was increased fluency 24 . However, our results are in line with the study of Läubli et al. 23 , who observed that the deficit of NMT relative to humans is smaller in adequacy than in fluency. The improvement in translation quality is corroborated by a Translation Turing test, in which most participants failed to reliably discern CUBBITT translations from human ones.

Critically, our evaluation of translation quality was carried out in a fully context-aware evaluation setting. As discussed in this work and in other recent articles on this topic 22 , 23 , the previous standard approach of combining context-aware reference translation with context-free assessment gives an unfair advantage to machine translation. Consequently, this study is also an important contribution to MT evaluation practices and points out that the relevance of future evaluations in MT competitions such as WMT will increase when cross-sentence context is included. In addition, our design, in which fluency and adequacy are assessed separately, and by professional translators as well as non-professionals, brings interesting insight into evaluator priorities. The professional translators were observed to be more sensitive to errors in fluency than non-professionals and to have a stronger preference for fluency when rating the overall quality of a translation. Such a difference in preference is an important factor in designing studies that measure solely the overall translation quality. While in domains such as artistic writing fluency is clearly of utmost importance, there are domains (e.g., factual news articles) where an improvement in the preservation of meaning may be more important to a reader than a certain loss of fluency. Our robust context-aware evaluation with above-human performance in adequacy demonstrates that human translation is not necessarily an upper bound of translation quality, which had been a long-standing dogma in the field.

Among the key methodological advances of CUBBITT is the training regime termed block backtranslation, where blocks of authentic data alternate with blocks of synthetic data. Compared to traditional mixed backtranslation, where all the data are shuffled together, the block regime offers markedly increased diversity of the translations produced during training, suggesting a more explorative search for solutions to the translation problem. This increased diversity can then be greatly leveraged by the technique of checkpoint averaging, which is capable of finding consensus between networks trained on purely synthetic data and networks trained on authentic data, often combining the best of the two worlds. We speculate that such a block-training regime may be beneficial also for other ways of organizing data into blocks and may in theory be applicable beyond backtranslation, or even beyond the field of machine translation.

During the review of this manuscript, the WMT19 competition took place 26 . The test dataset was different and the evaluation methodology was revised compared to WMT18, which is why the results are not directly comparable (e.g., the translation company was explicitly instructed not to add/remove information from the translated sentences, which was a major source of adequacy errors in this study (Fig.  4a )). Based also on discussions with members of our team, the organizers of WMT19 implemented a context-aware evaluation. In this context-aware evaluation of the English→Czech news task, CUBBITT was the winning MT system and reached an overall quality score of 95.3% of that of the human translators (DA score 86.9 vs 91.2), which is similar to our study (94.8%, mean overall quality 7.4 vs 7.8, all annotators together). Given that WMT19 did not separate overall quality into adequacy and fluency, it is not possible to validate the potential super-human adequacy on their dataset.

Our study was performed on English→Czech news articles, and we have also validated the methodological improvements of CUBBITT using an automatic metric on English↔French and English↔Polish news articles. The generality of CUBBITT's success with regard to other language pairs and domains remains to be evaluated. However, the recent results from WMT19 on English→German show that in other languages, too, the human reference is not necessarily the upper bound of translation quality 26 .

The performance of machine translation is getting so close to that of human references that the quality of the reference translation matters. Highly qualified human translators with an unlimited amount of time and resources will likely produce better translations than any MT system. However, many clients cannot afford the costs of such translators and instead use the services of professional translation agencies, where the translators work under a certain time pressure. Our results show that the quality of professional-agency translations is not unreachable by MT, at least in certain aspects, domains, and languages. Nevertheless, we suggest that in future MT competitions and evaluations, it may be important to sample multiple human references (from multiple agencies and ideally also at multiple price levels).

We stress that CUBBITT is the result of years of open scientific collaboration and is a culmination of the transformation of the field. It started with the MT competitions that provided open data and ideas and continued through the open community of deep learning, which provided open-source code. The Transformer architecture significantly lowered the hardware requirements for training MT models (from months on multi-GPU clusters to days on a single machine 18 ). More effective utilization of monolingual data via iterated block backtranslation with checkpoint averaging, presented in this study, allows generating large amounts of high-quality synthetic parallel data to complement existing parallel datasets at little cost. Together, these techniques allow CUBBITT to be trained by the broad community and considerably extend the reach of MT.

1 CUBBITT model

Our CUBBITT translation system follows the Transformer architecture (Fig.  1 , Supplementary Fig.  1 ) introduced in Vaswani et al. 18 . Transformer has an encoder-decoder structure in which the encoder maps an input sequence of tokens (words or subword units) to a sequence of continuous deep representations z . Given z , the decoder generates an output sequence of tokens one element at a time. The decoder is autoregressive, i.e., it consumes the previously generated symbols as additional input when generating the next token.

The encoder is composed of a stack of identical layers, with each layer having two sublayers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection around each of the two sublayers, followed by layer normalization. The decoder is also composed of a stack of identical layers. In addition to the two sublayers from the encoder, the decoder inserts a third sublayer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sublayers, followed by layer normalization.

The self-attention layer in the encoder and decoder performs multi-head dot-product attention, each head mapping matrices of queries ( Q ), keys ( K ), and values ( V ) to an output vector, which is a weighted sum of the values V :

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,$$

where $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{n \times d_k}$, $V \in \mathbb{R}^{n \times d_v}$, n is the sentence length, $d_v$ is the dimension of the values, and $d_k$ is the dimension of the queries and keys. Attention weights are computed as the compatibility of the corresponding key and query and represent the relationship between deep representations of subwords in the input sentence (for encoder self-attention), the output sentence (for decoder self-attention), or between the input and output sentences (for encoder-decoder attention). In encoder and decoder self-attention, all queries, keys, and values come from the output of the previous layer, whereas in the encoder-decoder attention, keys and values come from the encoder's topmost layer and queries come from the decoder's previous layer. In the decoder, we modify the self-attention to prevent it from attending to following positions (i.e., rightward from the current position) by adding a mask, because the following positions are not known at inference time.
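
To make the computation above concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention (without the learned projection matrices and multi-head splitting of the full Transformer); the causal flag mimics the decoder mask described above.

```python
import numpy as np

# Minimal sketch of the scaled dot-product attention defined above
# (single head, no learned projections).
def scaled_dot_product_attention(Q, K, V, causal=False):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # compatibility of queries and keys
    if causal:                                # block attention to future positions
        n = scores.shape[0]
        scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                        # weighted sum of the values
```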

2 English–Czech training data

Our training data are constrained to the data allowed in the WMT 2018 News translation shared task 17 ( www.statmt.org/wmt18/translation-task.html ). Parallel (authentic) data are: CzEng 1.7, Europarl v7, News Commentary v11 and CommonCrawl. Monolingual data for backtranslation are: English (EN) and Czech (CS) NewsCrawl articles. Data sizes (after filtering, see below) are reported in Supplementary Table  1 .

While all our monolingual data are news articles, less than 1% of our parallel data are news (summing News Commentary v12 and the news portion of CzEng 1.7). The biggest sources of our parallel data are movie subtitles (63% of sentences), EU legislation (16% of sentences), and fiction (9% of sentences) 27 . Unfortunately, no finer-grained metadata specifying the exact training-data domains (such as politics, business, and sport) are available.

We filtered out ca. 3% of sentences in the monolingual data by restricting the length to 500 characters and, in the case of the Czech NewsCrawl, also by keeping only sentences containing at least one accented character (using the regular expression m/[ěščřžýáíéúůďťň]/i). This simple heuristic is surprisingly effective for Czech; it filters out not only sentences in languages other than Czech, but also various non-linguistic content, such as lists of football or stock-market results.
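
A sketch of this filtering heuristic (the 500-character limit and the accented-character test), written as a stand-alone Python function for illustration:

```python
import re

# Keep sentences of at most 500 characters and, for Czech NewsCrawl, require
# at least one accented Czech character (case-insensitive), which also
# discards most non-Czech and non-linguistic lines.
ACCENTED_CS = re.compile(r"[ěščřžýáíéúůďťň]", re.IGNORECASE)

def keep_sentence(sentence, require_czech_accent=False):
    if len(sentence) > 500:
        return False
    if require_czech_accent and not ACCENTED_CS.search(sentence):
        return False
    return True
```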

We divided the Czech NewsCrawl (synthetic data) into two parts: years 2007–2016 (58,231k sentences) and year 2017 (7152k sentences). When training block-BT, we simply concatenated four blocks of training data: authentic, synthetic 2007–2016, authentic, and synthetic 2017. The sentences were randomly shuffled within each of these four blocks, but not across the block boundaries. When training mix-BT, we used exactly the same training sentences, but shuffled them fully. This means we upsampled the authentic training data two times. The actual ratio of authentic and synthetic data (as measured by the number of subword tokens) in the mix-BT training data was approximately 1.2:1.
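
The two data regimes can be sketched as follows (illustrative Python; each corpus is assumed to be a list of sentence pairs, and the authentic data are repeated twice to match the block layout described above):

```python
import random

# Sketch of the block-BT and mix-BT training-data regimes described above.
def build_block_bt_stream(auth, synth_2007_2016, synth_2017, seed=1):
    rng = random.Random(seed)
    blocks = [list(auth), list(synth_2007_2016), list(auth), list(synth_2017)]
    for block in blocks:           # shuffle within each block only
        rng.shuffle(block)
    return [pair for block in blocks for pair in block]

def build_mix_bt_stream(auth, synth_2007_2016, synth_2017, seed=1):
    rng = random.Random(seed)
    stream = list(auth) * 2 + list(synth_2007_2016) + list(synth_2017)
    rng.shuffle(stream)            # full shuffle across all data
    return stream
```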

3 English–Czech development and test data

The WMT shared task on news translation provides a new test-set (with ~3000 sentences) each year, collected from recent news articles. (WMT stands for Workshop on Statistical Machine Translation; in 2016, WMT was renamed the Conference on Machine Translation, keeping the legacy abbreviation WMT. For more information see the WMT 2018 website http://www.statmt.org/wmt18 .) The reference translations are created by professional translation agencies. All of the translations are done directly, and not via an intermediate language. Test sets from previous years are allowed to be used as development data in WMT shared tasks.

We used WMT13 (short name for WMT newstest2013) as the primary development set in our experiments (e.g., Figure  2a ). We used WMT17 as a test-set for measuring BLEU scores in Fig.  2c . We used WMT18 (more precisely, its subset WMT18-orig-en, see below) as our final manual-evaluation test-set. Data sizes are reported in Supplementary Table  2 .

In WMT test sets since 2014, half of the sentences for a language pair X-EN originate from English news servers (e.g., bbc.com) and the other half from X-language news servers. All WMT test sets include the server name for each document in the metadata, so we were able to split our dev and test sets into two parts: originally Czech (orig-cs, for Czech-domain articles, i.e., documents with a docid containing “.cz”) and originally English (orig-en, for non-Czech-domain articles). The WMT13-orig-en part of our WMT13 development set contains not only originally English articles, but also articles written originally in French, Spanish, German, and Russian; however, the Czech reference translations were translated from English. In WMT18-orig-en, all the articles were originally written in English.
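
A sketch of this split, assuming a hypothetical list of (docid, sentences) pairs:

```python
# Split a WMT test set into orig-cs and orig-en parts based on whether the
# document id points to a Czech news server (".cz" in the docid).
def split_by_origin(docs):
    orig_cs = [(d, s) for d, s in docs if ".cz" in d]
    orig_en = [(d, s) for d, s in docs if ".cz" not in d]
    return orig_cs, orig_en
```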

According to Bojar et al. 17 , the Czech references in WMT18 were translated from English “by the professional level of service of Translated.net, preserving 1–1 segment translation and aiming for literal translation where possible. Each language combination included two different translators: the first translator took care of the translation, the second translator was asked to evaluate a representative part of the work to give a score to the first translator. All translators translate towards their mother tongue only and need to provide a proof of their education or professional experience, or to take a test; they are continuously evaluated to understand how they perform on the long term. The domain knowledge of the translators is ensured by matching translators and the documents using T-Rank, http://www.translated.net/en/T-Rank .”

Toral et al. 22 furthermore warned about post-edited MT used as human references. However, Translated.net confirmed that MT was completely deactivated during the process of creating WMT18 reference translations (personal communication).

4 English–French data

The English–French parallel training data were downloaded from WMT2014 ( http://statmt.org/wmt14/translation-task.html ). The monolingual data were downloaded from WMT 2018 (making sure there is no overlap with the development and test data). We filtered the data for being English/French using the langid toolkit ( http://pypi.org/project/langid/ ). Data sizes after filtering are reported in Supplementary Table  3 . When training English–French block-BT, we concatenated the French NewsCrawl2008–2014 (synthetic data) and authentic data, with no upsampling. When training French–English block-BT, we split the English NewsCrawl into three parts: 2011–2013, 2014–2015, and 2016–2017 and interleaved with three copies of the authentic training data, i.e., upsampling the authentic data three times. We always trained mix-BT on a fully shuffled version of the data used for the respective block-BT training.

Development and test data are reported in Supplementary Table  4 .

5 English–Polish data

The English–Polish training and development data were downloaded from WMT2020 ( http://statmt.org/wmt20/translation-task.html ). We filtered the data for being English/Polish using the FastText toolkit ( http://pypi.org/project/fasttext/ ). Data sizes after filtering are reported in Supplementary Table  5 . When training English–Polish block-BT, we upsampled the authentic data two times and concatenated with the Polish NewsCrawl2008–2019 (synthetic data) upsampled six times. When training Polish–English block-BT, we upsampled the authentic data two times and concatenated with English NewsCrawl2018 (synthetic data, with no upsampling). We always trained mix-BT on a fully shuffled version of the data used for the respective block-BT training.

Development and test data are reported in Supplementary Table  6 .

6 CUBBITT training: BLEU score

BLEU 28 is a popular automatic measure for MT evaluation and we use it for hyperparameter tuning. Similarly to most other automatic MT measures, BLEU estimates the similarity between the system translation and the reference translation. BLEU is based on the n-gram (unigram up to 4-gram) precision of the system translation relative to the reference translation, with a brevity penalty to penalize translations that are too short. We report BLEU scaled to 0–100, as is usual in most papers (although BLEU was originally defined on a 0–1 scale by Papineni et al. 28 ); the higher the BLEU value, the better the translation. We use the SacreBLEU implementation 29 with the signature BLEU+case.mixed+lang.en-cs+numrefs.1+smooth.exp+tok.13a.
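
For illustration, a system output can be scored with the SacreBLEU Python API as sketched below (default settings correspond to 13a tokenization and mixed case); the file names are placeholders.

```python
import sacrebleu

# Score a system output against a single reference with SacreBLEU.
with open("wmt13.cs.hyp") as f_hyp, open("wmt13.cs.ref") as f_ref:
    hyps = [line.strip() for line in f_hyp]
    refs = [line.strip() for line in f_ref]

bleu = sacrebleu.corpus_bleu(hyps, [refs])   # 13a tokenization, mixed case by default
print(f"BLEU = {bleu.score:.2f}")            # scaled 0-100
```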

7 CUBBITT training: hyperparameters

We use the Transformer “big” model from the Tensor2Tensor framework v1.6.0 18 . We followed the training setup and tips of Popel and Bojar 30 and Popel et al. 31 , training our models with the Adafactor optimizer 32 instead of the default Adam optimizer. We use the following hyperparameters: learning_rate_schedule = rsqrt_decay, batch_size = 2900, learning_rate_warmup_steps = 8000, max_length = 150, layer_prepostprocess_dropout = 0, optimizer = Adafactor. For decoding, we use alpha = 1.0, beam_size = 4.
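
As an illustration, these hyperparameters could be serialized into a Tensor2Tensor --hparams override string as sketched below; the exact t2t-trainer invocation is an assumption and is shown only schematically.

```python
# Sketch: build a T2T-style hparams override string from the values listed
# above (flag names follow Tensor2Tensor conventions; treat the exact
# command line as an assumption).
hparams = {
    "learning_rate_schedule": "rsqrt_decay",
    "batch_size": 2900,
    "learning_rate_warmup_steps": 8000,
    "max_length": 150,
    "layer_prepostprocess_dropout": 0,
    "optimizer": "Adafactor",
}
hparams_flag = ",".join(f"{k}={v}" for k, v in hparams.items())
print(f"t2t-trainer --model=transformer --hparams_set=transformer_big "
      f"--hparams='{hparams_flag}' ...")
```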

8 CUBBITT training: checkpoint averaging

A popular way of improving the translation quality in NMT is ensembling, where several independent models are trained and during inference (decoding, translation) each target token (word) is chosen according to an averaged probability distribution (using argmax in the case of greedy decoding) and used for further decisions in the autoregressive decoder of each model.

However, ensembling is expensive in both training and inference time. The training time can be decreased by using checkpoint ensembles 33 , where the N last checkpoints of a single training run are used instead of N independently trained models. Checkpoint ensembles are usually worse than independent ensembles 33 , but they allow more models to be used in the ensemble thanks to the shorter training time. The inference time can be decreased by using checkpoint averaging, where the weights (learned parameters of the network) in the N last checkpoints are element-wise averaged, creating a single averaged model.

Checkpoint averaging was first used in NMT by Junczys-Dowmunt et al. 34 , who report that averaging four checkpoints is “not much worse than the actual ensemble” of the same four checkpoints and is better than ensembles of two checkpoints. Averaging ten checkpoints “even slightly outperforms the real four-model ensemble”.

Checkpoint averaging has been popular in recent NMT systems because it has almost no additional cost (averaging takes only several minutes), the results of averaged models have lower variance in BLEU and are usually at least slightly better than models without averaging 30 .

In our experiments, we store checkpoints each hour and average the last 8 checkpoints.
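
A minimal sketch of checkpoint averaging, representing each checkpoint as a dict of NumPy arrays (a simplification of the actual TensorFlow checkpoint format):

```python
import numpy as np

# Element-wise arithmetic mean of the learned parameters of the N last
# checkpoints, producing a single averaged model.
def average_checkpoints(checkpoints):
    names = checkpoints[0].keys()
    return {
        name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
        for name in names
    }

# e.g. averaged = average_checkpoints(last_checkpoints[-8:])
```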

9 CUBBITT training: Iterated backtranslation

For our initial experiments with backtranslation, we reused an existing CS → EN system UEdin (Nematus software trained by a team from the University of Edinburgh and submitted to WMT 2016 35 ). This system itself was trained using backtranslation. We decided to iterate the backtranslation process further by using our EN → CS Transformer to translate English monolingual data and use that for training a higher quality CS → EN Transformer, which was in turn used for translating Czech monolingual data and training our final EN → CS Transformer system called CUBBITT. Supplementary Fig.  2 illustrates this process and provides details about the training data and backtranslation variants (mix-BT in MT1 and block-BT in MT2–4) used.

Each training run we performed (MT3–5 in Supplementary Fig. 2) took ca. eight days on a single machine with eight GTX 1080 Ti GPUs. Translating the monolingual data with UEdin2016 (MT0) took ca. two weeks, and with our Transformer models (MT1–3) it took ca. 5 days.

10 CUBBITT training: translationese tuning

It has been observed that text translated from language X into language Y has different properties (such as lexical choice or syntactic structure) compared to text originally written in language Y 36 . The term translationese is used in translation studies (translatology) for this phenomenon (and sometimes also for the translated language itself).

We noticed that when training on synthetic data, the model performs much better on the WMT13-orig-cs dev set than on the WMT13-orig-en dev set. When trained on authentic data, it is the other way round. Intuitively, this makes sense: the target side of our synthetic data consists of original Czech sentences from Czech newspapers, similarly to the WMT13-orig-cs dataset. In our authentic parallel data, over 90% of sentences were originally written in English about non-Czech topics and translated into Czech (by human translators), similarly to the WMT13-orig-en dataset. Two closely related phenomena are at play: the domain (topics) of the training data and the so-called translationese effect, i.e., which side of the parallel training data (and test data) is the original and which is the translation.

Based on these observations, we prepared an orig-cs-tuned model and an orig-en-tuned model. Both models were trained in the same way; they differ only in the number of training steps. For the orig-cs-tuned model, we selected the checkpoint with the best performance on WMT13-orig-cs (the Czech-origin portion of WMT newstest2013), which was at 774k steps. Similarly, for the orig-en-tuned model, we selected the checkpoint with the best performance on WMT13-orig-en, which was at 788k steps. Note that both models come from a single training run; we merely selected checkpoints at two different moments. The WMT18-orig-en part of the test set was translated using the orig-en-tuned model and the WMT18-orig-cs part using the orig-cs-tuned model.

11 CUBBITT training: regex postediting

We applied two simple post-processing steps to the translations, using regular expressions. First, we converted quotation marks in the translations to the correct Czech lower and upper quotes („ and “) using two regexes: s/(^|[({[])(“|,,|”|“)/$1„/g and s/(“|”)($|[,.?!:;)}\]])/“$2/g. Second, we deleted phrases repeated more than twice (immediately following each other); we kept just the first occurrence. We considered phrases of one up to four words. This post-processing affected less than 1% of sentences in the dev set.
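A Python approximation of these two steps is sketched below. The quotation regexes are transliterated from the Perl-style patterns above (assuming the first alternative was a plain straight quote), and the repeated-phrase rule is implemented as a single backreference regex; neither is guaranteed to be byte-for-byte identical to the scripts used for CUBBITT.

```python
import re


def postedit(translation):
    # 1) Convert quotation marks to Czech lower/upper quotes („ and “).
    translation = re.sub(r'(^|[({\[])("|,,|”|“)', r'\1„', translation)
    translation = re.sub(r'("|”)($|[,.?!:;)}\]])', r'“\2', translation)
    # 2) Collapse phrases of one to four words repeated more than twice in a row,
    #    keeping only the first occurrence.
    translation = re.sub(r'\b((?:\w+ ){0,3}?\w+)( \1\b){2,}', r'\1', translation)
    return translation


if __name__ == "__main__":
    print(postedit("the cat the cat the cat sat on the mat"))  # -> "the cat sat on the mat"
```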

12 CUBBITT training: English–French and English–Polish

We trained English→French, French→English, English→Polish, and Polish→English versions of CUBBITT, following the abovementioned English–Czech setup, but using the training data described in Supplementary Tables  3 and 5 and the training diagram in Supplementary Fig.  3 . All systems (including M1 and M2) were trained with the Tensor2Tensor Transformer (no Nematus was involved). Iterated backtranslation was tried only for French→English. No translationese tuning was used (because we report just the BLEU training curve, with no experiments requiring final checkpoint selection). No regex post-editing was used.

13 Reanalysis of context-unaware evaluation in WMT18

We first reanalyzed results from the context-unaware evaluation of the WMT 2018 English–Czech News Translation Task, provided to us by the WMT organizers ( http://statmt.org/wmt18/results.html ). The data shown in Fig.  3a were processed in the same way as by the WMT organizers: scores with BAD and REF types were first removed, a grouped score was computed as the average score for every triple of language pair (“Pair”), MT system (“SystemID”), and sentence (“SegmentID”), and the systems were sorted by their average score. In Fig.  3a , we show the distribution of the grouped scores for each of the MT systems, using a paired two-tailed sign test to assess the significance of differences between consecutive systems.
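For illustration, the grouping and the paired sign test could be implemented roughly as follows, assuming the WMT ratings are available as a pandas DataFrame with columns Pair, SystemID, SegmentID, Type, and Score (the column names and the use of pandas/SciPy are our assumptions, not the organizers' pipeline; scipy >= 1.7 is assumed for binomtest).

```python
import numpy as np
import pandas as pd
from scipy.stats import binomtest


def grouped_scores(ratings: pd.DataFrame) -> pd.DataFrame:
    """Drop BAD/REF ratings and average per (Pair, SystemID, SegmentID) triple."""
    kept = ratings[~ratings["Type"].isin(["BAD", "REF"])]
    return (kept.groupby(["Pair", "SystemID", "SegmentID"])["Score"]
                .mean()
                .reset_index())


def paired_sign_test(scores_a: np.ndarray, scores_b: np.ndarray) -> float:
    """Two-tailed sign test on paired segment-level scores (ties are discarded)."""
    diff = scores_a - scores_b
    n_pos = int((diff > 0).sum())
    n_nonzero = int((diff != 0).sum())
    return binomtest(n_pos, n_nonzero, 0.5, alternative="two-sided").pvalue
```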

We next assessed whether the results could be confounded by the original language of the source. Specifically, one half of the test-set sentences in WMT18 were originally English sentences translated to Czech by a professional agency, while the other half were English translations of originally Czech sentences. However, both types of sentences were used together for evaluation of both translation directions in the competition. Since the direction of translation could affect the evaluation, we first re-evaluated the MT systems in WMT18 by splitting the test-set according to the original language in which the source sentences were written.

Although the absolute values of the source-based direct assessment were lower for all systems and the reference translation on originally English source sentences than on originally Czech sentences, CUBBITT significantly outperformed the human reference and the other MT systems in both test sets (Supplementary Fig.  4 ). We checked that this was true also when comparing z-score normalized scores and when using the unpaired one-tailed Mann–Whitney U test, as done by the WMT organizers.

Any further evaluation in our study was performed only on documents with the source side as the original text, i.e., with originally English sentences in the English→Czech evaluations.

14 Context-aware evaluation: methodology

Three groups of paid evaluators were recruited: six professional translators, three translation theoreticians, and seven other evaluators (non-professionals). All 16 evaluators were native Czech speakers with excellent knowledge of the English language. The professional translators were required to have at least 8 years of professional translation experience and they were contacted via The Union of Interpreters and Translators ( http://www.jtpunion.org/ ). The translation theoreticians were from The Institute of Translation Studies, Charles University’s Faculty of Arts ( https://utrl.ff.cuni.cz/ ). Guidelines presented to the evaluators are given in Supplementary Methods 1.1.

For each source sentence, evaluators compared two translations: Translation T1 (the left column of the annotation interface) vs Translation T2 (the right column of the annotation interface). Within one document (news article), Translation T1 was always the reference and Translation T2 was always CUBBITT, or vice versa (i.e., each column within one document was purely the reference translation or purely CUBBITT). However, evaluators did not know which system was which, nor that one of them was a human translation and the other a machine translation. The order of reference and CUBBITT was random in each document. Each evaluator encountered the reference as Translation T1 in approximately one half of the documents.

Evaluators scored 10 consecutive sentences (or the entire document if shorter than 10 sentences) from a random section of the document (the same section was used in both T1 and T2 and by all evaluators scoring this document), but they had access to the source side of the entire document (Supplementary Fig.  5 ).

Every document was scored by at least two evaluators (2.55 ± 0.64 evaluators on average). The documents were assigned to evaluators in such a way that every evaluator scored nine different nonspam documents and most pairs of evaluators had at least one document in common. This maximized the diversity of annotator pairs in the computation of interannotator agreement. In total, 135 (53 unique) documents and 1304 (512 unique) sentences were evaluated by the 15 evaluators who passed quality control (see below).

15 Context-aware evaluation: quality control

The quality control check of evaluators was performed using a spam document, similarly to Läubli et al. 23 and Kittur et al. 37. In MT translations of the spam document, the middle words (i.e., all except the first and last word of the sentence) were randomly shuffled in each of the middle six sentences of the document (i.e., the first and last two sentences were kept intact). We ascertained that the resulting spam translations made no sense.
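A minimal sketch of how such a spam document can be generated (the helper below is ours, not the exact script used in the study): the first and last two sentences are kept intact, and in each of the following six sentences all words except the first and the last are shuffled.

```python
import random


def make_spam_document(sentences, n_intact_edge=2, n_spam=6, seed=0):
    """Return a copy of `sentences` with the middle words of the middle sentences shuffled."""
    rng = random.Random(seed)
    spam = list(sentences)
    last_spam = min(n_intact_edge + n_spam, len(spam) - n_intact_edge)
    for i in range(n_intact_edge, last_spam):
        words = spam[i].split()
        if len(words) > 3:  # need at least two middle words to shuffle
            middle = words[1:-1]
            rng.shuffle(middle)
            spam[i] = " ".join([words[0]] + middle + [words[-1]])
    return spam
```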

The criterion for evaluators to pass the quality control was to score at least 90% of reference sentences better than all spam sentences (in each category: adequacy, fluency, overall). One non-professional evaluator did not pass the quality control, giving three spam sentences a higher score than 10% of the reference sentences. We excluded the evaluator from the analysis of the results (but the key results reported in this study would hold even when including the evaluator).

16 Context-aware evaluation: interannotator agreement

We used two methods to compute interannotator agreement (IAA) on the paired scores (CUBBITT—reference difference) in adequacy, fluency, and overall quality for the 15 evaluators. First, for every evaluator, we computed the Pearson and Spearman correlation of his/her scores on individual sentences with a consensus of scores from all other evaluators. This consensus was computed for every sentence as the mean of evaluations by the other evaluators who scored this sentence. The correlation was significant after Benjamini–Hochberg correction for multiple testing for all evaluators in adequacy, fluency, and overall quality. The median and interquartile range of the Spearman r of the 15 evaluators were 0.42 (0.33–0.49) for adequacy, 0.49 (0.35–0.55) for fluency, and 0.49 (0.43–0.54) for overall quality. The median and interquartile range of the Pearson r of the 15 evaluators were 0.42 (0.32–0.49) for adequacy, 0.47 (0.39–0.55) for fluency, and 0.46 (0.40–0.50) for overall quality.
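The first IAA measure can be computed along the following lines, assuming the scores are stored as a nested dictionary scores[evaluator][sentence_id] (this data layout is our assumption).

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr


def consensus_correlations(scores):
    """Correlate each evaluator's scores with the mean score of all other evaluators."""
    results = {}
    for evaluator, own in scores.items():
        xs, ys = [], []
        for sentence, value in own.items():
            others = [scores[e][sentence] for e in scores
                      if e != evaluator and sentence in scores[e]]
            if others:
                xs.append(value)
                ys.append(np.mean(others))
        results[evaluator] = {"pearson": pearsonr(xs, ys)[0],
                              "spearman": spearmanr(xs, ys)[0]}
    return results
```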

Second, we computed Kappa in the same way as in WMT 2012–2016 38 , separately for adequacy, fluency, and overall quality (Supplementary Table  7 ).

17 Context-aware evaluation: statistical analysis

First, we computed the average score for every sentence from all evaluators who scored the sentence within the group (non-professionals, professionals, translation theoreticians for Fig.  3 and Supplementary Fig.  7B ) or within the entire cohort (for Supplementary Fig.  7A ). The difference between human reference and CUBBITT translations was assessed using a paired two-tailed sign test (MATLAB function signtest), and P-values below 0.05 were considered statistically significant.

In the analysis of the relative contribution of adequacy and fluency to the overall score (Supplementary Fig.  6 ), we fitted a linear model through the scores of all sentences, separately for human reference translations and CUBBITT translations for every evaluator, using the MATLAB function fitlm(tableScores, ‘overall~adequacy+fluency’, ‘RobustOpts’, ‘on’, ‘Intercept’, false).

18 Context-aware evaluation: analysis of document types

For the analysis of document types (Supplementary Fig.  11 ), we grouped the 53 documents (news articles) into seven classes: business (including economics), crime, entertainment (including art, film, and one article about architecture), politics, scitech (science and technology), sport, and world. We then compared the relative difference (human reference minus CUBBITT) in document-level and sentence-level scores, using a sign test to assess the difference between the two translations.

19 Evaluation of error types in context-aware evaluation

Three non-professional and three professional-translator evaluators performed a follow-up evaluation of error types after they had finished the basic context-aware evaluation. Nine columns were added to the annotation sheets next to their evaluations of quality (adequacy, fluency, and overall quality) of each of the two translations. The evaluators were asked to classify all translation errors into one of eight error types and to identify sentences with an error due to cross-sentence context (see guidelines). In total, 54 (42 unique) documents and 523 (405 unique) sentences were evaluated by the six evaluators. Guidelines presented to the evaluators are given in Supplementary Methods 1.2.

Similarly to the interannotator agreement analysis above, we computed IAA Kappa scores for each error type, based on the CUBBITT—reference difference (Supplementary Table  8 ).

When carrying out the statistical analysis, we first grouped the scores of sentences with multiple evaluations by computing the average number of errors per sentence and error type from the scores of all evaluators who scored the given sentence. Next, we compared the percentage of sentences with at least one error (Fig.  4a ) and the number of errors per sentence (Supplementary Fig.  9 ), using a sign test to compare the difference between human reference and CUBBITT translations.

20 Evaluation of five MT systems

Five professional-translator evaluators performed this follow-up evaluation after they had finished the previous evaluations. For each source sentence, the evaluators compared five translations by five MT systems: Google Translate from 2018, UEdin from 2018, a Transformer trained with one iteration of mix-BT (as MT2 in Supplementary Fig. 2, but with mix-BT instead of block-BT), a Transformer trained with one iteration of block-BT (MT2 in Supplementary Fig. 2), and the final CUBBITT system. Within one document, the order of the five systems was fixed, but it was randomized between documents. Evaluators were not given any details about the five translations (such as whether they were human or machine translations, or which MT systems produced them). Every evaluator was assigned only documents that he/she had not yet evaluated in the basic quality + error types evaluations. Guidelines presented to the evaluators are given in Supplementary Methods 1.3.

Evaluators scored 10 consecutive sentences (or the entire document if this was shorter than 10 sentences) from a random section of the document (the same for all five translations), but had access to the source side of the entire document. Every evaluator scored nine different documents. In total, 45 (33 unique) documents and 431 (336 unique) sentences were evaluated by the five evaluators.

When measuring interannotator agreement, in addition to reporting IAA Kappa scores for the evaluation of all five systems (as usual in WMT) in Supplementary Table  9 , we also provide IAA Kappa scores for each pair of systems in Supplementary Fig.  12 . This confirms the expectation that a higher interannotator agreement is achieved in comparisons of pairs of systems with a large difference in quality.

When carrying out the statistical analysis, we first grouped the scores of sentences with multiple evaluations by computing the fluency and adequacy score per sentence and translation from the scores of all evaluators who scored the given sentence. Next, we sorted the MT systems by their mean score, using a sign test to compare the differences between consecutive systems (for Fig.  4b ). An evaluation of the entire test set (all originally English sentences) using BLEU is shown for comparison in Supplementary Fig.  13 .

21 Translation Turing test

Participants of the Translation Turing test were unpaid volunteers. The participants were randomly assigned into four non-overlapping groups: A1, A2, B1, B2. Groups A1 and A2 were presented translations by both human reference and CUBBITT. Groups B1 and B2 were presented translations by both human reference and Google Translate (obtained from https://translate.google.cz/ on 13 August 2018). The source sentences in the four groups were identical. Guidelines presented to the evaluators are given in Supplementary Methods 1.4.

The evaluated sentences were taken from the originally English part of the WMT18 evaluation test-set (i.e., WMT18-orig-en) and shuffled into a random order. For each source sentence, it was randomly decided whether the reference translation would be presented to group A1 or A2; the other group was presented this sentence with the translation by CUBBITT. Similarly, for each source sentence, it was randomly decided whether the reference translation would be presented to group B1 or B2; the other group was presented this sentence with the translation by Google Translate. Every participant was therefore presented human and machine translations in approximately a 1:1 ratio (but this information was intentionally concealed from them).

Each participant encountered each source sentence at most once (i.e., with only one translation), but each source sentence was evaluated for all three systems. (Reference was evaluated twice, once in the A groups and once in the B groups.) Each participant was presented with 100 sentences. Only participants with more than 90 evaluated sentences were included in our study.

The Translation Turing test was performed as the first evaluation in this study (but after the WMT18 competition) and participants who overlapped with the evaluators of the context-aware evaluations were not shown results from the Turing test before they finished all the evaluations.

In total, 15 participants evaluated a mix of human and CUBBITT translations (five professional translators, six MT researchers, and four other), 16 participants evaluated a mix of human and Google Translate translations (eight professional translators, five MT researchers, and three other). A total of 3081 sentences were evaluated by all participants of the test.

When measuring interannotator agreement, we computed the IAA Kappas (Supplementary Table  10 ) using our own script, treating the task as a simple binary classification. While in the previous types of evaluations, we computed the IAA Kappa scores using the script from WMT 2016 38 , this was not possible in the Translation Turing test, which does not involve any ranking.

When carrying out the statistical analysis, we computed the accuracy for each participant as the percentage of sentences with correctly identified machine or human translations (i.e., the number of true positives + true negatives divided by the number of scored sentences), and the significance was assessed using the Fisher test on the contingency table. The resulting P-values were corrected for multiple testing with the Benjamini–Hochberg method using the MATLAB function fdr_bh(pValues,0.05,‘dep’,‘yes’) 39, and participants with a resulting Q-value below 0.05 were considered to have significantly distinguished between human and machine translations.
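A Python sketch of the per-participant analysis, using SciPy and statsmodels in place of the MATLAB functions mentioned above; the exact layout of the 2x2 contingency table (true label vs. guessed label) is our assumption, and plain Benjamini-Hochberg is shown (the 'dep' option of fdr_bh corresponds to the dependency-adjusted variant, available in statsmodels as method="fdr_by").

```python
import numpy as np
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests


def participant_stats(true_is_mt, guessed_mt):
    """Accuracy and Fisher-test P-value for one participant.

    true_is_mt, guessed_mt: boolean NumPy arrays over the participant's sentences.
    """
    table = np.array([[np.sum(true_is_mt & guessed_mt), np.sum(true_is_mt & ~guessed_mt)],
                      [np.sum(~true_is_mt & guessed_mt), np.sum(~true_is_mt & ~guessed_mt)]])
    accuracy = (table[0, 0] + table[1, 1]) / table.sum()
    _, p_value = fisher_exact(table)
    return accuracy, p_value


def correct_pvalues(p_values, alpha=0.05):
    """Benjamini-Hochberg correction across participants."""
    reject, q_values, _, _ = multipletests(p_values, alpha=alpha, method="fdr_bh")
    return reject, q_values
```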

22 Block-BT and checkpoint averaging synergy

In this analysis, the four systems from Fig.  2a were compared: block-BT vs mix-BT, both with (Avg) vs without (noAvg) checkpoint averaging. All four systems were trained with a single iteration of backtranslation only, i.e., corresponding to the MT2 system in Supplementary Fig.  2 . The WMT13 newstest (3000 sentences) was used to evaluate two properties of the systems over time: translation diversity and generation of novel translations by checkpoint averaging. These properties were analyzed over the time of the training (up to 1 million steps), during which checkpoints were saved every hour (up to 214 checkpoints).

23 Overall diversity and novel translation quantification

We first computed the overall diversity as the number of all the different translations produced by the 139 checkpoints between 350,000 and 1,000,000 steps. In particular, for every sentence in WMT13 newstest, the number of unique translations was computed in the hourly checkpoints, separately for block-BT-noAvg and mix-BT-noAvg. Comparing the two systems in every sentence, block-BT-noAvg produced more unique translations in 2334 (78%) sentences; mix-BT-noAvg produced more unique translations in 532 (18%) sentences; and the numbers of unique translations were equal in 134 (4%) sentences.

Next, in the same checkpoints and for every sentence, we compared translations produced by models with and without averaging and computed the number of checkpoints with a novel Avg∞ translation. These are defined as translations that were never produced by the same system without checkpoint averaging (by never we mean in none of the checkpoints between 350,000 and 1,000,000). In total, there were 1801 (60%) sentences with at least one checkpoint with novel Avg∞ translation in block-BT and 949 (32%) in mix-BT. When comparing the number of novel Avg∞ translations in block-BT vs mix-BT in individual sentences, there were 1644 (55%) sentences with more checkpoints with novel Avg∞ translations in block-BT, 184 (6%) in mix-BT, and 1172 (39%) with equal values.
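The counts above can be reproduced with straightforward set operations; the sketch below assumes the translations are stored as lists indexed first by checkpoint and then by sentence (this data layout is our assumption).

```python
def unique_translation_counts(checkpoint_outputs):
    """Number of distinct translations per sentence across all given checkpoints.

    checkpoint_outputs: list (per checkpoint) of lists (per sentence) of strings.
    """
    n_sentences = len(checkpoint_outputs[0])
    return [len({outputs[i] for outputs in checkpoint_outputs})
            for i in range(n_sentences)]


def novel_avg_counts(avg_outputs, noavg_outputs):
    """Per sentence, count checkpoints whose Avg translation never appears in any
    NoAvg checkpoint (the 'novel Avg-infinity' translations described above)."""
    n_sentences = len(avg_outputs[0])
    counts = []
    for i in range(n_sentences):
        seen_noavg = {outputs[i] for outputs in noavg_outputs}
        counts.append(sum(outputs[i] not in seen_noavg for outputs in avg_outputs))
    return counts
```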

24 Diversity and novel translations over time

First, we evaluated development of translation diversity over time using moving window of octuples of checkpoints in the two systems without checkpoint averaging. In particular, for every checkpoint and every sentence, we computed the number of different unique translations in the last eight checkpoints. The average across sentences is shown in Supplementary Fig.  16 , separately for block-BT-noAvg and mix-BT-noAvg.

Second, we evaluated development of novel translations by checkpoint averaging over time. In particular, for every checkpoint and every sentence, we evaluated whether the Avg model created a novel Avg8 translation, i.e., whether the translation differed from all the translations of the last eight noAvg checkpoints. The percentage of sentences with a novel Avg8 translation in the given checkpoint is shown in Fig.  8a , separately for block-BT and mix-BT.

25 Effect of novel translations on evaluation by BLEU

We first identified the best model (checkpoint) for each of the systems according to BLEU: checkpoint 775178 in block-BT-Avg (BLEU 28.24), checkpoint 775178 in block-BT-NoAvg (BLEU 27.54), checkpoint 606797 in mix-BT-Avg (BLEU 27.18), and checkpoint 606797 in mix-BT-NoAvg (BLEU 26.92). We note that the Avg and NoAvg systems do not necessarily have the same best checkpoint according to BLEU; here, however, it was the case for both block-BT and mix-BT. We next identified which translations in block-BT-Avg and in mix-BT-Avg were novel Avg8 translations (i.e., not seen in the last eight NoAvg checkpoints). There were 988 novel Avg8 sentences in block-BT-Avg and 369 in mix-BT-Avg. Finally, we computed BLEU of the Avg translations in which either the novel Avg8 translations were replaced with their NoAvg versions (yellow bars in Fig.  8b ), or vice versa (orange bars in Fig.  8b ); separately for block-BT and mix-BT.
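The replacement analysis boils down to swapping a subset of hypotheses before recomputing corpus BLEU. The sketch below uses the sacrebleu package as a stand-in for the BLEU implementation used in the paper; the flag list marking novel Avg8 sentences is assumed to be precomputed.

```python
import sacrebleu


def bleu_with_swaps(base_hyps, alt_hyps, swap_flags, refs):
    """Corpus BLEU after replacing flagged hypotheses from base_hyps with alt_hyps.

    E.g., base_hyps = Avg translations, alt_hyps = NoAvg translations, and
    swap_flags[i] = True for novel Avg8 sentences reproduces the 'yellow bar' setting.
    """
    hyps = [alt if swap else base
            for base, alt, swap in zip(base_hyps, alt_hyps, swap_flags)]
    return sacrebleu.corpus_bleu(hyps, [refs]).score
```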

Reporting summary

Further information on research design is available in the  Nature Research Reporting Summary linked to this article.

Data availability

Data used for comparison of human and machine translations may be downloaded at http://hdl.handle.net/11234/1-3209 .

Code availability

The CUBBITT source code is available at https://github.com/tensorflow/tensor2tensor . Codes for analysis of human and machine translations were uploaded together with the analyzed data at http://hdl.handle.net/11234/1-3209 .

Hirschberg, J. & Manning, C. D. Advances in natural language processing. Science 349 , 261–266 (2015).

Bojar, O. Machine Translation. In Oxford Handbooks in Linguistics 323–347 (Oxford University Press, 2015).

Hajič, J. et al. Natural Language Generation in the Context of Machine Translation  (Center for Language and Speech Processing, Johns Hopkins University, 2004).

Vanmassenhove, E., Hardmeier, C. & Way, A. Getting gender right in neural machine translation. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing 3003–3008 (Association for Computational Linguistics, 2018).

Artetxe, M., Labaka, G., Agirre, E. & Cho, K. Unsupervised neural machine translation. In 6th International Conference on Learning Representations, ICLR 2018—Conference Track Proceedings (2018).

Lecun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521 , 436–444 (2015).

Moravčík, M. et al. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science 356 , 508–513 (2017).

Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27 (NIPS 2014) 3104–3112 (Curran Associates, Inc., 2014).

Bahdanau, D., Cho, K. & Bengio, Y. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (2014).

Luong, M.-T. & Manning, C. D. Stanford neural machine translation systems for spoken language domains. In Proceedings of The International Workshop on Spoken Language Translation (IWSLT) (2015).

Junczys-Dowmunt, M., Dwojak, T. & Hoang, H. Is neural machine translation ready for deployment? A case study on 30 translation directions. In Proceedings of the Ninth International Workshop on Spoken Language Translation (IWSLT) (2016).

Hutchins, W. J. & Somers, H. L. An introduction to machine translation . (Academic Press, 1992).

Brown, P. F., Della Pietra, S. A., Della Pietra, V. J. & Mercer, R. L. The mathematics of statistical machine translation. Comput. Linguist 19 , 263–311 (1993).

Koehn, P., Och, F. J. & Marcu, D. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - NAACL ’03 1, 48–54 (2003).

Wu, Y. et al. Google’s neural machine translation system: bridging the gap between human and machine translation. Preprint at http://arxiv.org/abs/1609.08144 (2016).

Hassan, H. et al. Achieving human parity on automatic Chinese to English news translation. Preprint at http://arxiv.org/abs/1803.05567 (2018).

Bojar, O. et al. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation (WMT) 2, 272–307 (2018).

Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems (Curran Associates, Inc., 2017).

Sennrich, R., Haddow, B. & Birch, A. Neural machine translation of rare words with subword units. 54th Annu. Meet. Assoc. Comput. Linguist . https://doi.org/10.18653/v1/P16-1162 (2015).

Bojar, O. et al. CzEng 1.6: Enlarged Czech-English parallel corpus with processing tools dockered. In Text, Speech, and Dialogue: 19th International Conference, TSD 2016 231–238 (2016).

Tiedemann, J. OPUS – parallel corpora for everyone. In Proceedings of the 19th Annual Conference of the European Association for Machine Translation (EAMT) 384 (2016).

Toral, A., Castilho, S., Hu, K. & Way, A. Attaining the unattainable? Reassessing claims of human parity in neural machine translation. In Proceedings of the Third Conference on Machine Translation (WMT) 113–123 (2018).

Läubli, S., Sennrich, R. & Volk, M. Has machine translation achieved human parity? A case for document-level evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP) 4791–4796 (2018).

Castilho, S. et al. Is neural machine translation the new state of the art? Prague Bull. Math. Linguist 108 , 109–120 (2017).

Haddow, B. et al. The University of Edinburgh’s submissions to the WMT18 news translation Task. In Proc. Third Conference on Machine Translation 2, 403–413 (2018).

Barrault, L. et al. Findings of the 2019 conference on machine translation (WMT19). In Proc. Fourth Conference on Machine Translation (WMT) 1–61 (2019).

Bojar, O. et al. The joy of parallelism with CzEng 1.0. In Proc. Eighth International Language Resources and Evaluation Conference (LREC’12) 3921–3928 (2012).

Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Proc. 40th Annual Meeting on Association for Computational Linguistics—ACL ’02 311–318 (Association for Computational Linguistics, 2002).

Post, M. A Call for clarity in reporting BLEU scores. In Proc. Third Conference on Machine Translation (WMT) 186–191 (2018).

Popel, M. & Bojar, O. Training tps for the transformer model. Prague Bull. Math. Linguist . https://doi.org/10.2478/pralin-2018-0002 (2018).

Popel, M. CUNI transformer neural MT system for WMT18. In Proc. Third Conference on Machine Translation (WMT) 482–487 (Association for Computational Linguistics, 2019).

Shazeer, N. & Stern, M. Adafactor: Adaptive learning rates with sublinear memory cost. In Proc. 35th International Conference on Machine Learning, ICML 2018 4603–4611 (2018).

Sennrich, R. et al. The University of Edinburgh’s neural MT systems for WMT17. In Proc. Second Conference on Machine Translation (WMT) 2 , 389–399 (2017).

Junczys-Dowmunt, M., Dwojak, T. & Sennrich, R. The AMU-UEDIN submission to the WMT16 news translation task: attention-based NMT models as feature functions in phrase-based SMT. In Proc. First Conference on Machine Translation (WMT) 319–325 (2016).

Sennrich, R., Haddow, B. & Birch, A. Edinburgh neural machine translation systems for WMT 16. In Proc. First Conference on Machine Translation (WMT) 371–376 (2016).

Gellerstam, M. Translationese in Swedish novels translated from English. In Translation Studies in Scandinavia: Proceedings from the Scandinavian Symposium on Translation Theory (SSOTT) 88–95 (1986).

Kittur, A., Chi, E. H. & Suh, B. Crowdsourcing user studies with Mechanical Turk. In Proc. 26th Annual CHI Conference on Human Factors in Computing Systems (CHI ’08) 453–456 (ACM Press, 2008).

Bojar, O. et al. Findings of the 2016 conference on machine translation. In Proc. First Conference on Machine Translation: Volume 2, Shared Task Papers 131–198 (2016).

Groppe, D. fdr_bh MATLAB central file exchange. https://www.mathworks.com/matlabcentral/fileexchange/27418-fdr_bh . (2020).

Acknowledgements

We thank the volunteers who participated in the Translation Turing test, Jack Toner for consultation of written English, and the WMT 2018 organizers for providing us with the data for the re-evaluation of translation quality. This work has been partially supported by the grants 645452 (QT21) of the European Commission, GX19-26934X (NEUREM3) and GX20-16819X (LUSyD) of the Grant Agency of the Czech Republic. The work has been using language resources developed and distributed by the LINDAT/CLARIAH-CZ project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2018101).

Author information

These authors contributed equally: Martin Popel, Marketa Tomkova, Jakub Tomek.

Authors and Affiliations

Faculty of Mathematics and Physics, Charles University, Prague, 121 16, Czech Republic

Martin Popel, Ondřej Bojar & Zdeněk Žabokrtský

Ludwig Cancer Research Oxford, University of Oxford, Oxford, OX1 2JD, UK

Marketa Tomkova

Department of Computer Science, University of Oxford, Oxford, OX1 3QD, UK

Jakub Tomek

Google Brain, Mountain View, California, CA, 94043, USA

Łukasz Kaiser & Jakob Uszkoreit

Contributions

M.P. initiated the project. L.K. and J.U. designed and implemented the Transformer model. M.P. designed and implemented training of the translation system. J.T., M.T., and M.P. with contributions from O.B. and Z.Ž. designed the evaluation. M.T., J.T., and M.P. conducted the evaluation. M.T. and J.T. analyzed the results. M.T., J.T., and M.P. wrote the initial draft; all other authors critically reviewed and edited the manuscript.

Corresponding author

Correspondence to Martin Popel .

Ethics declarations

Competing interests

J.U. and L.K. are employed by and hold equity in Google, which funded the development of Transformer. The remaining authors (M.P., M.T., J.T., O.B., Z.Ž.) declare no competing interests.

Additional information

Peer review information Nature Communications thanks Alexandra Birch and Marcin Junczys-Dowmunt for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

  • Supplementary Information
  • Peer Review File
  • Description of Additional Supplementary Files
  • Supplementary Data 1
  • Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article

Popel, M., Tomkova, M., Tomek, J. et al. Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals. Nat Commun 11 , 4381 (2020). https://doi.org/10.1038/s41467-020-18073-9

Received: 16 September 2019

Accepted: 24 July 2020

Published: 01 September 2020

DOI: https://doi.org/10.1038/s41467-020-18073-9

Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation

Action Editor: Alexandra Birch

Markus Freitag , George Foster , David Grangier , Viresh Ratnakar , Qijun Tan , Wolfgang Macherey; Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation. Transactions of the Association for Computational Linguistics 2021; 9 1460–1474. doi: https://doi.org/10.1162/tacl_a_00437

Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been considerable research on human evaluation, the field still lacks a commonly accepted standard procedure. As a step toward this goal, we propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework. We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with access to full document context. We analyze the resulting data extensively, finding among other results a substantially different ranking of evaluated systems from the one established by the WMT crowd workers, exhibiting a clear preference for human over machine output. Surprisingly, we also find that automatic metrics based on pre-trained embeddings can outperform human crowd workers. We make our corpus publicly available for further research.

Like many natural language generation tasks, machine translation (MT) is difficult to evaluate because the set of correct answers for each input is large and usually unknown. This limits the accuracy of automatic metrics, and necessitates costly human evaluation to provide a reliable gold standard for measuring MT quality and progress. Yet even human evaluation is problematic. For instance, we often wish to decide which of two translations is better, and by how much, but what should this take into account? If one translation sounds somewhat more natural than another, but contains a slight inaccuracy, what is the best way to quantify this? To what extent will different raters agree on their assessments?

The complexities of evaluating translations—both machine and human—have been extensively studied, and there are many recommended best practices. However, due to expedience, human evaluation of MT is frequently carried out on isolated sentences by inexperienced raters with the aim of assigning a single score or ranking. When MT quality is poor, this can provide a useful signal; but as quality improves, there is a risk that the signal will become lost in rater noise or bias. Recent papers have argued that poor human evaluation practices have led to misleading results, including erroneous claims that MT has achieved human parity (Toral, 2020; Läubli et al., 2018).

Our key insight in this paper is that any scoring or ranking of translations is implicitly based on an identification of errors and other imperfections. Asking raters for a single score forces them to synthesize this complex information, and can lead to rushed judgments based on partial analyses. Furthermore, the implicit weights assigned by raters to different types of errors may not match their importance in the current application. An explicit error listing contains all necessary information for judging translation quality, and can thus be seen as a “platinum standard” for other human evaluation methodologies. This insight is not new: It is the conceptual basis for the Multidimensional Quality Metrics (MQM) framework developed in the EU QTLaunchPad and QT21 projects ( www.qt21.eu ), which we endorse and adopt for our experiments. MQM involves explicit error annotation, deriving scores from weights assigned to different errors, and returning an error distribution as additional valuable information.

MQM is a generic framework that provides a hierarchy of translation errors that can be tailored to specific applications. We identified a hierarchy appropriate for broad-coverage MT, and annotated outputs from 10 top-performing “systems” (including human references) for both the English→German (EnDe) and Chinese→English (ZhEn) language directions in the WMT 2020 news translation task (Barrault et al., 2020 ), using professional translators with access to full document context. For comparison purposes, we also collected scalar ratings on a 7-point scale from both professionals and crowd workers.

Our main contributions are:

  • A proposal for a standard MQM scoring scheme appropriate for broad-coverage MT.

  • Release of a large-scale human evaluation corpus for two methodologies (MQM and pSQM) with annotations for over 100k HT and high-quality-MT segments in two language pairs (EnDe and ZhEn) from WMT 2020. This is by far the largest study of human evaluation results released to the public.

  • Re-evaluation of the performance of MT systems and automatic metrics on our corpus, showing clear distinctions between HT and MT based on MQM ratings, adding to the evidence against claims of human parity.

  • Showing that crowd-worker evaluations have low correlation with MQM-based evaluations, calling into question conclusions drawn on the basis of such evaluations.

  • Demonstration that automatic metrics based on pre-trained embeddings can outperform human crowd workers.

  • Characterization of current error types in HT and MT, identifying specific MT weaknesses.

The ALPAC report ( 1966 ) defined an evaluation methodology for MT based on “intelligibility” (comprehensibility) and “fidelity” (adequacy). The ARPA MT Initiative (White et al., 1994 ) defined an overall quality score based on “adequacy”, “fluency”, and “comprehension”. The first WMT evaluation campaign (Koehn and Monz, 2006 ) used adequacy and fluency ratings on a 5-point scale acquired from participants as their main metric. Vilar et al. ( 2007 ) proposed a ranking-based evaluation approach, which became the official metric at WMT from 2008 until 2016 (Callison-Burch et al., 2008 ). The ratings were still acquired from the participants of the evaluation campaign. Graham et al. ( 2013 ) compared human assessor consistency levels for judgments collected on a five-point interval-level scale to those collected on a 1–100 continuous scale, using machine translation fluency as a test case. They claim that the use of a continuous scale eliminates individual judge preferences, resulting in higher levels of inter-annotator consistency. Bojar et al. ( 2016 ) came to the conclusion that fluency evaluation is highly correlated to adequacy evaluation. As a consequence of the latter two papers, continuous direct assessment focusing on adequacy has been the official WMT metric since 2017 (Bojar et al., 2017 ). Due to budget constraints, WMT understandably conducts its human evaluation mostly with researchers and/or crowd workers.

Avramidis et al. ( 2012 ) used professional translators to rate MT output on three different tasks: ranking, error classification, and post-editing. Castilho et al. ( 2017 ) found that crowd workers lack knowledge of translation and, compared with professional translators, tend to be more accepting of (subtle) translation errors. Graham et al. ( 2017 ) showed that crowd-worker evaluation has to be filtered to avoid contamination of results through the inclusion of false assessments. The quality of ratings acquired by either researchers or crowd workers has further been questioned by Toral et al. ( 2018 ) and Läubli et al. ( 2020 ). Mathur et al. ( 2020 ) re-evaluated a subset of WMT submissions with professional translators and showed that the resulting rankings changed and were better aligned with automatic scores. Fischer and Läubli ( 2020 ) found that the number of segments with wrong terminology, omissions, and typographical problems for MT output is similar to that for HT. Fomicheva ( 2017 ) and Bentivogli et al. ( 2018 ) raised the concern that reference-based human evaluation might penalize correct translations that diverge too much from the reference. The literature mostly agrees that source-based rather than reference-based evaluation should be conducted (Läubli et al., 2020 ). The impact of translationese (Koppel and Ordan, 2011 ) on human evaluation of MT has recently received attention (Toral et al., 2018 ; Zhang and Toral, 2019 ; Freitag et al., 2019 ; Graham et al., 2020 ). These papers show that only natural source sentences should be used for human evaluation.

As alternatives to adequacy and fluency, Scarton and Specia ( 2016 ) presented reading comprehension for MT quality evaluation. Forcada et al. ( 2018 ) proposed gap-filling, where certain words are removed from reference translations and readers are asked to fill the gaps left using the machine-translated text as a hint. Popović ( 2020 ) proposed to ask annotators to just label problematic parts of the translations instead of assigning a score.

The Multidimensional Quality Metrics (MQM) framework was developed in the EU QTLaunchPad and QT21 projects (2012–2018) ( http://www.qt21.eu ) to address the shortcomings of previous quality evaluation methods (Lommel et al., 2014 ). MQM provides a generic methodology for assessing translation quality that can be adapted to a wide range of evaluation needs. Klubička et al. ( 2018 ) designed an MQM-compliant error taxonomy for Slavic languages to run a case study for 3 MT systems for English→Croatian. Rei et al. ( 2020 ) used MQM labels to fine-tune COMET for automatic evaluation. Thomson and Reiter ( 2020 ) designed an error annotation schema based on pre-defined error categories for table-to-text tasks.

We compared three human evaluation techniques: the WMT 2020 baseline; ratings on a 7-point Likert-type scale which we refer to as a Scalar Quality Metric (SQM); and evaluations under the MQM framework. We describe these methodologies in the following three sections, deferring concrete experimental details about annotators and data to the subsequent section.

As part of the WMT evaluation campaign (Barrault et al., 2020 ), WMT runs human evaluation of the primary submissions for each language pair. The organizers collect segment-level ratings with document context (SR+DC) on a 0–100 scale using either source-based evaluation with a mix of researchers/translators (for translations out of English) or reference-based evaluation with crowd workers (for translations into English). In addition, WMT conducts rater quality controls to remove ratings from raters that are not trustworthy. In general, for each system, only a subset of documents receive ratings, with the rated subset differing across systems. The organizers provide two different segment-level scores, averaged across one or more raters: (a) the raw score; and (b) a z-score which is standardized for each annotator. Document- and system-level scores are averages over segment-level scores. For more details, we refer the reader to the WMT findings papers.

Similar to the WMT setting, the Scalar Quality Metric (SQM) evaluation collects segment-level scalar ratings with document context. This evaluation presents each source segment and translated segment from a document in a table row, asking the rater to pick a rating from 0 through 6. The rater can scroll up or down to see all the other source/translation segments from the document. Our SQM experiments used the 0–6 rating scale described above, instead of the wider, continuous scale recommended by Graham et al. ( 2013 ), as this scale has been an established part of our existing MT evaluation ecosystem. It is possible that system rankings may be slightly sensitive to this nuance, but less so with raters who are translators rather than crowd workers, we believe.

To adapt the generic MQM framework for our context, we followed the official guidelines for scientific research ( http://qt21.eu/downloads/MQM-usage-guidelines.pdf ). Our annotators were instructed to identify all errors within each segment in a document, paying particular attention to document context; see Table 1 for complete annotator guidelines. Each error was highlighted in the text, and labeled with an error category from Table 2 , and a severity. To temper the effect of long segments, we imposed a maximum of five errors per segment, instructing raters to choose the five most severe errors for segments containing more errors. Segments that are too badly garbled to permit reliable identification of individual errors are assigned a special Non-translation error.

MQM annotator guidelines.

MQM hierarchy.

Error severities are assigned independent of category, and consist of Major , Minor , and Neutral levels, corresponding, respectively, to actual translation or grammatical errors, smaller imperfections, and purely subjective opinions about the translation. Many MQM schemes include an additional Critical severity which is worse than Major, but we dropped this because its definition is often context-specific. We felt that for broad coverage MT, the distinction between Major and Critical was likely to be highly subjective, while Major errors (true errors) would be easier to distinguish from Minor ones (imperfections).

Since we are ultimately interested in scoring segments, we require a weighting on error types. We fixed the weight on Minor errors at 1, and considered a range of Major weights from 1 to 10 (the Major weight suggested in the MQM standard). We also considered special weighting for Minor Fluency/Punctuation errors. These occur frequently and often involve non-linguistic phenomena such as the spacing around punctuation or the style of quotation marks. For example, in German, the opening quotation mark is below rather than above and some MT systems systematically use the wrong quotation marks. Since such errors are easy to correct algorithmically and do not affect the understanding of the sentence, we wanted to ensure that their role would be to distinguish among systems that are equivalent in other respects. Major Fluency/Punctuation errors that make a text ungrammatical or change its meaning (e.g., eliding the comma in Let’s eat, grandma ) are unaffected by this and have the same weight as other Major errors. Finally, to ensure a well-defined maximum score, we set the weight on the singleton Non-Translation category to be the same as five Major errors (the maximum number permitted).

For each weight combination subject to the above constraints, we examined the stability of system ranking using a resampling technique: Draw 10k alternative test sets by sampling segments with replacement, and count the proportion of resulting system rankings that match the ranking obtained from the full original test set. Table 3 shows representative results. We found that a Major, Minor, Fluency/Punctuation assignment of 5, 1, 0.1 gave the best combined stability across both language pairs while additionally matching the system-level SQM rankings from professional translators ( = pSQM column in the table). Table 4 summarizes this weighting scheme, in which segment-level scores can range from 0 (perfect) to 25 (worst). The final segment-level score is an average over scores from all annotators.

MQM ranking stability for different weights.

MQM error weighting.
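Under the chosen weighting, a per-rater segment score can be computed as in the sketch below (our reading of the scheme: Major = 5, Minor = 1, Minor Fluency/Punctuation = 0.1, Non-translation = 25, at most five errors per segment; Neutral errors are assumed to carry no weight). The final segment-level score is then the average of this value over all annotators.

```python
def mqm_segment_score(errors):
    """errors: list of (severity, category) tuples for one segment and one rater."""
    if any(category == "Non-translation" for _, category in errors):
        return 25.0  # equivalent to five Major errors, the defined maximum
    score = 0.0
    for severity, category in errors[:5]:  # at most five errors are annotated per segment
        if severity == "Major":
            score += 5.0
        elif severity == "Minor" and category == "Fluency/Punctuation":
            score += 0.1
        elif severity == "Minor":
            score += 1.0
        # Neutral errors: assumed weight 0
    return score
```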

3.4 Experimental Setup

We annotated the WMT 2020 English→German and Chinese→English test sets, comprising 1418 segments (130 documents) and 2000 segments (155 documents), respectively. For each set we chose 10 “systems” for annotation, including the three reference translations available for English→German and the two references available for Chinese→English. The MT outputs included all top-performing systems according to the WMT human evaluation, augmented with systems we selected to increase diversity. Table 6 lists all evaluated systems.

Table 5 summarizes rating information for the WMT evaluation and for our additional evaluations: SQM with crowd workers (cSQM), SQM with professional translators (pSQM), and MQM. We used disjoint professional translator pools for pSQM and MQM in order to avoid bias. All members of our rater pools were native speakers of the target language. Note that the average number of ratings per segment is less than 1 for the WMT evaluations because not all ratings surpassed the quality control implemented by WMT. For cSQM, we assess the quality of the raters based on a proficiency test prior to launching a human evaluation. This results in a rater pool similar in quality to WMT, while ensuring three ratings for each document. Interestingly, the expense for cSQM and pSQM ratings was similar. MQM was 3 times more expensive than both SQM evaluations.

To ensure maximum diversity in ratings for pSQM and MQM, we assigned documents in round-robin fashion to all 20 different sets of 3 raters from these pools. We chose an assignment order that roughly balanced the number of documents and segments per rater. Each rater was assigned a subset of documents, and annotated outputs from all 10 systems for those documents. Both documents and systems were anonymized and presented in a different random order to each rater. The number of segments per rater ranged from 6,830–7,220 for English→German and from 9,860–10,210 for Chinese→English.
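The round-robin assignment can be sketched as follows, assuming a pool of six raters (so that the 3-rater subsets number exactly 20); the simple modulo scheme below is an approximation of the more carefully balanced order used in the study.

```python
from itertools import combinations


def assign_documents(documents, raters):
    """Assign each document to one of the 3-rater subsets in round-robin order."""
    triples = list(combinations(raters, 3))  # 20 triples for a 6-rater pool
    return {doc: triples[i % len(triples)] for i, doc in enumerate(documents)}
```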

Details of all human evaluations.

4.1 Overall System Rankings

For each human evaluation setup, we calculate a system-level score by averaging the segment-level scores for each system. Results are summarized in Table  6 . The system- and segment-level correlations with our platinum MQM ratings are shown in Figures 1 and 2 (English→German), and Figures 3 and 4 (Chinese→English). Segment-level correlations are calculated only for segments that were evaluated by WMT. For both language pairs, we observe similar patterns when looking at the results of the different human evaluations, and come to the following findings:

Human evaluations for 10 submissions of the WMT20 evaluation campaign. Horizontal lines separate clusters in which no system is significantly outperformed by another in MQM rating according to the Wilcoxon rank-sum test used to assess system rankings in WMT20.

English→German: System correlation with the platinum ratings acquired with MQM.

English→German: Segment-level correlation with the platinum ratings acquired with MQM.

Chinese→English: System-level correlation with the platinum ratings acquired with MQM.

Chinese→English: Segment-level correlation with the platinum ratings acquired with MQM.

(i) Human Translations Are Underestimated by Crowd Workers:

Hassan et al. ( 2018 ) claimed human parity for news translation for Chinese→English. We confirm the findings of Toral et al. ( 2018 ) and Läubli et al. ( 2018 ) that, when human evaluation is conducted correctly, professional translators can discriminate between human and machine translations. All human translations are ranked first by both the pSQM and MQM evaluations for both language pairs. The gap between human translations and MT is even more visible when looking at the MQM ratings, which set the human translations first by a statistically significant margin, demonstrating that the quality difference between MT and human translation is still large. 3 Another interesting observation is the ranking of Human-P for English→German. Human-P is a reference translation generated using the paraphrasing method of Freitag et al. ( 2020 ), in which linguists were asked to paraphrase existing reference translations as much as possible, also using synonyms and different sentence structures. Our results support the assumption that crowd workers are biased to prefer literal, easy-to-rate translations and therefore rank Human-P low. Professional translators, on the other hand, are able to see the correctness of the paraphrased translations and ranked them higher than any MT output. Similarly to the standard human translations, the gap between Human-P and the MT systems is larger when looking at the MQM ratings. In MQM, raters have to justify their ratings by labeling the error spans, which helps to avoid penalizing non-literal translations.

(ii) WMT Has Low Correlation with MQM:

The human evaluation in WMT was conducted by crowd workers (Chinese→English) or a mix of researchers/translators (English→German) during the WMT evaluation campaign. Further, differently from all other evaluations in this paper, WMT conducted a reference-based/monolingual human evaluation for Chinese→English, in which the machine translation output was compared to a human-generated reference. When comparing the system ranks based on WMT for both language pairs with the ones generated by MQM, we see low correlation for English→German (see Figure  1 ) and even negative correlation for Chinese→English (see Figure  3 ). We also see very low segment-level correlation for both language pairs (see Figure  2 and Figure  4 ). Later, we will also show that SOTA automatic metrics correlate with MQM more strongly than the human ratings generated by WMT do. These results call into question the reliability of the human ratings acquired by WMT.

(iii) pSQM Has High System-Level Correlation with MQM:

The results for both language pairs suggest that pSQM and MQM are of similar quality, as their system rankings mostly agree. Nevertheless, when zooming into the segment-level correlations, we observe a much lower correlation of ∼0.5 based on Kendall tau for both language pairs. The difference between the two approaches is also visible in the absolute differences between individual systems. For instance, the submissions of DiDi_NLP and Tencent_Translation for Chinese→English are close in pSQM (only 0.04 absolute difference). MQM, on the other hand, shows a larger difference of 0.19 points. When the quality of two systems is close, a more fine-grained evaluation scheme like MQM is needed. This is also important in system development, where the difference between two variants of a system can be minor. Looking ahead, as MT gets closer to human translation quality, MQM will be needed for reliable evaluation. On the other hand, pSQM seems to be sufficient for an evaluation campaign like WMT.

(iv) MQM Results Are Mainly Driven by Major and Accuracy Errors:

In Table 6, we also show the MQM error scores computed only from major or minor errors, and only from fluency or accuracy errors. Interestingly, the MQM score based on accuracy errors, or based on major errors alone, yields almost the same ranking as the full MQM score. Later in the paper, we will see that the majority of major errors are accuracy errors. This suggests that the quality of an MT system is still driven mostly by accuracy errors, as most fluency errors are judged minor.
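To make the connection between error annotations and scores concrete, here is a minimal sketch of how a segment-level MQM score could be computed. The severity weights (5 for major errors, 1 for minor errors, a flat 25 for non-translation segments) and the `mqm_segment_score` helper are illustrative assumptions in the spirit of common MQM practice, not the study's exact implementation.

```python
# Hypothetical severity weights; the study's exact weighting may differ
# (some schemes, for instance, give minor punctuation errors a smaller weight).
MAJOR_WEIGHT = 5.0
MINOR_WEIGHT = 1.0
NON_TRANSLATION_PENALTY = 25.0

def mqm_segment_score(errors):
    """Compute a segment-level MQM penalty from (category, severity) annotations."""
    # A segment flagged as non-translation receives the maximum penalty outright.
    if any(category == "non-translation" for category, _ in errors):
        return NON_TRANSLATION_PENALTY
    return sum(MAJOR_WEIGHT if severity == "major" else MINOR_WEIGHT
               for _, severity in errors)

# Example: one major accuracy error plus one minor fluency error -> 6.0
print(mqm_segment_score([("accuracy/mistranslation", "major"),
                         ("fluency/grammar", "minor")]))
```

Under such a scheme, system-level scores are averages of these segment penalties, with lower values indicating better translations, which is why accuracy and major errors dominate the final ranking.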

4.2 Error Category Distribution

MQM provides fine-grained error categories grouped under 4 main categories (accuracy, fluency, terminology, and style). The error distribution for all 3 ratings of all 10 systems is shown in Table 7. The error category Accuracy/Mistranslation is responsible for the majority of major errors for both language pairs. This suggests that the main problem of MT is still the mistranslation of words or phrases. The absolute number of errors is much higher for Chinese→English, which indicates that this language pair is more challenging than English→German.

Category breakdown of MQM scores for human translations (A, B), machine translations (all systems), and some of the best systems. The ratio of system to human scores is shown in italics. Errors (%) reports the fraction of the total error count in each category; Major (%) reports the fraction of major errors in each category.

Table 7 decomposes system and human MQM scores per category for English→German. Human translations obtain lower error counts in all categories except additions. Human translators may add tokens for fluency or better understanding that are not strictly supported by the aligned source sentence but are accurate in the given context. This observation needs further investigation and could potentially be an argument for relaxing the source–target alignment during human evaluation. Both systems and humans are mostly penalized for accuracy/mistranslation errors, but systems record 4x more error points in these categories. Similarly, sentences with more than 5 major errors (non-translation) are much more frequent for systems (∼28× the human rate). The best systems are quite different across categories: Tohoku is average in fluency but outstanding in accuracy, eTranslation is excellent in fluency but weaker in accuracy, and OPPO ranks between the two other systems in both aspects. Compared to humans, the best systems are mostly penalized for mistranslations and non-translation (badly garbled sentences).

Table 7 also shows that the Chinese→English translation task is more difficult than English→German, with higher MQM error scores for human translations. Again, humans perform better than systems across all categories except additions, omissions, and spelling. Many spelling mistakes relate to name formatting and capitalization, which is difficult for this language pair (see name formatting errors). Mistranslation and name formatting are the categories where the systems are penalized the most compared to humans. When comparing systems, the differences between the best systems are less pronounced than for English→German, both in terms of aggregate score and per-category counts.

4.3 Document-error Distribution

We calculate document-level scores by averaging the segment-level scores of each document. Figure 5 shows the average document scores of all MT systems and all human translations for English→German. The translation quality of the human translations is very consistent across documents, with an MQM score of around 1, which is equivalent to one minor error. This shows that human translation quality is largely independent of the underlying source document. The distribution of MQM errors for machine translations looks quite different: for some documents, MT gets very close to human performance, while for others the gap is clearly visible. Interestingly, all MT systems have similar problems with the same subset of documents, suggesting that the quality of MT output depends on the actual input document rather than solely on the underlying MT system.
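The aggregation itself is straightforward; the sketch below (with invented document IDs and scores) shows the per-document averaging described above.

```python
# Average segment-level MQM scores per document; the data here is invented.
from collections import defaultdict
from statistics import mean

def document_scores(segment_scores):
    """segment_scores: iterable of (doc_id, segment_mqm_score) pairs."""
    per_doc = defaultdict(list)
    for doc_id, score in segment_scores:
        per_doc[doc_id].append(score)
    return {doc_id: mean(scores) for doc_id, scores in per_doc.items()}

print(document_scores([("doc1", 0.0), ("doc1", 2.0), ("doc2", 5.0)]))
# -> {'doc1': 1.0, 'doc2': 5.0}
```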

EnDe: Document-level MQM scores.

The MQM document-level scores for Chinese→English are shown in Figure 6. The distribution of MQM errors for the MT output looks very similar to that for English→German: some documents are more challenging for some MT systems than others. Although the document-level scores are mostly lower for the human translations, their distribution looks similar to those of the MT systems. We first suspected that the reference translations had been post-edited from MT output. This is not the case: these translations originate from professional translators who did not have access to post-editing, although they did use CAT tools (Memsource and translation memory). Another possible explanation is the nature of the source sentences: most come from Chinese government news pages, whose formal style may be difficult to render in English.

ZhEn: Document-level MQM scores.

4.4 Annotator Agreement and Reliability

Our annotations were performed by professional raters with MQM training. All raters were given roughly the same amount of work, with the same number of segments from each system. This setup should result in similar aggregated rater scores.

Table 8(a) reports the scores per rater, aggregated over the main error categories, for English→German. All raters provide scores within ±20% of the mean, with rater 3 being the most severe and rater 1 the most permissive. Looking at individual ratings, rater 2 reported fewer errors in the accuracy categories but used the Style/Awkward category more for errors outside fluency and accuracy; conversely, rater 6 barely used this category. Differences in error rates among raters are not severe but could be reduced with corrections from annotation models (Paun et al., 2018), especially when working with larger annotator pools. The rater comparison for Chinese→English in Table 8(b) shows a wider range of scores than for English→German: all raters provide scores within ±30% of the mean. This difference might be due to the greater difficulty of the translation task itself, which introduces more ambiguity into the labeling. In the future, it would be interesting to examine whether translation between languages of different families suffers from larger annotator disagreement in MQM ratings.

MQM per rater and category. The ratio of each rater's score to the average score is shown in italics.

In addition to characterizing individual rater performance relative to the mean, we also directly measured pairwise agreement. It is not obvious how best to do this, since MQM annotations are variable-length lists of two-dimensional items (category and severity). Klubička et al. (2018) use binary agreements over all possible categories for each segment but do not consider severity. To reflect our weighting scheme and to enable direct comparison to pSQM scores, we grouped MQM scores from each rater into seven bins with right boundaries 0, 5, 10, 15, 20, 24.99, and 25, 4 and measured agreement among the bins. Table 9 shows average, minimum, and maximum pairwise rater agreements for MQM and pSQM ratings. The agreements for MQM are significantly better than the corresponding agreements for pSQM across both language pairs. Basing scores on explicit error annotations seems to provide a measurable boost in rater reliability.
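As an illustration of the binning step, the sketch below maps MQM scores into the seven bins with the right boundaries given above and computes a simple exact-match pairwise agreement rate. The example ratings are invented, and the exact agreement statistic used in the study may differ.

```python
import bisect

# Right bin boundaries as described in the text.
BIN_RIGHT_BOUNDARIES = [0, 5, 10, 15, 20, 24.99, 25]

def to_bin(mqm_score):
    """Index of the first bin whose right boundary covers the score."""
    return bisect.bisect_left(BIN_RIGHT_BOUNDARIES, mqm_score)

def pairwise_agreement(ratings_a, ratings_b):
    """Fraction of segments on which two raters fall into the same bin."""
    matches = sum(to_bin(a) == to_bin(b) for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)

# Invented ratings for two raters over four segments.
print(pairwise_agreement([0, 3, 12, 25], [0, 4, 22, 25]))  # -> 0.75
```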

Pairwise inter-rater agreement.

4.5 Impact on Automatic Evaluation

We compared the performance of automatic metrics submitted to the WMT20 Metrics Task when gold scores came from the original WMT ratings with their performance when gold scores were derived from our MQM ratings. Figure 7 shows Kendall's tau correlation for selected metrics at the system level. 5 As would be expected from the low correlation between MQM and WMT scores, the ranking of metrics changes completely under MQM. In general, metrics that are not solely based on surface characteristics do somewhat better, though this pattern is not consistent (for example, chrF (Popović, 2015) has a high correlation of 0.8 for EnDe). Metrics tend to correlate better with MQM than they do with WMT, and almost all achieve better MQM correlation than WMT does (horizontal dotted line).

System-level metric performance with MQM and WMT scoring for: (a) EnDe, top panel; and (b) ZhEn, bottom panel. The horizontal blue line indicates the correlation between MQM and WMT human scores.

Table 10 shows average correlations with WMT and MQM gold scores at different granularities. At the system level, correlations are higher for MQM than for WMT, and for EnDe than for ZhEn. Correlations with MQM are quite good, though on average they are statistically significant only for EnDe. Interestingly, the average performance of the baseline metrics is similar to the global average over all metrics in all conditions except ZhEn WMT, where it is substantially better. Adding human translations to the outputs scored by the metrics results in a large drop in performance, especially for MQM, because human outputs are rated unambiguously higher than MT by MQM. Segment-level correlations are generally much lower than system-level ones, though they are significant due to greater support. MQM correlations are again higher than WMT correlations at this granularity, and they are higher for ZhEn than for EnDe, reversing the pattern from the system-level results and suggesting potential for improved system-level metric performance through better aggregation of segment-level scores.
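For readers who want to reproduce this kind of system-level comparison, the sketch below computes Kendall's tau between an automatic metric and negated MQM scores (so that higher always means better, as in the table). The system names and numbers are invented for illustration.

```python
from scipy.stats import kendalltau

# Invented system-level scores: a metric (higher = better) and MQM (lower = better).
metric_scores = {"sysA": 0.62, "sysB": 0.58, "sysC": 0.55, "sysD": 0.51}
mqm_scores = {"sysA": 2.1, "sysB": 2.6, "sysC": 2.4, "sysD": 3.0}

systems = sorted(metric_scores)
tau, p_value = kendalltau([metric_scores[s] for s in systems],
                          [-mqm_scores[s] for s in systems])
print(f"Kendall tau = {tau:.2f} (p = {p_value:.2f})")
```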

Average correlations for metrics at different granularities (using negative MQM scores to obtain positive correlations). The baselines-only results average over BLEU, sentBLEU, TER, chrF, and chrF++; all other results average over all metrics available for the given condition. The +human results include reference translations among the outputs to be scored. Numbers in italics are average p-values from two-tailed tests, indicating the probability that the observed correlation was due to chance.

We proposed a standard MQM scoring scheme appropriate for broad-coverage, high-quality MT and used it to acquire ratings from professional translators for Chinese→English and English→German data from the recent WMT 2020 evaluation campaign. These ratings served as a platinum standard for various comparisons with simpler evaluation methodologies, including crowd-worker evaluations. We release all data acquired in our study to encourage further research into both human and automatic evaluation.

Our study shows that crowd-worker human evaluations (as conducted by WMT) have low correlation with MQM scores, resulting in substantially different system-level rankings. This finding casts doubt on previous conclusions drawn from crowd-worker human evaluation, especially for high-quality MT. We further show that many automatic metrics, and in particular embedding-based ones, already outperform crowd-worker human evaluation. Unlike ratings acquired from crowd workers, or from professional translators using simpler human evaluation methodologies, MQM labels acquired from professional translators show a large gap between the quality of human and machine-generated translations. This demonstrates that professionally generated human translations still outperform machine-generated translations. Furthermore, we characterize the current error types in human and machine translations, highlighting which error types are responsible for the difference between the two. We hope that researchers will use this as motivation to establish more error-type-specific research directions.

We would like to thank Isaac Caswell for first suggesting the use of MQM, Mengmeng Niu for helping run and babysit the experiments, Rebecca Knowles for help with WMT significance testing, Yvette Graham for helping reproduce some of the WMT experiments, and Macduff Hughes for giving us the opportunity to do this study. The authors would also like to thank the anonymous reviewers and the Action Editor of TACL for their constructive reviews.

https://github.com/google/wmt-mqm-human-evaluation .

https://github.com/google-research/google-research/tree/master/mqm_viewer .

In general, MQM ratings induce twice as many statistically significant differences between systems as do WMT ratings (Barrault et al., 2020 ), for both language pairs.

The pattern of document assignments to rater pairs (though not the identities of raters) is the same for our MQM and pSQM ratings, making agreement statistics comparable.

The official WMT system-level results use Pearson correlation, but since we are rating fewer systems (only 7 in the case of EnDe), Kendall is more meaningful; it also corresponds more directly to the main use case of system ranking.


Machine Translation Evaluation: The Ultimate Guide

Say you’re a business that has decided to invest in a machine translation system. You’ve done some basic research, and find that there are so many options to choose from. Each one claims to score a certain amount based on certain metrics, but you don’t know what the numbers really mean. How do you know which one is the best fit for you?

You need to understand how machine translation evaluation works.

This article will go in-depth on the topic of machine translation evaluation. It will help you understand what it is, why you need it, and the different types of evaluation, to help you make a well-informed decision when choosing an MT system to invest in.

Introduction: What is machine translation evaluation?

Machine translation evaluation refers to the different processes of measuring the performance of a machine translation system.

It’s a way of scoring the quality of MT so that it’s possible to know how good the system is, and there's a solid basis to compare how effective different MT systems are. To do this, machine translation evaluation makes use of quantifiable metrics.

Why are machine translation evaluation metrics important?

There are two main reasons why evaluating the performance of an MT system needs to be done. The first is to check if it’s good enough for real-world application. The second is to serve as a guide in research and development.

To check if it’s good enough for real-world application

First, of course, is to determine whether the MT system works at a level that is good enough for actual use. This is the reason that is of most direct relevance to end users. If the machine translation system performs poorly, users are more likely to choose something else.

Industrial sectors that use MT would also want concrete metrics for deciding what MT system to get. After all, MT is an investment, and businesses need to get the best value for their money.

As such, MT developers need to evaluate whether the machine translation system’s quality is good enough for them to send it out to clients.

To serve as a guide in research and development

MT systems are, ideally, not a static entity. The technology for MT is continually improving over time. It makes sense that MT systems should be expected to improve as well.

This is where research comes in, and researchers need to have some guide on where to look. Measurable metrics allow researchers to compare whether a particular approach is better than another, helping them to fine-tune the system.

This is especially good for seeing how the system deals with consistent translation errors. Having measurable metrics can show in a more controlled setting whether or not a particular approach is able to deal with these kinds of errors.

How do you evaluate the success of machine translation?

There are two different ways to determine how well an MT system performs. Human evaluation is done by human experts performing manual assessment, while automatic evaluation uses automated metrics developed specifically to assess translation quality without human intervention. Each has its own advantages and disadvantages. We'll go into further detail on both kinds of MT evaluation in later sections of this article, but first, here's a quick overview of the two types of machine translation evaluation, as well as the approaches toward MT evaluation that make use of them.

Human Evaluation vs Automatic Evaluation

Human evaluation of machine translation means that the assessment of translation quality is done by human professional translators. This is the most effective option when it comes to determining the quality of machine translations down to the level of sentences. But human evaluation, as with human translation, is by nature more costly and time consuming.

Automatic evaluation, on the other hand, uses programs built specifically to assess the quality of machine translation according to different methods. It’s not as reliable as human evaluation on the sentence level, but is a good scalable option when evaluating the overall quality of translation on multiple documents.

Approaches toward MT evaluation

The approaches toward machine translation evaluation are based on the concept of granularity. That is, the different levels at which the scoring might be considered significant.

Sentence-based approach. Under this approach, each sentence is given a score indicating whether its translation is good (1) or not good (0), and the scores are then averaged. This is most commonly done in human evaluation.

Document-based approach. Also known as the corpus-based approach: sentences are still given individual scores, but the score that matters is the total or average over a larger set of documents. This is the smallest level at which automated MT evaluation can be considered significant, as it depends heavily on statistics from a wide dataset.

Context-based approach. This approach differs from the previous ones as what it takes into account is how well the overall MT task suits the purposes to which it's put, rather than through average scores based on sentences. As such, it might be considered a holistic approach to MT evaluation.

Challenges in machine translation evaluation

Machine translation evaluation is a difficult process. This is because language itself is a very complex thing.

For one, there can be multiple correct translations. Take, for example, the following sentence:

The quick brown fox jumped over the lazy dog.

An MT system might generate the following translation instead:

The fast brown fox pounced over the indolent dog.

This is a technically correct translation, and in human evaluation it would normally be marked as such. But in automatic evaluation, the words that don't match the reference would be penalized, lowering the score.

Small details can also completely change a sentence’s meaning.

The quick brown fox jumped on the lazy dog.

Here, there’s only one word that has been changed. But that one word changes the meaning of the sentence completely. Automatic evaluations are likely to mark it higher than the previous example. Human translators are likely to catch the error, but some might consider it correct.

And that’s because language can be subjective. Even human evaluators can differ in their judgments on whether a translation is good or not.

Human evaluation: The gold standard

Now that we’ve gone over the basics, let’s take an in-depth look at the two types of MT evaluation, beginning with human evaluation.

At the most basic level, the goal of machine translation is to translate text from a source language into a target language on a level that humans can understand. As such, humans are the best point of reference for evaluating the quality of machine translation.

Types of human evaluation

There are a number of different ways that human evaluation is done, which we’ll go into now:

Direct Assessment

This is the most simple kind of human evaluation. Machine translation output is scored on the sentence level.

The challenge with direct assessment is that different judges will vary widely in the way that they score. Some may tend to go for the extremes in terms of scoring, marking translations as either very bad or very good. Others may play it more conservatively, marking the same sentences with scores closer to the middle.

Another challenge is, again, subjectivity. In judging whether a sentence is a bad translation or not, evaluators need to make decisions on language that is ambiguous. Going back to the example sentence:

The quick brown fox jumped over the lazy canine .

Here, canine isn’t necessarily wrong, but it isn’t the best fit either. Some evaluators may consider it good enough, while others might flag it as completely wrong. For example, if the scoring is done on a 5-point scale, some translators might mark it a 4, while another might give it only a 2.

These challenges can be offset by employing a larger pool of evaluators, which allows the scores to be statistically normalized.
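One common way to do this normalization, sketched below, is to convert each evaluator's raw scores into z-scores so that harsh and lenient raters become directly comparable. The rater names and raw scores are invented, and this is only one of several possible normalization schemes.

```python
from statistics import mean, stdev

def z_normalize(scores_by_rater):
    """scores_by_rater: dict mapping rater id -> list of raw scores."""
    normalized = {}
    for rater, scores in scores_by_rater.items():
        mu, sigma = mean(scores), stdev(scores)
        # Each score is expressed as the number of standard deviations from
        # that rater's own mean, removing per-rater harshness or leniency.
        normalized[rater] = [(s - mu) / sigma for s in scores]
    return normalized

raw = {"rater1": [90, 95, 85, 100], "rater2": [40, 60, 50, 55]}
print(z_normalize(raw))
```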

Ranking

Another way to assess machine translation systems through human evaluation is ranking.

In this case, evaluators don't provide individual scores for sentences, but instead compare translations from different MT systems. They then decide which one is the best translation, which is second best, and so on.

The advantage of this method over direct assessment is that it immediately provides a direct comparison, as opposed to comparing scores that have been generated over different trials and possibly by different evaluators.

However, it does still suffer from the challenge of subjectivity. Different MT systems are likely to come up with different errors. For example:

The quick green fox jumped over the lazy dog.

Quick brown fox jumped over lazy dog.

The quick brown fox jump over the lazy dog.

Each sentence has a simple error. The first one contains a mistranslation ("green" instead of "brown"). The second omits the articles. The third uses the wrong verb form.

Evaluators now need to decide which error is more important than the other, and again, evaluators may have different opinions on the matter.

Post-editing effort

If the user’s purpose for an MT system is to prepare documents for post-editing, there are also ways to evaluate it according to the amount of effort it takes to post-edit.

The fundamental purpose of post-editing is to allow a translator to work faster than if they were to translate a text from scratch. As such, the simplest way to assess an MT system for post-editing is by measuring the time it takes for the translator to correct the machine-translated output.

Another way to measure post-editing effort is to count the number of keystrokes it would take to turn the machine-translated text into a human reference translation. This is independent of time constraints, but it also does not take into account the possibility of multiple correct translations.
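As a rough illustration of the keystroke-style measure, the sketch below counts the character insertions and deletions needed to turn the machine output into a reference translation. Real post-editing tools log actual keystrokes, so this edit-distance count is only a proxy.

```python
import difflib

def char_edit_ops(mt_output, reference):
    """Count characters inserted plus deleted to turn the MT output into the reference."""
    ops = 0
    matcher = difflib.SequenceMatcher(None, mt_output, reference)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            ops += (i2 - i1) + (j2 - j1)  # deleted chars + inserted chars
    return ops

mt = "The fast brown fox pounced over the indolent dog."
ref = "The quick brown fox jumped over the lazy dog."
print(char_edit_ops(mt, ref))
```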

Task-based evaluation

Then there’s task-based evaluation which, as the name suggests, assesses an MT system based on how well it's suited to the task at hand. For example, if it's used in a multilingual webinar setting, participants could be asked to rate their experience with a machine-translated transcript. This means that they are rating the success of the MT system as a whole.

The problem with this approach is that it's very open to the introduction of other uncontrolled elements that may affect the rating evaluators give. As such, the use of task-based evaluation is very situational.

General challenges in human evaluation

As you might be able to see, the different types of human evaluation of MT come with their own challenges. There are also some challenges that they share broadly, and these have to do with consistency or agreement.

Inter-annotator agreement

This refers to the consistency of scores between different evaluators. As we mentioned earlier, different evaluators will have varying tendencies in the way they score the same segments of text. Some may score them at extremes or toward the middle. When ranking different MT engines, their opinions can also vary. This is why it’s important to have multiple evaluators, so that the distribution of scores will be normalized.

Intra-annotator agreement

The way a single evaluator scores a text is also a measure of validity. An evaluator might score a sentence as good or bad the first time around, but they might change their mind upon repeating the same test. Having a high measurement of intra-annotator agreement ensures that the chosen evaluator can be considered consistent and reliable.

Automatic evaluation: The scalable option

Human evaluation is considered the gold standard when it comes to evaluating the quality  of machine translation. However, it’s a costly endeavor in terms of effort and time. This is why researchers in the field have developed different means of evaluating MT quality through automated processes.

These processes are designed to approximate how humans will evaluate the MT system. Of course, they are far from perfect at this, but automatic evaluation still has very important use cases.

The main advantage of automatic evaluation over human evaluation is its scalability. It’s much faster to run hundreds of instances of automatic evaluation than even one round of human evaluation. This makes it an ideal solution when making tweaks or optimizing the MT system, which needs quick results.

Challenges in automatic evaluation

Unlike humans, machines aren’t equipped to handle the different nuances of language usage. Automatic evaluation systems are premised on the MT having an exact match with a reference text, and minor differences can have an impact on the final score. These differences can include deviations in morphology, the use of synonyms, and grammatical order.

Anything that a human evaluator would judge technically or more or less correct can still be penalized in automatic evaluation. Nonetheless, the number of exact matches, especially across a large sample of text, is often enough to make automatic evaluation feasible in practice.

Automatic evaluation metrics

There are a number of different automatic evaluation metrics available today. Here are some examples of the ones in use:

BLEU (Bilingual Evaluation Understudy)

NIST (from the National Institute of Standards and Technology)

METEOR (Metric for Evaluation of Translation with Explicit Ordering)

LEPOR (Length-Penalty, Precision, n-gram Position Difference Penalty and Recall)

TER (Translation Error Rate)

Each metric works on a different algorithm and as such handles the process of automatic evaluation differently. That means they have different strengths and weaknesses, and differ as to which kinds of errors they penalize more or less heavily.

BLEU, the most popular metric

Of all the metrics listed above BLEU is the one that is most commonly used. It was one of the first metrics to achieve a high level of correlation with human evaluation, and has spawned many different variations.

It works by scoring individual sentences against a set of high-quality reference translations. The sentence-level n-gram statistics are then aggregated over the whole test set, and the resulting number is the final BLEU score for that MT system. This score represents how closely the MT system's output matches the human reference translations, which serve as the marker of quality.

The scores are calculated using units called n-grams, which refer to segments of consecutive text. Going back to the earlier sample sentence:

The quick brown fox jumped over the lazy dog.

This can be divided into n-grams of different length. A 2-gram, for example, would be “The quick”, “quick brown”, or “brown fox”. A 3-gram would be “The quick brown” or “quick brown fox”. A 4-gram would be “The quick brown fox”. And so on.
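Here's a small, generic snippet showing how n-grams can be extracted from the sample sentence (not tied to any particular BLEU implementation):

```python
def ngrams(tokens, n):
    """Return all consecutive n-token segments of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The quick brown fox jumped over the lazy dog .".split()
print(ngrams(tokens, 2)[:3])
# -> [('The', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```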

It’s a complex mathematical process, but in basic terms BLEU’s algorithm calculates the score by checking for the number of overlaps between n-grams. The calculated score will be between 0 and 1, with 1 representing a completely identical match between the reference and the output sentence. Now take the following variation on the sample sentence:

The fast brown fox jumped over the lazy dog.

All of the n-grams will match except the ones that have the word “fast”. Another example:

The quick brown fox jumped over the dog.

In this example, the word "lazy" is missing, which also reduces the n-gram overlap. In both cases, the BLEU score would still be high, but less than 1.
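To see this in practice, the sketch below scores both variant sentences against the original using the sacreBLEU library (assuming it is installed). The exact numbers depend on tokenization and smoothing settings, but both variants end up below a perfect score.

```python
from sacrebleu.metrics import BLEU

reference = ["The quick brown fox jumped over the lazy dog."]
hypotheses = [
    "The fast brown fox jumped over the lazy dog.",  # one word changed
    "The quick brown fox jumped over the dog.",      # one word missing
]

# Note: sacreBLEU reports scores on a 0-100 scale rather than 0-1.
bleu = BLEU(effective_order=True)
for hyp in hypotheses:
    score = bleu.sentence_score(hyp, reference)
    print(f"{score.score:5.1f}  {hyp}")
```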

In practice, not many sentences will show this high a level of overlap. As such, BLEU scores become statistically meaningful only when computed over a large sample of text, or corpus.

There are, of course, other factors that go into calculating the BLEU score, such as penalties for extra words or very short sentences. Other derivative scoring systems have been developed to compensate for its shortcomings, but BLEU remains highly regarded and continues to be the most widely used MT evaluation metric today.

Final words on MT evaluation

And that covers the basics of machine translation evaluation. As we have shown, assessing an MT system can be done through human evaluation or automatic evaluation. Both processes have their advantages and disadvantages.

Human evaluation is the gold standard in terms of quality, but it is expensive and time-consuming. Automatic evaluation is not as accurate, but it's quick and scalable. As such, both types have specific use cases where they shine.

Approaches to Human and Machine Translation Quality Assessment

  • First Online: 14 July 2018

  • Sheila Castilho 6 ,
  • Stephen Doherty 7 ,
  • Federico Gaspari 6 , 8 &
  • Joss Moorkens 9  

Part of the book series: Machine Translation: Technologies and Applications (MATRA, volume 1)


In both research and practice, translation quality assessment is a complex task involving a range of linguistic and extra-linguistic factors. This chapter provides a critical overview of the established and developing approaches to the definition and measurement of translation quality in human and machine translation workflows across a range of research, educational, and industry scenarios. We intertwine literature from several interrelated disciplines dealing with contemporary translation quality assessment and, while we acknowledge the need for diversity in these approaches, we argue that there are fundamental and widespread issues that remain to be addressed, if we are to consolidate our knowledge and practice of translation quality assessment in increasingly technologised environments across research, teaching, and professional practice.


For a comprehensive review on translation theories in relation to quality see Munday ( 2008 ); Pym ( 2010 ); Drugan ( 2013 ); House ( 2015 ).

http://www.astm.org/Standards/F2575.htm

A new ISO proposal was accepted in 2017 and is under development at the time of writing. The new standard is ISO/AWI 21999 “Translation quality assurance and assessment – Models and metrics”. Details are at https://www.iso.org/standard/72345.html

http://www.qt21.eu/

The notion of ‘recall’ intended here is borrowed from cognitive psychology, and it should not be confused with the concept of ‘recall’ (as opposed to ‘precision’) more commonly used to assess natural language processing tasks and, in particular, the performance of MT systems, e.g. with automatic evaluation metrics, which are discussed in more detail in Sect. 4 (for an introduction to the role of precision and recall in automatic MTE metrics, see Koehn 2009 : 222).

See http://www.statmt.org/

The notion of ‘usability’ discussed here is different from that of ‘adequacy’ covered in Sect. 3.1 , as it involves aspects of practical operational validity and effectiveness of the translated content, e.g. whether a set of translated instructions enable a user to correctly operate a device to perform a specific function or achieve a particular objective (say, update the contact list in a mobile phone, adding a new item).

In Daems et al. ( 2015 ), the average number of production units refers to the number of production units of a segment divided by the number of source text words in that segment. The average time per word indicates the total time spent editing a segment, divided by the number of source text words in that segment. The average fixation duration is based on the total fixation duration (in milliseconds) of a segment divided by the number of fixations within that segment. The average number of fixations results from the number of fixations in a segment divided by the number of source text words in that segment. The pause ratio is given by the total time in pauses (in milliseconds) for a segment divided by the total editing time (in milliseconds) for that segment and, finally, the average pause ratio is the average time per pause in a segment divided by the average time per word in a segment.

Abdallah K (2012) Translators in production networks. Reflections on Agency, Quality and Ethics. Dissertation, University of Eastern Finland

Adab B (2005) Translating into a second language: can we, should we? In: Anderman G, Rogers M (eds) In and out of English for better, for worse? Multilingual Matters, Clevedon, pp 227–241

Alabau V, Bonk R, Buck C, Carl M, Casacuberta F, García-Martínez M, González J, Koehn P, Leiva L, Mesa-Lao B, Ortiz D, Saint-Amand H, Sanchis G, Tsoukala C (2013) CASMACAT: an open source workbench for advanced computer aided translation. Prague Bull Math Linguist 100(1):101–112

Allen J (2003) Post-editing. In: Somers H (ed) Computers and translation: a translator’s guide. John Benjamins, Amsterdam, pp 297–317

Arnold D, Balkan L, Meijer S, Lee Humphreys R, Sadler L (1994) Machine translation: an introductory guide. Blackwell, Manchester

Aziz W, Sousa SCM, Specia L (2012) PET: a tool for post-editing and assessing machine translation. In: Calzolari N, Choukri K, Declerck T, Doğan MU, Maegaard B, Mariani J, Moreno A, Odijk J, Piperidis S (eds) Proceedings of the eighth international conference on language resources and evaluation, Istanbul, pp 3982–3987

Baker M (1992) In other words: a coursebook on translation. Routledge, London

Björnsson CH (1971) Læsbarhed. København, Gad

Bojar O, Ercegovčević M, Popel M, Zaidan OF (2011) A grain of salt for the WMT manual evaluation. In: Proceedings of the 6th workshop on Statistical Machine Translation, Edinburgh, 30–31 July 2011, pp 1–11

Byrne J (2006) Technical translation: usability strategies for translating technical documentation. Springer, Heidelberg

Callison-Burch C, Osborne M, Koehn P (2006) Re-evaluating the role of BLEU in machine translation research. In: Proceedings of 11th conference of the European chapter of the association for computational linguistics 2006, Trento, 3–7 April, pp 249–256

Callison-Burch C, Fordyce C, Koehn P, Monz C, Schroeder J (2007) (Meta-)evaluation of machine translation. In: Proceedings of the second workshop on Statistical Machine Translation, Prague, pp 136–158

Callison-Burch C, Koehn P, Monz C, Schroeder J (2009) Findings of the 2009 workshop on Statistical Machine Translation. In: Proceedings of the 4th EACL workshop on Statistical Machine Translation, Athens, 30–31 March 2009, p 1–28

Callison-Burch C, Koehn P, Monz C, Zaidan OF (2011) Findings of the 2011 Workshop on Statistical Machine Translation. In: Proceedings of the 6th Workshop on Statistical Machine Translation, 30–31 July, 2011, Edinburgh, pp 22–64

Campbell S (1998) Translation into the second language. Longman, New York

Canfora C, Ottmann A (2016) Who’s afraid of translation risks? Paper presented at the 8th EST Congress, Aarhus, 15–17 September 2016

Carl M (2012) Translog – II: a program for recording user activity data for empirical reading and writing research. In: Calzolari N, Choukri K, Declerck T, Doğan MU, Maegaard B, Mariani J, Moreno A, Odijk J, Piperidis S (eds) Proceedings of the eight international conference on language resources and evaluation, Istanbul, 23–25 May 2014, pp 4108–4112

Carl M, Gutermuth S, Hansen-Schirra S (2015) Post-editing machine translation: a usability test for professional translation settings. In: Ferreira A, Schwieter JW (eds) Psycholinguistic and cognitive inquiries into translation and interpreting. John Benjamins, Amsterdam, pp 145–174

Castilho S (2016) Measuring acceptability of machine translated enterprise content. PhD thesis, Dublin City University

Castilho S, O’Brien S (2016) Evaluating the impact of light post-editing on usability. In: Proceedings of the tenth international conference on language resources and evaluation, Portorož, 23–28 May 2016, pp 310–316

Castilho S, O’Brien S, Alves F, O’Brien M (2014) Does post-editing increase usability? A study with Brazilian Portuguese as target language. In: Proceedings of the seventeenth annual conference of the European Association for Machine Translation, Dubrovnik, 16–18 June 2014, pp 183–190

Catford J (1965) A linguistic theory of translation. Oxford University Press, Oxford

Chan YS, Ng HT (2008) MAXSIM: an automatic metric for machine translation evaluation based on maximum similarity. In: Proceedings of the MetricsMATR workshop of AMTA-2008, Honolulu, Hawaii, pp 55–62

Chomsky N (1969) Aspects of the theory of syntax. MIT Press, Cambridge, MA

Coughlin D (2003) Correlating automated and human assessments of machine translation quality. In: Proceedings of the Machine Translation Summit IX, New Orleans, 23–27 September 2003, pp 63–70

Daems J, Vandepitte S, Hartsuiker R, Macken L (2015) The Impact of machine translation error types on post-editing effort indicators. In: Proceedings of the 4th workshop on post-editing technology and practice, Miami, 3 November 2015, pp 31–45

Dale E, Chall JS (1948) A formula for predicting readability: instructions. Educ Res Bull 27(2):37–54

De Almeida G, O’Brien S (2010) Analysing post-editing performance: correlations with years of translation experience. In: Hansen V, Yvon F (eds) Proceedings of the 14th annual conference of the European Association for Machine Translation, St. Raphaël, 27–28 May 2010. Available via: http://www.mt-archive.info/EAMT-2010-Almeida.pdf . Accessed 10 Jan 2017

De Beaugrande R, Dressler W (1981) Introduction to text linguistics. Longman, New York

Debove A, Furlan S, Depraetere I (2011) A contrastive analysis of five automated QA tools (QA distiller. 6.5.8, Xbench 2.8, ErrorSpy 5.0, SDLTrados 2007 QA checker 2.0 and SDLX 2007 SP2 QA check). In: Depraetere I (ed) Perspectives on translation quality. Walter de Gruyter, Berlin, pp 161–192

DePalma D, Kelly N (2009) The business case for machine translation. Common Sense Advisory, Boston

Depraetere I (2010) What counts as useful advice in a university post-editing training context? Report on a case study. In: Proceedings of the 14th annual conference of the European Association for Machine Translation, St. Raphaël, 27–28 May 2010. Available via: http://www.mt-archive.info/EAMT-2010-Depraetere-2.pdf . Accessed 12 May 2017

Doddington G (2002) Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In: Proceedings of the second international conference on human language technology research, San Diego, pp 138–145

Doherty S (2012) Investigating the effects of controlled language on the reading and comprehension of machine translated texts. PhD dissertation, Dublin City University

Doherty S (2016) The impact of translation technologies on the process and product of translation. Int J Commun 10:947–969

Doherty S (2017) Issues in human and automatic translation quality assessment. In: Kenny D (ed) Human issues in translation technology. Routledge, London, pp 154–178

Doherty S, O’Brien S (2014) Assessing the usability of raw machine translated output: a user-centred study using eye tracking. Int J Hum Comput Interact 30(1):40–51

Doherty S, Gaspari F, Groves D, van Genabith J, Specia L, Burchardt A, Lommel A, Uszkoreit H (2013) Mapping the industry I: findings on translation technologies and quality assessment. Available via: http://www.qt21.eu/launchpad/sites/default/files/QTLP_Survey2i.pdf . Accessed 12 May 2017

Drugan J (2013) Quality in professional translation: assessment and improvement. Bloomsbury, London

Dybkjær L, Bernsen N, Minker W (2004) Evaluation and usability of multimodal spoken language dialogue systems. Speech Comm 43(1):33–54

Federmann C (2012) Appraise: an open-source toolkit for manual evaluation of MT output. Prague Bull Math Linguist 98:25–35

Fields P, Hague D, Koby GS, Lommel A, Melby A (2014) What is quality? A management discipline and the translation industry get acquainted. Revista Tradumàtica 12:404–412. Available via: https://ddd.uab.cat/pub/tradumatica/tradumatica_a2014n12/tradumatica_a2014n12p404.pdf . Accessed 12 May 2017

Flesch R (1948) A new readability yardstick. J Appl Psychol 32(3):221–233

Gaspari F (2004) Online MT services and real users’ needs: an empirical usability evaluation. In: Frederking RE, Taylor KB (eds) Proceedings of AMTA 2004: 6th conference of the Association for Machine Translation in the Americas “Machine translation: from real users to research”. Springer, Berlin, pp 74–85

Gaspari F, Almaghout H, Doherty S (2015) A survey of machine translation competences: insights for translation technology educators and practitioners. Perspect Stud Translatol 23(3):333–358

Giménez J, Màrquez L (2008) A smorgasbord of features for automatic MT evaluation. In: Proceedings of the third workshop on Statistical Machine Translation, Columbus, pp 195–198

Giménez J, Màrquez L, Comelles E, Catellón I, Arranz V (2010) Document-level automatic MT evaluation based on discourse representations. In: Proceedings of the joint fifth workshop on Statistical Machine Translation and MetricsMATR, Uppsala, pp 333–338

Guerberof A (2014) Correlations between productivity and quality when post-editing in a professional context. Mach Transl 28(3–4):165–186

Guzmán F, Joty S, Màrquez L, Nakov P (2014) Using discourse structure improves machine translation evaluation. In: Proceedings of the 52nd annual meeting of the Association for Computational Linguistics, Baltimore, June 23–25 2014, pp 687–698

Harrison C (1980) Readability in the classroom. Cambridge University Press, Cambridge

Holmes JS (1988) Translated! Papers on literary translation and translation studies. Rodopi, Amsterdam

House J (1997) Translation quality assessment. A model revisited. Gunter Narr, Tübingen

House J (2001) Translation quality assessment: linguistic description versus social evaluation. Meta 46(2):243–257

House J (2015) Translation quality assessment: past and present. Routledge, London

Hovy E, King M, Popescu-Belis A (2002) Principles of context-based machine translation evaluation. Mach Transl 17(1):43–75

International Organization for Standardisation (2002) ISO/TR 16982:2002 ergonomics of human-system interaction—usability methods supporting human centred design. International Organization for Standardisation, Geneva. Available via: http://www.iso.org/iso/catalogue_detail?csnumber=31176 . Accessed 20 May 2017

International Organization for Standardisation (2012) ISO/TS 11669:2012 technical specification: translation projects – general guidance. International Organization for Standardisation, Geneva. Available via: https://www.iso.org/standard/50687.html . Accessed 20 May 2017

Jones MJ (1988) A longitudinal study of the readability of the chairman’s narratives in the corporate reports of a UK company. Account Bus Res 18(72):297–305

Karwacka W (2014) Quality assurance in medical translation. JoSTrans 21:19–34

Kincaid JP, Fishburne RP Jr, Rogers RL, Chissom BS (1975) Derivation of new readability formulas (automated readability index, Fog count and Flesch reading ease formula) for navy enlisted personnel (No. RBR-8-75). Naval Technical Training Command Millington TN Research Branch

Klerke S, Castilho S, Barrett M, Søgaard A (2015) Reading metrics for estimating task efficiency with MT output. In: Proceedings of the sixth workshop on cognitive aspects of computational language learning, Lisbon, 18 September 2015, pp 6–13

Koby GS, Fields P, Hague D, Lommel A, Melby A (2014) Defining translation quality. Revista Tradumàtica 12:413–420. Available via: https://ddd.uab.cat/pub/tradumatica/tradumatica_a2014n12/tradumatica_a2014n12p413.pdf . Accessed 12 May 2017

Koehn P (2009) Statistical machine translation. Cambridge University Press, Cambridge

Koehn P (2010) Enabling monolingual translators: post-editing vs. options. In: Proceedings of human language technologies: the 2010 annual conference of the North American chapter of the ACL, Los Angeles, pp 537–545

Koponen M (2012) Comparing human perceptions of post-editing effort with post-editing operations. In: Proceedings of the seventh workshop on Statistical Machine Translation, Montréal, 7–8 June 2012, pp 181–190

Krings HP (2001) Repairing texts: empirical investigations of machine translation post-editing processes. Kent State University Press, Kent

Kushner S (2013) The freelance translation machine: algorithmic culture and the invisible industry. New Media Soc 15(8):1241–1258

Kussmaul P (1995) Training the translator. John Benjamins, Amsterdam

Labaka G, España-Bonet C, Marquez L, Sarasola K (2014) A hybrid machine translation architecture guided by syntax. Mach Transl 28(2):91–125

Lacruz I, Shreve GM (2014) Pauses and cognitive effort in post-editing. In: O’Brien S, Balling LW, Carl M, Simard M, Specia L (eds) Post-editing of machine translation: processes and applications. Cambridge Scholars Publishing, Newcastle-Upon-Tyne, pp 246–272

Lassen I (2003) Accessibility and acceptability in technical manuals: a survey of style and grammatical metaphor. John Benjamins, Amsterdam

Lauscher S (2000) Translation quality assessment: where can theory and practice meet? Translator 6(2):149–168

Lavie A, Agarwal A (2007) METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In: Proceedings of the workshop on Statistical Machine Translation, Prague, pp 228–231

Liu C, Dahlmeier D, Ng HT (2011) Better evaluation metrics lead to better machine translation. In: Proceedings of the 2011 conference on empirical methods in natural language processing, Edinburgh, 27–31 July 2011, pp 375–384

Lommel A, Uszkoreit H, Burchardt A (2014) Multidimensional Quality Metrics (MQM): a framework for declaring and describing translation quality metrics. Revista Tradumàtica 12:455–463. Available via: https://ddd.uab.cat/pub/tradumatica/tradumatica_a2014n12/tradumatica_a2014n12p455.pdf . Accessed 12 May 2017

Moorkens J (2017) Under pressure: translation in times of austerity. Perspect Stud Trans Theory Pract 25(3):464–477

Moorkens J, O’Brien S, da Silva IAL, de Lima Fonseca NB, Alves F (2015) Correlations of perceived post-editing effort with measurements of actual effort. Mach Transl 29(3):267–284

Moran J, Lewis D, Saam C (2014) Analysis of post-editing data: a productivity field test using an instrumented CAT tool. In: O’Brien S, Balling LW, Carl M, Simard M, Specia L (eds) Post-editing of machine translation: processes and applications. Cambridge Scholars Publishing, Newcastle-Upon-Tyne, pp 128–169

Muegge U (2015) Do translation standards encourage effective terminology management? Revista Tradumàtica 13:552–560. Available via: https://ddd.uab.cat/pub/tradumatica/tradumatica_a2015n13/tradumatica_a2015n13p552.pdf . Accessed 2 May 2017

Munday J (2008) Introducing translation studies: theories and applications. Routledge, London

Muzii L (2014) The red-pen syndrome. Revista Tradumàtica 12:421–429. Available via: https://ddd.uab.cat/pub/tradumatica/tradumatica_a2014n12/tradumatica_a2014n12p421.pdf . Accessed 30 May 2017

Nida E (1964) Toward a science of translation. Brill, Leiden

Nielsen J (1993) Usability engineering. Morgan Kaufmann, Amsterdam

Nießen S, Och FJ, Leusch G, Ney H (2000) An evaluation tool for machine translation: fast evaluation for MT research. In: Proceedings of the second international conference on language resources and evaluation, Athens, 31 May–2 June 2000, pp 39–45

Nord C (1997) Translating as a purposeful activity. St. Jerome, Manchester

O’Brien S (2011) Towards predicting post-editing productivity. Mach Transl 25(3):197–215

O’Brien S (2012) Towards a dynamic quality evaluation model for translation. JoSTrans 17:55–77

O’Brien S, Roturier J, de Almeida G (2009) Post-editing MT output: views from the researcher, trainer, publisher and practitioner. Paper presented at the Machine Translation Summit XII, Ottawa, 26 August 2009

O’Brien, S, Choudhury R, Van der Meer J, Aranberri Monasterio N (2011) Dynamic quality evaluation framework. Available via: https://goo.gl/eyk3Xf . Accessed 21 May 2017

O’Brien S, Simard M, Specia L (eds) (2012) Workshop on post-editing technology and practice (WPTP 2012). In: Conference of the Association for Machine Translation in the Americas (AMTA 2012), San Diego

O’Brien S, Simard M, Specia L (eds) (2013) Workshop on post-editing technology and practice (WPTP 2013). Machine Translation Summit XIV, Nice

O’Brien S, Balling LW, Carl M, Simard M, Specia L (eds) (2014) Post-editing of machine translation: processes and applications. Cambridge Scholars Publishing, Newcastle-Upon-Tyne

O’Hagan M (2012) The impact of new technologies on translation studies: a technological turn. In: Millán C, Bartrina F (eds) The Routledge handbook of translation studies. Routledge, London, pp 503–518

Owczarzak K, van Genabith J, Way A (2007) Evaluating machine translation with LFG dependencies. Mach Transl 21(2):95–119

Padó S, Cer D, Galley M, Jurafsky D, Manning CD (2009) Measuring machine translation quality as semantic equivalence: a metric based on entailment features. Mach Transl 23(2–3):181–193

Papineni K, Roukos S, Ward T, Zhu W (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting on Association for Computational Linguistics, Philadelphia, pp 311–318

Plitt M, Masselot F (2010) A productivity test of Statistical Machine Translation post-editing in a typical localisation context. Prague Bull Math Linguist 93:7–16

Pokorn NK (2005) Challenging the traditional axioms: translation into a non-mother tongue. John Benjamins, Amsterdam

Popović M (2015) ChrF: character n-gram F-score for automatic MT evaluation. In: Proceedings of the 10th workshop on Statistical Machine Translation (WMT-15), Lisbon, 17–18 September 2015, pp 392–395

Popović M, Ney H (2009) Syntax-oriented evaluation measures for machine translation output. In: Proceedings of the fourth workshop on Statistical Machine Translation (StatMT ’09), Athens, pp 29–32

Proctor R, Vu K, Salvendy G (2002) Content preparation and management for web design: eliciting, structuring, searching, and displaying information. Int J Hum Comput Interact 14(1):25–92

Pym A (2010) Exploring translation theories. Routledge, Abingdon

Pym A (2015) Translating as risk management. J Pragmat 85:67–80

Ray R, DePalma D, Pielmeier H (2013) The price-quality link. Common Sense Advisory, Boston

Reeder F (2004) Investigation of intelligibility judgments. In: Frederking RE, Taylor KB (eds) Proceedings of the 6th conference of the Association for MT in the Americas, AMTA 2004. Springer, Heidelberg, pp 227–235

Rehm G, Uszkoreit H (2012) META-NET White Paper Series: Europe’s languages in the digital age. Springer, Heidelberg

Reiss K (1971) Möglichkeiten und Grenzen der Übersetzungskritik. Hueber, Munich

Reiss K, Vermeer HJ (1984) Grundlegung einer allgemeinen Translationstheorie. Niemeyer, Tübingen

Roturier J (2006) An investigation into the impact of controlled English rules on the comprehensibility, usefulness, and acceptability of machine-translated technical documentation for French and German users. Dissertation, Dublin City University

Sacher H, Tng T, Loudon G (2001) Beyond translation: approaches to interactive products for Chinese consumers. Int J Hum Comput Interact 13:41–51

Schäffner C (1997) From ‘good’ to ‘functionally appropriate’: assessing translation quality. Curr Issue Lang Soc 4(1):1–5

Secară A (2005) Translation evaluation: a state of the art survey. In: Proceedings of the eCoLoRe/MeLLANGE workshop, Leeds, 21–23 March 2005, pp 39–44

Smith M, Taffler R (1992) Readability and understandability: different measures of the textual complexity of accounting narrative. Account Audit Account J 5(4):84–98

Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: Proceedings of the 7th conference of the Association for Machine Translation in the Americas: “Visions for the future of Machine Translation”, Cambridge, 8–12 August 2006, pp 223–231

Somers H, Wild E (2000) Evaluating Machine Translation: The Cloze procedure revisited. Paper presented at Translating and the Computer 22, London

Sousa SC, Aziz W, Specia L (2011) Assessing the post-editing effort for automatic and semi-automatic translations of DVD subtitles. Paper presented at the recent advances in natural language processing workshop, Hissar, pp 97–103

Specia L (2011) Exploiting objective annotations for measuring translation post-editing effort. In: Proceedings of the fifteenth annual conference of the European Association for Machine Translation, Leuven, 30–31 May, pp 73–80

Stewart D (2012) Translating tourist texts into English as a foreign language. Liguori, Napoli

Stewart D (2013) From pro loco to pro globo: translating into English for an international readership. Interpret Transl Train 7(2):217–234

Stymne S, Danielsson H, Bremin S, Hu H, Karlsson J, Lillkull AP, Wester M (2012) Eye-tracking as a tool for machine translation error analysis. In: Calzolari N, Choukri K, Declerck T, Doğan MU, Maegaard B, Mariani J, Moreno A, Odijk J, Piperidis S (eds) Proceedings of the eighth international conference on language resources and evaluation, Istanbul, 23–25 May 2012, pp 1121–1126

Stymne S, Tiedemann J, Hardmeier C, Nivre J (2013) Statistical machine translation with readability constraints. In: Proceedings of the 19th Nordic conference of computational linguistics, Oslo, 22–24 May 2013, pp 375–386

Suojanen T, Koskinen K, Tuominen T (2015) User-centered translation. Routledge, Abingdon

Tang J (2017) Translating into English as a non-native language: a translator trainer’s perspective. Translator 23(4):388–403

Tatsumi M (2009) Correlation between automatic evaluation metric scores, post-editing speed, and some other factors. In: Proceedings of MT Summit XII, Ottawa, pp 332–339

Toury G (1995) Descriptive translation studies and beyond. John Benjamins, Amsterdam

Turian JP, Shen L, Melamed ID (2003) Evaluation of machine translation and its evaluation. In: Proceedings of MT Summit IX, New Orleans, pp 386–393

Uszkoreit H, Lommel A (2013) Multidimensional quality metrics: a new unified paradigm for human and machine translation quality assessment. Paper presented at Localisation World, London, 12–14 June 2013

Van Slype G (1979) Critical study of methods for evaluating the quality of machine translation. Bureau Marcel van Dijk, Bruxelles

White J, O’Connell T, O’Mara F (1994) The ARPA MT evaluation methodologies: evolution, lessons and future approaches. In: Technology partnerships for crossing the language barrier, Proceedings of the first conference of the Association for Machine Translation in the Americas, Columbia, pp 193–205

Wilks Y (1994) Stone soup and the French room. In: Zampolli A, Calzolari N, Palmer M (eds) Current issues in computational linguistics: in honour of Don Walker. Linguistica Computazionale IX–X:585–594. Reprinted in Ahmad K, Brewster C, Stevenson M (eds) (2007) Words and intelligence I: selected papers by Yorick Wilks. Springer, Heidelberg, pp 255–265

Williams J (2013) Theories of translation. Palgrave Macmillan, Basingstoke

Wong BTM, Kit C (2012) Extending machine translation evaluation metrics with lexical cohesion to document level. In: Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, Jeju Island, 12–14 July 2012, pp 1060–1068


Acknowledgments

This work has been partly supported by the ADAPT Centre for Digital Content Technology which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

Author information

Authors and Affiliations

ADAPT Centre/School of Computing, Dublin City University, Dublin, Ireland

Sheila Castilho & Federico Gaspari

School of Humanities and Languages, The University of New South Wales, Sydney, Australia

Stephen Doherty

University for Foreigners “Dante Alighieri” of Reggio Calabria, Reggio Calabria, Italy

Federico Gaspari

ADAPT Centre/School of Applied Language and Intercultural Studies, Dublin City University, Dublin, Ireland

Joss Moorkens


Corresponding author

Correspondence to Sheila Castilho.

Editor information

Editors and Affiliations

Sheila Castilho


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Castilho, S., Doherty, S., Gaspari, F., Moorkens, J. (2018). Approaches to Human and Machine Translation Quality Assessment. In: Moorkens, J., Castilho, S., Gaspari, F., Doherty, S. (eds) Translation Quality Assessment. Machine Translation: Technologies and Applications, vol 1. Springer, Cham. https://doi.org/10.1007/978-3-319-91241-7_2


DOI: https://doi.org/10.1007/978-3-319-91241-7_2

Published: 14 July 2018

Publisher Name: Springer, Cham

Print ISBN: 978-3-319-91240-0

Online ISBN: 978-3-319-91241-7


Customizing Neural Machine Translation Models with NVIDIA NeMo, Part 2


In the first post, we walked through the prerequisites for a neural machine translation example from English to Chinese, running the pretrained model with NeMo, and evaluating its performance. In this post, we walk you through curating a custom dataset and fine-tuning the model on that dataset.

Custom data collection

Custom data collection is crucial in model fine-tuning because it enables a model to adapt to the specific requirements of a particular task or domain.

For example, if the task is to translate computer science-related technical articles or blog posts from English to Chinese, it is vital to collect previously translated or human-reviewed blog post pairs as fine-tuning data. Such articles contain concepts and terminology that are common in the computer science field but rare in the pretraining dataset.

We recommend collecting at least a few thousand high-quality samples. After fine-tuning on this tailored data, the model performs much better on technical blog translation tasks.

Data preprocessing pipeline for fine-tuning

You need data preprocessing to filter out invalid and redundant data. The NVIDIA NeMo framework includes NVIDIA NeMo Curator for processing the corpora used in LLM pretraining. However, NMT parallel datasets differ from such corpora: they pair source and target texts, which calls for special filtering methods. Fortunately, NeMo offers most of these functions and scripts out of the box.

You can introduce a simple data preprocessing pipeline to clean English-Chinese parallel translations:

  • Language filtering
  • Length filtering and deduplication
  • Tokenization and normalization (NeMo model only)
  • Converting to JSONL format (ALMA model only)
  • Splitting the datasets

You can also apply additional preprocessing approaches to filter out invalid data; for example, you can use an existing translator to flag potentially incorrect translations. For more information about other data preprocessing methods, see Machine Translation Models.

Original data format

We collected English-Chinese translation pairs in two text files: 

  • en_zh.en stores the English sentences, one per line.
  • en_zh.zh stores the corresponding Chinese translation on each line.

In this section, the data files are retained in the same format after each processing step. 

Language filtering

NeMo provides language ID filtering, which enables you to filter out training data that isn't in the correct language, by using a pretrained language ID classifier from fastText.

Download the lid.176.bin language classifier model from the fastText website:

Use the following code for language ID filtering:
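A minimal stand-in sketch is shown below; it uses the fasttext Python package directly, and the thresholds and exact logic are assumptions rather than the post's original script, so treat it as an illustration only.

```python
# Hypothetical stand-in for language ID filtering (an illustration, not the
# post's original script). Assumes lid.176.bin has been downloaded from the
# official fastText site: https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
import fasttext

model = fasttext.load_model("lid.176.bin")

def lang_of(text: str) -> str:
    # fastText labels look like "__label__en"; take the top prediction.
    labels, _ = model.predict(text.replace("\n", " "))
    return labels[0].replace("__label__", "")

with open("en_zh.en", encoding="utf-8") as f_en, \
     open("en_zh.zh", encoding="utf-8") as f_zh, \
     open("en_zh_preprocessed1.en", "w", encoding="utf-8") as keep_en, \
     open("en_zh_preprocessed1.zh", "w", encoding="utf-8") as keep_zh, \
     open("en_zh_garbage1.en", "w", encoding="utf-8") as drop_en, \
     open("en_zh_garbage1.zh", "w", encoding="utf-8") as drop_zh:
    for en, zh in zip(f_en, f_zh):
        if lang_of(en) == "en" and lang_of(zh) == "zh":
            keep_en.write(en)
            keep_zh.write(zh)
        else:
            drop_en.write(en)
            drop_zh.write(zh)
```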

en_zh_preprocessed1.en and en_zh_preprocessed1.zh are the valid data pairs retained for the next step. Pairs in en_zh_garbage1.en and en_zh_garbage1.zh are discarded. 

Here are examples of filtered-out lines in en_zh_garbage1.zh that are not in Chinese:

Length filtering

Length filtering removes sentence pairs that are shorter than a minimum length or longer than a maximum length in a parallel corpus, as translations that are too short or too long are often noisy. It also filters based on the length ratio between the target and source sentences.

Before running length filtering, you can compute the length ratio at a given percentile with the following script, which gives you insight into a sensible ratio threshold for filtering:
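The percentile computation itself is simple; here is a rough sketch, where the character-level lengths and the 99th percentile are assumptions rather than values from this example:

```python
# Hypothetical sketch: inspect the source/target length-ratio distribution
# before choosing a filtering threshold. Character lengths and the 99th
# percentile are assumptions, not values from the original post.
import numpy as np

with open("en_zh_preprocessed1.en", encoding="utf-8") as f_en, \
     open("en_zh_preprocessed1.zh", encoding="utf-8") as f_zh:
    ratios = [
        max(len(en.strip()), len(zh.strip())) / max(1, min(len(en.strip()), len(zh.strip())))
        for en, zh in zip(f_en, f_zh)
    ]

print("99th percentile length ratio:", np.percentile(ratios, 99))
```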

The length ratio varies across datasets and also depends on the source and target languages.

Run the following script to perform length filtering. In this case --ratio 4.6 is used for the maximum length ratio. 
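For readers who want the logic spelled out, here is a hypothetical plain-Python stand-in for this step; only the 4.6 ratio comes from the example above, while the character-based lengths and the min/max limits are assumptions:

```python
# Hypothetical stand-in for the length/ratio filtering step. The original post
# uses a NeMo script with --ratio 4.6; the character-based lengths and the
# min/max limits below are assumptions.
MIN_LEN, MAX_LEN, MAX_RATIO = 1, 500, 4.6

def keep(en: str, zh: str) -> bool:
    len_en, len_zh = len(en), len(zh)
    if not (MIN_LEN <= len_en <= MAX_LEN and MIN_LEN <= len_zh <= MAX_LEN):
        return False
    return max(len_en, len_zh) / max(1, min(len_en, len_zh)) <= MAX_RATIO

with open("en_zh_preprocessed1.en", encoding="utf-8") as f_en, \
     open("en_zh_preprocessed1.zh", encoding="utf-8") as f_zh, \
     open("en_zh_preprocessed2.en", "w", encoding="utf-8") as out_en, \
     open("en_zh_preprocessed2.zh", "w", encoding="utf-8") as out_zh:
    for en, zh in zip(f_en, f_zh):
        if keep(en.strip(), zh.strip()):
            out_en.write(en)
            out_zh.write(zh)
```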

Similarly, en_zh_preprocessed2.en and en_zh_preprocessed2.zh are the pairs sent to the next step.

Deduplication

In this step, you remove duplicated translation pairs by using the xxhash library.

Use the following Python script for deduplication:
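A minimal deduplication sketch with xxhash could look like the following; hashing the concatenated sentence pair is an assumption about how duplicates are keyed:

```python
# Hypothetical deduplication sketch using xxhash (the post names the library;
# the exact original script is not reproduced here).
import xxhash

seen = set()
with open("en_zh_preprocessed2.en", encoding="utf-8") as f_en, \
     open("en_zh_preprocessed2.zh", encoding="utf-8") as f_zh, \
     open("en_zh_preprocessed3.en", "w", encoding="utf-8") as out_en, \
     open("en_zh_preprocessed3.zh", "w", encoding="utf-8") as out_zh:
    for en, zh in zip(f_en, f_zh):
        key = xxhash.xxh64((en.strip() + "\t" + zh.strip()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            out_en.write(en)
            out_zh.write(zh)
```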

Again, en_zh_preprocessed3.en and en_zh_preprocessed3.zh are the output files of this step.

Tokenization and normalization for the NeMo model

For fine-tuning the NeMo model, additional tokenization and normalization steps are required:

  • Normalization: Standardizes the punctuation in the sentence, such as quotes written in different ways. 
  • Tokenization: Splits punctuation from its neighboring word by adding a space, so that the punctuation is not attached to the word; this is a recommended step for NeMo training.

The script uses different libraries to process different languages. For example, it uses sacremoses for English and Jieba and OpenCC for simplified Chinese.
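A minimal sketch of such a pass, built on the libraries named above, might look like this; the OpenCC configuration name and other details are assumptions:

```python
# Hypothetical sketch of the normalization/tokenization step for the NeMo model,
# using the libraries named in this post (sacremoses for English; jieba and
# OpenCC for Chinese). Details such as the OpenCC config name may differ.
import jieba
import opencc
from sacremoses import MosesPunctNormalizer, MosesTokenizer

norm_en = MosesPunctNormalizer(lang="en")
tok_en = MosesTokenizer(lang="en")
t2s = opencc.OpenCC("t2s")  # traditional -> simplified; config name is an assumption

with open("en_zh_preprocessed3.en", encoding="utf-8") as f_en, \
     open("en_zh_preprocessed3.zh", encoding="utf-8") as f_zh, \
     open("en_zh_final_nemo.en", "w", encoding="utf-8") as out_en, \
     open("en_zh_final_nemo.zh", "w", encoding="utf-8") as out_zh:
    for en, zh in zip(f_en, f_zh):
        en_proc = tok_en.tokenize(norm_en.normalize(en.strip()), return_str=True)
        zh_proc = " ".join(jieba.cut(t2s.convert(zh.strip())))
        out_en.write(en_proc + "\n")
        out_zh.write(zh_proc + "\n")
```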

en_zh_final_nemo.en and en_zh_final_nemo.zh are the dataset files for NeMo fine-tuning.

Converting to JSONL format for the ALMA model

ALMA training requires JSON Lines (JSONL) as raw input data, where each line in the file is a single JSON object.

You must convert the en_zh_preprocessed3.en and en_zh_preprocessed3.zh parallel translation text files to the JSONL format:
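A minimal conversion sketch is shown below. The {"translation": {"en": ..., "zh": ...}} layout is an assumption based on common parallel-data conventions, so check the ALMA repository for the exact schema it expects:

```python
# Hypothetical conversion to JSON Lines. The {"translation": {"en": ..., "zh": ...}}
# layout is an assumption; verify it against the ALMA repository.
import json

with open("en_zh_preprocessed3.en", encoding="utf-8") as f_en, \
     open("en_zh_preprocessed3.zh", encoding="utf-8") as f_zh, \
     open("en_zh_final_alma.jsonl", "w", encoding="utf-8") as out:
    for en, zh in zip(f_en, f_zh):
        record = {"translation": {"en": en.strip(), "zh": zh.strip()}}
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
```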

en_zh_final_alma.jsonl is the dataset file for ALMA training. 

Splitting datasets

The final step is to split the dataset into training, validation, and test sets. You can use the train_test_split function from the sklearn.model_selection package to do this.
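A minimal sketch with train_test_split might look like this; the 80/10/10 proportions and the fixed random seed are assumptions:

```python
# Hypothetical split into train/validation/test sets with scikit-learn.
# The 80/10/10 proportions and random_state are assumptions, not values
# from the original post.
from sklearn.model_selection import train_test_split

with open("en_zh_final_nemo.en", encoding="utf-8") as f_en, \
     open("en_zh_final_nemo.zh", encoding="utf-8") as f_zh:
    pairs = list(zip(f_en.readlines(), f_zh.readlines()))

train, rest = train_test_split(pairs, test_size=0.2, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)

for name, split in [("train", train), ("val", val), ("test", test)]:
    with open(f"en_zh_final_nemo_{name}.en", "w", encoding="utf-8") as out_en, \
         open(f"en_zh_final_nemo_{name}.zh", "w", encoding="utf-8") as out_zh:
        for en, zh in split:
            out_en.write(en)
            out_zh.write(zh)
```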

After splitting, you have the following files for NeMo fine-tuning, from the original en_zh_final_nemo.en and en_zh_final_nemo.zh:

  • en_zh_final_nemo_train.en
  • en_zh_final_nemo_train.zh
  • en_zh_final_nemo_val.en
  • en_zh_final_nemo_val.zh
  • en_zh_final_nemo_test.en
  • en_zh_final_nemo_test.zh

You also have the following files for ALMA fine-tuning from the original en_zh_final_alma.jsonl :

  • train.zh-en.json
  • valid.zh-en.json
  • test.zh-en.json

The output files are renamed to follow the ALMA fine-tuning convention.

Model fine-tuning

In this section, we demonstrate how to fine-tune NeMo and ALMA models separately.

Fine-tuning the NeMo NMT model

Before fine-tuning, ensure that you have followed the instructions in the NMT Evaluation section and downloaded the pretrained NeMo model to /model/pretrained_ckpt/en_zh_24x6.nemo. You can then fine-tune the NeMo EN-ZH model with your custom dataset; the batch size depends on the available GPU memory.

After the training is completed, the results and checkpoints are saved to the ./output/AAYNBaseFineTune path. You can use TensorBoard to visualize the loss curve or review the log files.

Fine-tuning the ALMA NMT model

To train the ALMA model, first clone its repo from GitHub and install the additional dependencies inside the NeMo framework container.

You can place the fine-tuning datasets (train.zh-en.json, valid.zh-en.json, test.zh-en.json) in the /human_written_data/zhen data directory, where the /zhen subdirectory is used in this case for English/Chinese parallel datasets.

The next step is to modify the parameters in the /runs/parallel_ft_lora.sh and /configs/deepspeed_train_config.yaml files. 

Typical fields to be modified in /runs/parallel_ft_lora.sh :

  • per_device_train_batch_size: Training batch size.
  • gradient_accumulation_steps: Number of accumulated steps before gradient update.
  • learning_rate: Adjust according to the batch size and the number of accumulation steps.

Fields to modify in /configs/deepspeed_train_config.yaml :

  • gradient_accumulation_steps: Should match the value in /runs/parallel_ft_lora.sh.
  • num_processes: Number of GPU devices.

Fine-tuning command

To tune ALMA with LoRA for both English-to-Chinese and Chinese-to-English, run the following command in the ALMA repo’s root directory:

The results are stored in the /output directory:

  • adapter_config.json: LoRA config.
  • adapter_model.bin: LoRA weight.
  • trainer_state.json: Training losses.

Fine-tuned model evaluation

In the previous section, you evaluated the performance of the pretrained NeMo and ALMA models without any modification. At this point, you can benchmark the fine-tuned models by running them on the same test dataset again.

Fine-tuned NeMo model evaluation

Use the following script to run the same test dataset:

  • $fine_tuned_nemo_path: Fine-tuned NeMo model path.
  • input_en.txt: English text file.
  • nemo_ft_out_zh.txt: Output translated text file.
  • reference.txt: Reference translation.

The BLEU score is computed at the end.
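As a rough illustration of the same steps (load the fine-tuned checkpoint, translate the test file, and score it with sacrebleu), a Python sketch might look like the following; the checkpoint path is a placeholder and the exact original script may differ:

```python
# Hypothetical evaluation sketch for the fine-tuned NeMo checkpoint. The file
# names follow the bullet list above; the checkpoint path is a placeholder.
import sacrebleu
from nemo.collections.nlp.models import MTEncDecModel

fine_tuned_nemo_path = "./output/AAYNBaseFineTune/checkpoints/finetuned.nemo"  # placeholder

model = MTEncDecModel.restore_from(restore_path=fine_tuned_nemo_path)

with open("input_en.txt", encoding="utf-8") as f:
    src = [line.strip() for line in f]

# For large test sets, translate in batches instead of a single call.
hyp = model.translate(src, source_lang="en", target_lang="zh")

with open("nemo_ft_out_zh.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(hyp) + "\n")

with open("reference.txt", encoding="utf-8") as f:
    ref = [line.strip() for line in f]

# tokenize="zh" applies sacrebleu's Chinese tokenization to hypotheses and references.
print(sacrebleu.corpus_bleu(hyp, [ref], tokenize="zh"))
```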

Fine-tuned ALMA model evaluation 

Evaluating the ALMA model requires running inference manually on the same evaluation dataset. The following script is the inference code for the fine-tuned ALMA model; the model-loading part is slightly different from the one discussed earlier, as it reads the fine-tuned PEFT model and config locally.
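A hypothetical version of such an inference script, using Hugging Face transformers and peft, is sketched below; the base model ID, prompt template, and generation settings are assumptions to adapt to your setup:

```python
# Hypothetical inference sketch for the fine-tuned ALMA model: load a base
# checkpoint, attach the LoRA adapter saved in ./output (adapter_config.json,
# adapter_model.bin), and translate one sentence. The base model ID, prompt
# template, and generation settings are assumptions.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "haoranxu/ALMA-7B"  # assumption: the base checkpoint used for fine-tuning
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(base, "./output")  # directory holding the LoRA adapter

prompt = "Translate this from English to Chinese:\nEnglish: The GPU accelerates deep learning.\nChinese:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, num_beams=5)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```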

You can modify the sample inference script to generate translations for the custom English text file and benchmark its BLEU score as described earlier.

The NeMo framework container provides a convenient environment for various inference and customization tasks:

  • Multimodal models
  • Computer vision
  • Automatic speech recognition (ASR)
  • Text-to-speech (TTS)
  • Neural machine translation (NMT)

In this series, we introduced fine-tuning recipes for the NeMo NMT model and the ALMA NMT model from scratch in the container, covering pretrained model inference and evaluation, data collection and preprocessing, model fine-tuning, and final evaluation.

For more information about building and deploying fully customizable multilingual conversational AI pipelines, see NVIDIA Riva. You can also learn more about real-time bilingual and multilingual speech-to-speech and speech-to-text translation APIs.

To get started with more LLM and distributed training tasks in the NeMo framework, explore the playbooks and developer documentation .

For more information about tackling data curation and evaluation tasks, see the recently released NeMo Curator and NVIDIA NeMo Evaluator microservices and apply for NeMo Microservices early access .

Related resources

  • GTC session: Customizing Foundation Large Language Models in Diverse Languages With NVIDIA NeMo
  • GTC session: Large Language Model Fine-Tuning using NVIDIA NeMo (Presented by Domino Data Lab)
  • GTC session: Large Language Model Fine-Tuning using Parameter Efficient Fine-Tuning (PEFT)
  • SDK: NeMo LLM Service
  • SDK: NeMo Megatron


COMMENTS

  1. Machine translation and its evaluation: a study

    Machine translation (namely MT) has been one of the most popular fields in computational linguistics and Artificial Intelligence (AI). As one of the most promising approaches, MT can potentially break the language barrier of people from all over the world. Despite a number of studies in MT, there are few studies in summarizing and comparing MT ...

  2. [2202.11027] An Overview on Machine Translation Evaluation

    View PDF Abstract: Since the 1950s, machine translation (MT) has become one of the important tasks of AI and development, and has experienced several different periods and stages of development, including rule-based methods, statistical methods, and recently proposed neural network-based learning methods. Accompanying these staged leaps is the evaluation research and development of MT ...

  3. PDF Translation Quality Assessment: A Brief Survey on Manual and Automatic

to machine translation (MT), such as automatic text summarization (ATS), natural language understanding (NLU) and natural language generation (NLG). Machine translation (MT) research, starting from the 1950s (Weaver, 1955), has been one of the main research topics in computational linguistics ...

  4. PDF Master Thesis Using Machine Learning Methods for Evaluating the ...

States forming the NIST algorithm for machine translation evaluation by the National Institute of Standards and Technology. This algorithm weights matching words according to their frequency in the respective reference translation [3]. A second evolution of the BLEU metric is the Metric for Evaluation of Translation with Explicit Ordering,

  5. How to evaluate machine translation: A review of automated and human

Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL-04), Main Volume, Barcelona, Spain, pp. 605–612.

  6. Machine translation systems and quality assessment: a ...

    Nowadays, in the globalised context in which we find ourselves, language barriers can still be an obstacle to accessing information. On occasions, it is impossible to satisfy the demand for translation by relying only in human translators, therefore, tools such as Machine Translation (MT) are gaining popularity due to their potential to overcome this problem. Consequently, research in this ...

  7. PDF Intelligent Hybrid M an -machine Translation Evaluation a Thesis

    Effective evaluation approaches should be concerned with all translation aspects. However, the most important aspect of machine translation is the output quality. This motivated an extensive research effort in the area of evaluating the quality of translation texts. Machine translation evaluation methods can be divided into two main categories ...

  8. PDF Scientific Credibility of Machine Translation Research: A Meta

Abstract. This paper presents the first large-scale meta-evaluation of machine translation (MT). We annotated MT evaluations conducted in 769 research papers published from 2010 to 2020. Our study shows that practices for automatic MT evaluation have dramatically changed during the past decade and follow concerning trends.

  9. A Comprehensive Study of Machine Translation Tools and Evaluation

It is a statistical machine translation decoder tool developed by researcher Philipp Koehn as part of his Ph.D. thesis at the University of Southern California to aid researchers in the area of machine translation. It uses a beam search algorithm. ... The machine translation evaluation is broadly classified into three different categories ...

  10. PDF Multi-Hypothesis Machine Translation Evaluation

Related Work. Meteor (Banerjee and Lavie, 2005) was the first MT evaluation metric to relax the exact match constraint between MT system output and reference translation by allowing matching of lemmas, synonyms or paraphrases. However, this requires linguistic resources which do not exist for most languages.

  11. A Comprehensive Survey on Various Fully Automatic Machine Translation

There are two approaches for machine translation evaluation: manual, subjective, and qualitative assessment, done by human experts, and automatic, objective, and numerical evaluation implemented ...

  12. PDF Empirical Machine Translation and its Evaluation

In this thesis we have exploited current Natural Language Processing technology for Empirical Machine Translation and its Evaluation. On the one side, we have studied the problem of automatic MT evaluation. We have analyzed the main deficiencies of current evaluation methods, which arise, in our opinion, from the shallow

  13. PDF Machine Translationness: a Concept for Machine Translation Evaluation

Machine translationness (MTness) is the linguistic phenomena that make machine translations distinguishable from human translations. This thesis intends to present MTness as a research object and suggests an MT evaluation method based on determining whether the translation is machine-like instead of determining its human-likeness as in

  14. Transforming machine translation: a deep learning system ...

    The quality of human translation was long thought to be unattainable for computer translation systems. In this study, we present a deep-learning system, CUBBITT, which challenges this view. In a ...

  15. Machine translation and its evaluation: a study

    Machine translation (namely MT) has been one of the most popular fields in computational linguistics and Artificial Intelligence (AI). As one of the most promising approaches, MT can potentially break the language barrier of people from all over the world. Despite a number of studies in MT, there are few studies in summarizing and comparing MT methods. To this end, in this paper, we ...

  16. Experts, Errors, and Context: A Large-Scale Study of Human Evaluation

    Abstract. Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been considerable research on human evaluation, the field still lacks a commonly accepted standard procedure. As a step toward this goal, we propose an evaluation ...

  17. An Analysis of the Evaluation of the Translation Quality of Neural

    Ultimately, the evaluation of machine translation quality is a linguistic issue of comparing sentences; therefore, scholars must combine machine translation with linguistic research (Guzmán et al. Citation 2017). At present, most scholars focus on the evaluation of English Chinese machine translation. However, there are still few papers on the ...

  18. Machine Translation Evaluation: The Ultimate Guide

    Machine translation evaluation refers to the different processes of measuring the performance of a machine translation system. It's a way of scoring the quality of MT so that it's possible to know how good the system is, and there's a solid basis to compare how effective different MT systems are. To do this, machine translation evaluation ...

  19. Comparing Machine Translation and Human Translation: A Case Study

    Comparing Machine Translation and Human Translation: A Case Study. November 2017. DOI: 10.26615/978-954-452-042-7_003. Conference: RANLP 2017 - Workshop on Human-Informed Translation and ...

  20. Approaches to Human and Machine Translation Quality Assessment

    In both research and practice, translation quality assessment is a complex task involving a range of linguistic and extra-linguistic factors. This chapter provides a critical overview of the established and developing approaches to the definition and measurement of translation quality in human and machine translation workflows across a range of research, educational, and industry scenarios.

  21. PDF A Review Study of the Application of Machine Translation in ...

The progress of translation technologies has presented new opportunities and challenges to translation education, and the application of machine translation in education has received growing attention among researchers. This paper reviewed a total of 40 studies (19 empirical studies and 21 position papers) published in five SSCI-indexed and three

  22. PDF Repositori Institucional (O2): Home page

  23. Dissertations / Theses: 'Machine translating'

    In this thesis a prototype machine translation system is presented. This system is designed to translate English text into a gloss based representation of South African Sign Language (SASL). ... new and adapted models correlate with human judgements of translation quality and suggest that improvements in general evaluation within machine ...

  24. Word Alignment as Preference for Machine Translation

    The problem of hallucination and omission, a long-standing problem in machine translation (MT), is more pronounced when a large language model (LLM) is used in MT because an LLM itself is susceptible to these phenomena. In this work, we mitigate the problem in an LLM-based MT model by guiding it to better word alignment. We first study the correlation between word alignment and the phenomena ...

  25. Sentiment Analysis Across Languages: Evaluation Before and After

    People communicate in more than 7,000 languages around the world, with around 780 languages spoken in India alone. Despite this linguistic diversity, research on Sentiment Analysis has predominantly focused on English text data, resulting in a disproportionate availability of sentiment resources for English. This paper examines the performance of transformer models in Sentiment Analysis tasks ...

  26. Customizing Neural Machine Translation Models with NVIDIA NeMo, Part 2

In the first post, we walked through the prerequisites for a neural machine translation example from English to Chinese, running the pretrained model with NeMo, and evaluating its performance. In this post, we walk you through curating a custom dataset and fine-tuning the model on that dataset. Custom data collection. Custom data collection is crucial in model fine-tuning because it enables a ...

  27. Improving the quality evaluation process of machine learning algorithms

Powers D (2008) Evaluation: from precision, recall and F-factor to ROC, informedness, markedness & correlation. International Journal of Machine Learning Technology 2(1):37–63. doi:10.48550/arXiv.2010.16061