Evaluating grammatical error corrections

Organized by cnapoles


Nov. 23, 2016, midnight UTC


Competition Ends

This platform evaluates grammatical error corrections of the CoNLL 2014 Shared Task test set [1], and is released to accompany the following paper:

Courtney Napoles, Keisuke Sakaguchi, and Joel Tetreault
There’s No Comparison: Reference-less Evaluation Metrics in Grammatical Error Correction
EMNLP 2016

Please include the following citation if you use this toolkit.

  author    = {Napoles, Courtney  and  Sakaguchi, Keisuke  and  Tetreault, Joel},
  title     = {There's No Comparison: Reference-less Evaluation Metrics in Grammatical Error Correction},
  booktitle = {Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing},
  month     = {November},
  year      = {2016},
  address   = {Austin, Texas},
  publisher = {Association for Computational Linguistics},
  pages     = {2109--2115},
  url       = {https://aclweb.org/anthology/D16-1228}

The code for executing this evaluation program is also available from our git repository: https://github.com/cnap/grammaticality-metrics


The CoNLL 2014 test set can be obtained from the official shared task website:


The following metrics and reference sets are supported in this competition:


  • Reference-based metrics (RBMs)
    • GLEU [2]
    • I-measure [3] (not supported in CodaLab)
    • M2 [4]
  • Grammaticality-based metrics (GBMs)
    • LT (LanguageTool)
  • Interpolated metrics
    • LT interpolated with each RBM

Reference sets

  • NUCLE references [1]
  • non-expert fluency edits [5]
  • non-expert minimal edits [5]
  • expert fluency edits [5]
  • expert minimal edits [5]


The scripts for calculating GLEU, I-measure, and M2 were modified to return sentence-level scores and to be callable from an external program. As of this writing, CodaLab does not support Java 8, so we use the most recent version of LanguageTool that supports Java 7 (v3.1). I-measure takes several minutes to run, which exceeds the time limit CodaLab imposes on scoring programs. It is therefore not enabled in the online CodaLab competition, but you can run it from the original repository (https://github.com/mfelice/imeasure) or our git repository (https://github.com/cnap/grammaticality-metrics).


1. Ng et al. The CoNLL-2014 Shared Task on grammatical error correction. In Proceedings of CoNLL, 2014.
2. Napoles et al. Ground truth for grammatical error correction metrics. In Proceedings of ACL, 2015.
3. Felice and Briscoe. Towards a standard evaluation method for grammatical error detection and correction. In Proceedings of NAACL, 2015.
4. Dahlmeier and Ng. Better evaluation for grammatical error correction. In Proceedings of NAACL, 2012.
5. Sakaguchi et al. Reassessing the goals of grammatical error correction: Fluency instead of grammaticality. TACL, 2016.


In the future, we hope to expand the evaluation to include new data, references, and metrics. Feel free to contact us with any suggestions, questions, or comments.

Courtney Napoles (napoles@cs.jhu.edu)

Each submission should be a zip file containing answer.txt, which contains one sentence of output per line, aligned to the original CoNLL 2014 test set.
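
A minimal sketch of packaging a submission in this format, assuming the system output is held as a list of strings in test-set order. The function name `write_submission` and the default archive name `submission.zip` are illustrative and not part of CodaLab or the scoring toolkit; placing answer.txt at the top level of the archive is also an assumption.

```python
import zipfile


def write_submission(sentences, zip_path="submission.zip"):
    """Write one corrected sentence per line to answer.txt inside a zip archive."""
    lines = []
    for sentence in sentences:
        # Each sentence must occupy exactly one line to stay aligned with the test set.
        if "\n" in sentence or "\r" in sentence:
            raise ValueError("sentences must not contain line breaks")
        lines.append(sentence.strip())
    payload = "\n".join(lines) + "\n"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # Assumption: answer.txt sits at the top level of the archive, not in a folder.
        archive.writestr("answer.txt", payload.encode("utf-8"))
    return zip_path
```

An empty output sentence is written as an empty line, so the line count still matches the number of input sentences.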

The following metrics will be calculated on different reference sets:

  • LT
  • GLEU
  • M2
  • Interpolated LT+GLEU and LT+M2, with lambda optimized for Pearson's and for Spearman's correlation coefficients

Reference sets:

  • NUCLE annotations
  • Expert and non-expert fluency and minimal edits
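
The interpolated scores listed above combine a grammaticality-based score with a reference-based score through a weight lambda. A minimal sketch of such a linear interpolation follows. It assumes both scores are on a 0 to 1 scale and that lambda is the weight on the reference-based score; that convention is chosen here for illustration and should be checked against the paper and the scoring code. The function name `interpolate` is hypothetical.

```python
def interpolate(gbm_score, rbm_score, lam):
    """Linearly interpolate a grammaticality-based and a reference-based score.

    Assumption: lam in [0, 1] is the weight on the reference-based score,
    so lam=0 gives the pure GBM (e.g. LT) and lam=1 the pure RBM (e.g. GLEU).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be between 0 and 1")
    return (1.0 - lam) * gbm_score + lam * rbm_score
```

With this form, a sentence that LT scores highly but that matches the references poorly is pulled toward the reference-based score as lambda grows.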

The full scores can be obtained by downloading the scoring output. To obtain scores using I-measure or other reference sets, please download the scorer. We have excluded I-measure from the CodaLab site because it can take several minutes to run, during which CodaLab may time out.

This site was created to support uniform evaluation of grammatical error corrections and is not a true competition. Any information submitted to the "competition" is the property of the participant and will not be used by the organizers for any purposes.

Because the test set is entirely public, gamed submissions are possible, so the highest scores on the leaderboard may not be the true best results on this dataset.

Visibility and Privacy

Unless you submit your results to the leaderboard or post in a forum, your participation in this "competition" is hidden, except to the organizers. Submitting results to the leaderboard is optional. When a result is submitted to the leaderboard, any participant can view the system output.

If you wish for your system output to remain private, do not submit the results to the leaderboard. At this time, CodaLab does not provide support for participants to delete their submissions. If you wish for your submissions to be removed, please contact the organizers. CodaLab is set up so that the organizers can view all submissions, but we pledge not to view or use any results that are not submitted to the leaderboard.

If you wish to remain entirely anonymous, the scoring program can be downloaded and run locally: https://github.com/cnap/grammaticality-metrics

Please contact the organizers of this competition with any questions or concerns.


Start: Nov. 23, 2016, midnight

Description: Evaluate corrections of the CoNLL-2014 shared task test set with GLEU, GLEU interpolated with LT (lambda optimized to Spearman's rho), and M2. GLEU and M2 are calculated using all available references. View the scoring output log to see the scores from other metrics and reference sets. If your submission fails, please make sure that you have uploaded a zipped file containing answer.txt. Occasionally, CodaLab raises a permissions error, and the CodaLab developers do not know why this happens. If you receive this error, please resubmit (or contact the organizers to rerun the scorer over your existing submission). We apologize for this inconvenience!
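
Before resubmitting after a failure, the archive can be checked locally. The sketch below assumes answer.txt must sit at the top level of the zip and takes the expected number of sentences as a parameter supplied by the caller. The function name `check_submission` is illustrative and not part of CodaLab or the scoring toolkit.

```python
import zipfile


def check_submission(zip_path, expected_lines=None):
    """Return a list of problems found in a submission archive (empty if none)."""
    problems = []
    if not zipfile.is_zipfile(zip_path):
        return ["not a valid zip file"]
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the scorer looks for answer.txt at the top level.
        if "answer.txt" not in archive.namelist():
            return ["answer.txt is missing from the top level of the archive"]
        text = archive.read("answer.txt").decode("utf-8")
    lines = text.split("\n")
    # A trailing newline produces one empty final element; drop it before counting.
    if lines and lines[-1] == "":
        lines.pop()
    if expected_lines is not None and len(lines) != expected_lines:
        problems.append(
            "expected %d lines but found %d" % (expected_lines, len(lines))
        )
    return problems
```

Passing the number of sentences in the test set as `expected_lines` catches output that has drifted out of alignment with the source sentences.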


