• Address GEC as a translation problem, train large seq-to-seq model on parallel texts
  • Distill to a smaller, more efficient servable model that is not necessarily seq-to-seq - predict only what you need



A Simple Recipe for Multilingual Grammatical Error Correction

  1. data set: cLang-8 (“cleaned Lang-8”)
    • Lang-8 learner corpus covers 80 languages, with language learners correcting each other’s texts
    • cLang-8: above, but with target sequences produced by this paper’s resulting gT5 model
  2. Previous approaches: multi-stage fine-tuning
    • synthetic data
    • labeled data
    • learning rate and steps for both stages need to be tuned
  3. This paper:
    • mT5 (multilingual T5) as base model
      • xxl size, 13B parameters
      • pre-trained on mC4 (101 languages) with span prediction
    • GEC unsupervised language-agnostic pre-training
      • pretrain on synthetic data generated by automatically corrupting sentences (rather simple, but universal across languages)
        1. drop spans of tokens
        2. swap tokens
        3. drop spans of characters
        4. swap characters
        5. insert characters
        6. lower-case a word
        7. upper-case the first character of a word
        8. 2% data un-modified
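The corruption operations above can be sketched as follows; this is a hypothetical illustration, not the paper's code, and the operation sampling, span lengths, and inserted-character alphabet are all assumptions:

```python
import random

def corrupt(sentence: str, rng: random.Random) -> str:
    """Apply one randomly chosen corruption; ~2% of inputs pass through unchanged."""
    if rng.random() < 0.02:
        return sentence  # leave a small fraction of the data un-modified
    tokens = sentence.split()
    op = rng.choice(["drop_span", "swap_tokens", "drop_chars",
                     "swap_chars", "insert_char", "lower", "upper"])
    if op == "drop_span" and len(tokens) > 2:
        i = rng.randrange(len(tokens) - 1)
        del tokens[i:i + rng.randint(1, 2)]          # drop a span of tokens
    elif op == "swap_tokens" and len(tokens) > 1:
        i = rng.randrange(len(tokens) - 1)
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    elif op == "drop_chars":
        i = rng.randrange(len(tokens))
        w = tokens[i]
        if len(w) > 2:
            j = rng.randrange(len(w) - 1)
            tokens[i] = w[:j] + w[j + 2:]            # drop a span of characters
    elif op == "swap_chars":
        i = rng.randrange(len(tokens))
        w = tokens[i]
        if len(w) > 1:
            j = rng.randrange(len(w) - 1)
            tokens[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    elif op == "insert_char":
        i = rng.randrange(len(tokens))
        w = tokens[i]
        j = rng.randrange(len(w) + 1)
        tokens[i] = w[:j] + rng.choice("abcdefghij") + w[j:]
    elif op == "lower":
        i = rng.randrange(len(tokens))
        tokens[i] = tokens[i].lower()                # lower-case a word
    else:  # upper-case the first character of a word
        i = rng.randrange(len(tokens))
        tokens[i] = tokens[i][0].upper() + tokens[i][1:]
    return " ".join(tokens)

print(corrupt("She went to the market yesterday .", random.Random(7)))
```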
      • finetune on supervised GEC data
        1. finetune dataset
          • English: FCE, W&I (together called BEA)
          • Czech, German, Russian: AKCES-GEC, Falko-MERLIN, RULEC-GEC
        2. Setup:
          • Mixing GEC pre-training with finetuning
          • Mixing but with different prefixes
          • Pretrain until convergence, then finetune -> best method
          • constant learning rate
          • pretraining converges at 800k examples ~= 7 epochs
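The two setups above (mixing vs. sequential) can be sketched as training schedules; function names and the `train_step` callback are placeholders, not the paper's actual code:

```python
import itertools
import random

def mixed_schedule(pretrain_batches, finetune_batches, steps, train_step, rng):
    """Single stage: GEC pre-training batches and supervised batches interleaved."""
    for _ in range(steps):
        source = rng.choice([pretrain_batches, finetune_batches])
        train_step(next(source))

def sequential_schedule(pretrain_batches, finetune_batches,
                        pre_steps, ft_steps, train_step):
    """Pre-train until convergence, then fine-tune (the best setup in the paper);
    a constant learning rate is used throughout."""
    for _ in range(pre_steps):
        train_step(next(pretrain_batches))
    for _ in range(ft_steps):
        train_step(next(finetune_batches))
```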
      • Student models:
        • trained on cLang-8
        • further fine-tuning the student model on BEA actually lowers the score:
          • likely because cLang-8 targets were generated by a teacher already tuned on BEA, so further fine-tuning the student on BEA overfits?
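The overall distillation recipe can be sketched as below; `teacher_correct` and `train_step` are hypothetical placeholder callbacks standing in for the fine-tuned gT5 teacher and the student's training update:

```python
def build_clang8(lang8_sources, teacher_correct):
    """Replace the noisy human Lang-8 targets with the teacher's outputs."""
    return [(src, teacher_correct(src)) for src in lang8_sources]

def distill(student_params, clang8_pairs, train_step):
    """One pass of supervised training for the student on cLang-8 pairs."""
    for src, tgt in clang8_pairs:
        student_params = train_step(student_params, src, tgt)
    return student_params
```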

Grammatical Error Correction in Low-Resource Scenarios