Idea
- Address GEC as a translation problem: train a large seq-to-seq model on parallel (ungrammatical, corrected) texts
- Distill it into a smaller, more efficient, servable model that is not necessarily seq-to-seq: predict only what you need (see the sketch below)
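To make the contrast concrete, a toy illustration of the two output framings; the sentences and edit tags are invented for illustration and are not the paper's actual data format.

```python
# Seq2seq framing (the large teacher): rewrite the whole sentence.
seq2seq_example = {
    "source": "She go to school yesterday .",
    "target": "She went to school yesterday .",
}

# Edit-prediction framing ("predict only what you need"): tag each source
# token; most tags are KEEP, so the model only has to produce the changes.
edit_example = {
    "source_tokens": ["She", "go", "to", "school", "yesterday", "."],
    "tags": ["KEEP", "REPLACE_went", "KEEP", "KEEP", "KEEP", "KEEP"],
}
```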
Data
Papers
- dataset: cLang-8 (“cleaned Lang-8”)
- the Lang-8 learner corpus covers 80 languages, with language learners correcting each other’s texts
- cLang-8: the same source texts, but with target sequences produced by this paper’s resulting gT5 model
- Previous approaches: multi-stage fine-tuning
- synthetic data
- labeled data
- learning rate and number of steps need to be tuned for both stages
- This paper:
- mT5 (multilingual T5) as the base model
- xxl size (13B parameters)
- pre-trained on mC4 (101 languages) with a span-prediction objective
- GEC unsupervised language-agnostic pre-training
- pretrain on synthetic data generated by automatically corrupting sentences (the corruptions are rather simple but universal across languages; see the sketch below)
- drop spans of tokens
- swap tokens
- drop spans of characters
- swap characters
- insert characters
- lower-case a word
- upper-case the first character of a word
- 2% of the data is left unmodified
- finetune on supervised GEC data
- finetuning datasets:
- English: FCE, W&I (together called BEA)
- Czech, German, Russian: AKCES-GEC, Falko-MERLIN, RULEC-GEC
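A minimal Python sketch of the corruption procedure above; the probabilities, span lengths, and operation mix are illustrative guesses, not the paper's actual hyperparameters.

```python
import random

def corrupt(sentence: str, keep_prob: float = 0.02) -> str:
    """Return a corrupted copy of `sentence`; the corrupted text is the model
    input and the original sentence is the target."""
    if random.random() < keep_prob:        # leave ~2% of the data unmodified
        return sentence
    tokens = sentence.split()
    op = random.choice([
        "drop_token_span", "swap_tokens", "drop_char_span",
        "swap_chars", "insert_chars", "lowercase_word", "uppercase_first",
    ])
    if op == "drop_token_span" and len(tokens) > 2:
        i = random.randrange(len(tokens))
        del tokens[i:i + random.randint(1, 3)]
    elif op == "swap_tokens" and len(tokens) > 1:
        i = random.randrange(len(tokens) - 1)
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    elif op == "lowercase_word":
        i = random.randrange(len(tokens))
        tokens[i] = tokens[i].lower()
    elif op == "uppercase_first":
        i = random.randrange(len(tokens))
        tokens[i] = tokens[i][:1].upper() + tokens[i][1:]
    else:
        # Character-level corruptions operate on the joined string.
        text = " ".join(tokens)
        i = random.randrange(len(text))
        if op == "drop_char_span":
            text = text[:i] + text[i + random.randint(1, 3):]
        elif op == "swap_chars" and i + 1 < len(text):
            text = text[:i] + text[i + 1] + text[i] + text[i + 2:]
        else:  # insert_chars (also the fallback for degenerate cases above)
            text = text[:i] + random.choice("aeiost") + text[i:]
        return text
    return " ".join(tokens)

# Usage: build (input, target) pretraining pairs from monolingual text.
pairs = [(corrupt(s), s) for s in ["She went to school yesterday ."]]
```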
- Setup:
- Mixing GEC pre-training data with the finetuning data
- Mixing, but with different task prefixes for the two data sources
- Pretraining until convergence, then finetuning -> best method (sketched below)
- constant learning rate
- pretraining converges after ~800k examples (≈7 epochs)
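To make the three setups concrete, a toy sketch of the resulting data streams; the prefix strings ("gec_pretrain", "gec") and the toy example pairs are assumptions, not the paper's actual task prefixes or data.

```python
import itertools
import random

# Toy stand-ins for the two data sources.
pretrain_pairs = [("corrupted a", "clean a"), ("corrupted b", "clean b")]
finetune_pairs = [("learner x", "corrected x"), ("learner y", "corrected y")]

def mixed(prefixes: bool = False):
    """Setups 1 and 2: one training run that samples from both sources,
    optionally marking each example with a per-source task prefix."""
    while True:
        name, pool = random.choice(
            [("gec_pretrain", pretrain_pairs), ("gec", finetune_pairs)]
        )
        src, tgt = random.choice(pool)
        yield (f"{name}: {src}" if prefixes else src, tgt)

def sequential(pretrain_examples: int = 800_000):
    """Setup 3 (best): pretrain until convergence (~800k examples per the
    note above), then switch to the supervised data; constant learning rate
    throughout."""
    yield from itertools.islice(itertools.cycle(pretrain_pairs), pretrain_examples)
    yield from itertools.cycle(finetune_pairs)
```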
- Student models:
- trained on cLang-8
- if the student model is further finetuned on BEA, the score actually drops:
- since the cLang-8 targets are generated by a model already tuned on BEA, further finetuning the student on BEA may simply overfit?
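To tie the pieces together, a rough sketch of the distillation flow that produces cLang-8 and the student; `teacher_correct` and `train_student` are hypothetical placeholders standing in for gT5 inference and ordinary supervised training, not real APIs.

```python
# Uncorrected learner sentences (toy stand-ins for the Lang-8 sources).
lang8_sources = [
    "I goes to school yesterday .",
    "She have many book .",
]

def teacher_correct(sentence: str) -> str:
    """Placeholder for decoding with the large finetuned gT5 teacher."""
    return sentence  # identity stand-in; a real teacher would return the fix

# 1. Relabel the Lang-8 sources with teacher outputs -> cLang-8.
clang8 = [(src, teacher_correct(src)) for src in lang8_sources]

# 2. Train the smaller student on the distilled (source, target) pairs only;
#    per the note above, further finetuning on BEA afterwards hurts.
def train_student(pairs):
    for src, tgt in pairs:
        pass  # one supervised update per pair (seq2seq or edit prediction)

train_student(clang8)
```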