Description
First, thanks for this excellent work. However, I ran into some problems when reproducing the results of Flan-T5-XL.
My setup is:
Pretrained model and optimizer:
I used the T5-v1_1-xl pretrained model and followed the training setup in "Scaling Instruction-Finetuned Language Models": batch size 64, dropout 0.05, LR 5e-4, 38K steps, Adafactor optimizer.
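For reference, this is roughly how I configured the model and optimizer. A minimal sketch assuming PyTorch and the HuggingFace `transformers` Adafactor (the paper's runs used T5X/JAX, so this is an approximation of the setting, not the exact training code):

```python
import torch
from transformers import T5Config, T5ForConditionalGeneration, Adafactor

# Dropout 0.05 as in the Flan paper's finetuning setting.
config = T5Config.from_pretrained("google/t5-v1_1-xl")
config.dropout_rate = 0.05

model = T5ForConditionalGeneration.from_pretrained(
    "google/t5-v1_1-xl", config=config
)

# Constant LR 5e-4. scale_parameter/relative_step are disabled so that
# the explicit lr is actually applied (my assumption about the intended
# Adafactor configuration; the paper does not spell these flags out).
optimizer = Adafactor(
    model.parameters(),
    lr=5e-4,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)
```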
Data:
For the data, I first used the training data provided by SirNeural and evaluated the model on MMLU. When I sampled the 5 datasets equally (i.e. cot, flanv2, t0, dialog, niv2), I got 45% 5-shot accuracy on MMLU, which is similar to the without-mixture-balancing result in the paper. However, after I mixed the data with the suggested rates here, the accuracy did not improve (44%).
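My mixing code samples across the five submixtures with fixed probabilities, roughly as below. This is a sketch using HuggingFace `datasets`; the rate values and the `*.jsonl` file names are placeholders (I substituted the suggested rates from this repo):

```python
from datasets import load_dataset, interleave_datasets

# Placeholder rates, not the official ones; must sum to 1.0.
RATES = {
    "cot": 0.05,
    "flanv2": 0.40,
    "t0": 0.30,
    "dialog": 0.05,
    "niv2": 0.20,
}

# One JSONL dump per submixture (file names are an assumption
# about how the SirNeural data was saved locally).
subsets = [
    load_dataset("json", data_files=f"{name}.jsonl", split="train")
    for name in RATES
]

mixed = interleave_datasets(
    subsets,
    probabilities=list(RATES.values()),
    seed=42,
    stopping_strategy="all_exhausted",  # keep sampling until every subset is seen
)
```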
Afterwards, I tried the data provided by Enrico Shippole and mixed it following the suggested rates, but the accuracy became worse (42% on MMLU). I also tried a larger batch size (128, to account for example packing) and deduplicating the data (sketch below), neither of which helped.
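The deduplication was a simple exact-match pass over the input/target pairs, roughly like this (the `inputs`/`targets` field names are an assumption about the dump format):

```python
import hashlib

def dedup(examples):
    """Yield examples whose (inputs, targets) pair has not been seen before."""
    seen = set()
    for ex in examples:
        key = hashlib.md5(
            (ex["inputs"] + "\x1e" + ex["targets"]).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            yield ex
```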
Are there any suggestions for reproducing the MMLU result of the released Flan-T5-XL model (49%), or even the result in the paper (52%)? Thanks a lot.