Description
First, thanks for this excellent work. However, I ran into some problems when reproducing the results of Flan-T5-XL.
My setup is:
Pretrained model and optimizer:
I used the T5-v1_1-xl pretrained model and followed the training setup in "Scaling Instruction-Finetuned Language Models": batch size 64, dropout 0.05, LR 5e-4, 38K steps, Adafactor optimizer.
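For reference, this is roughly how I configured the model and optimizer. A minimal sketch assuming PyTorch and the HuggingFace `transformers` Adafactor (the paper's runs used T5X/JAX, so this is an approximation of the setting, not the exact training code):

```python
import torch
from transformers import T5Config, T5ForConditionalGeneration, Adafactor

# Dropout 0.05 as in the Flan paper's finetuning setting.
config = T5Config.from_pretrained("google/t5-v1_1-xl")
config.dropout_rate = 0.05

model = T5ForConditionalGeneration.from_pretrained(
    "google/t5-v1_1-xl", config=config
)

# Constant LR 5e-4. scale_parameter/relative_step are disabled so that
# the explicit lr is actually applied (my assumption about the intended
# Adafactor configuration; the paper does not spell these flags out).
optimizer = Adafactor(
    model.parameters(),
    lr=5e-4,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)
```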
Data:
For the data, I first used the training data provided by SirNeural and evaluated the model on MMLU. When I sampled the 5 datasets equally (i.e. cot, flanv2, t0, dialog, niv2), I got 45% 5-shot accuracy on MMLU, which is similar to the without-mixture-balancing result in the paper. However, after I mixed the data with the suggested rates here, the accuracy did not improve (44%).
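My mixing code samples across the five submixtures with fixed probabilities, roughly as below. This is a sketch using HuggingFace `datasets`; the rate values and the `*.jsonl` file names are placeholders (I substituted the suggested rates from this repo):

```python
from datasets import load_dataset, interleave_datasets

# Placeholder rates, not the official ones; must sum to 1.0.
RATES = {
    "cot": 0.05,
    "flanv2": 0.40,
    "t0": 0.30,
    "dialog": 0.05,
    "niv2": 0.20,
}

# One JSONL dump per submixture (file names are an assumption
# about how the SirNeural data was saved locally).
subsets = [
    load_dataset("json", data_files=f"{name}.jsonl", split="train")
    for name in RATES
]

mixed = interleave_datasets(
    subsets,
    probabilities=list(RATES.values()),
    seed=42,
    stopping_strategy="all_exhausted",  # keep sampling until every subset is seen
)
```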
Afterwards, I tried the data provided by Enrico Shippole and mixed it following the suggested rates, but the accuracy became worse (42% on MMLU). I also tried a larger batch size (128, to account for example packing) and deduplicating the data (sketch below), neither of which helped.
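The deduplication was a simple exact-match pass over the input/target pairs, roughly like this (the `inputs`/`targets` field names are an assumption about the dump format):

```python
import hashlib

def dedup(examples):
    """Yield examples whose (inputs, targets) pair has not been seen before."""
    seen = set()
    for ex in examples:
        key = hashlib.md5(
            (ex["inputs"] + "\x1e" + ex["targets"]).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            yield ex
```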
Are there any suggestions for reproducing the MMLU result of the released Flan-T5-XL model (49%), or even the result in the paper (52%)? Thanks a lot.