[Possible PR discuss] Will a PR of training HF model be welcomed? #903

Open
junjzhang opened this issue Feb 28, 2025 · 4 comments

Comments
@junjzhang

junjzhang commented Feb 28, 2025

Hi! We are developing a novel training framework for Reinforcement Learning (RL) on top of TorchTitan. Recently, we added a feature that supports training directly from Hugging Face (HF) models, loading safetensors in an online, sharded fashion. This can substantially cut the cost of adapting a new model: all you have to do is implement the parallelism-applying function. A rough sketch of the sharded loading follows.
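For illustration only, here is a minimal sketch of the online sharded loading idea, assuming the standard HF `model.safetensors.index.json` multi-shard layout and DTensor-sharded parameters; `load_hf_shards_into_sharded_model` is a hypothetical helper name, not our actual implementation:

```python
# Hypothetical sketch: stream HF safetensors shards one file at a time and
# copy tensors into an already-parallelized (DTensor) model, so the full
# state dict never materializes at once. Buffers and tied weights are
# ignored for brevity.
import json
from pathlib import Path

import torch
from safetensors import safe_open
from torch.distributed.tensor import DTensor, distribute_tensor  # PyTorch 2.4+


def load_hf_shards_into_sharded_model(model: torch.nn.Module, ckpt_dir: str) -> None:
    index = json.loads((Path(ckpt_dir) / "model.safetensors.index.json").read_text())
    params = dict(model.named_parameters())
    # Group parameter names by shard file so each shard is opened only once.
    shard_to_keys: dict[str, list[str]] = {}
    for key, shard in index["weight_map"].items():
        shard_to_keys.setdefault(shard, []).append(key)
    for shard, keys in shard_to_keys.items():
        with safe_open(str(Path(ckpt_dir) / shard), framework="pt", device="cpu") as f:
            for key in keys:
                param = params.get(key)
                if param is None:
                    continue
                full = f.get_tensor(key)
                if isinstance(param.data, DTensor):
                    # Re-shard the full tensor to match this parameter's
                    # placements, then copy only the local portion.
                    param.data.copy_(
                        distribute_tensor(full, param.data.device_mesh, param.data.placements)
                    )
                else:
                    param.data.copy_(full)
```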
Given this, I wonder whether a PR with the relevant code and an example of training Hugging Face's Llama model would be welcome. I think this addition would benefit many in the community.
By the way, in my testing the HF Llama model achieves competitive TPS (tokens per second) compared to the model implemented in TorchTitan.

@lessw2020
Contributor

lessw2020 commented Feb 28, 2025

Hi @junjzhang - I can only speak for myself, but generally anything that helps Titan enable RL-type training would be of significant interest.
We are also opening up a new "experimental" folder with the idea of giving more contributions a home ... so that's another angle that may help your PR land. The first PR landing there also uses HF aspects for reference (see /~https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/deepseek_v3/attn_mask_utils.py).

Thus, while I don't think anyone can say an unseen PR will 100% be accepted, I can say it would definitely be of interest, and I think it would be worth the effort to post the PR so it can be reviewed/discussed/considered for inclusion.
Thanks very much for opening up the discussion!
Maybe @tianyu-l can weigh in here as well.

@junjzhang
Author

Thanks for replying! I'll clean up my code and open a draft PR to the experiments dir first.

@tianyu-l
Contributor

tianyu-l commented Mar 2, 2025

Hey @junjzhang thanks for proposing! We agree this feature is good to have.

As @lessw2020 suggested, let's create a new folder hosting HF training under the experiments folder, which would:

  1. load HF model weights
  2. showcase a training example by "implementing the parallelism applying function" and reusing TrainSpec (see the sketch after this list)
  3. support converting weights back to HF format
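
To make step 2 concrete, here is a minimal sketch of what a parallelism-applying function for an HF Llama could look like, assuming FSDP2's per-module `fully_shard` (importable from `torch.distributed.fsdp` in recent PyTorch) and transformers' `LlamaForCausalLM` module layout; `parallelize_hf_llama` is an illustrative name, not a torchtitan API:

```python
# A minimal sketch of the "parallelism applying function" for an HF Llama,
# assuming FSDP2 (fully_shard) and the transformers LlamaForCausalLM layout
# (model.model.layers holds the transformer blocks).
import torch
from torch.distributed.device_mesh import DeviceMesh
from torch.distributed.fsdp import fully_shard


def parallelize_hf_llama(model: torch.nn.Module, mesh: DeviceMesh) -> torch.nn.Module:
    # Shard each transformer block individually so parameters are gathered
    # layer by layer during forward/backward.
    for layer in model.model.layers:
        fully_shard(layer, mesh=mesh)
    # Shard the remaining parameters (embeddings, final norm, lm_head) at the root.
    fully_shard(model, mesh=mesh)
    return model
```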

Relevant discussions:

Maybe we can work on this project with other people who've shown interest and made offline progress.
cc: @yzhangcs @neeldani @huyiwen @bkchang

@junjzhang
Author


I've finished features 1 and 2, and I think feature 3 can be implemented fairly easily by reusing PreTrainedModel's weight-saving utilities (e.g. save_pretrained); a rough sketch is below. I'll try to clean up the relevant code and open a PR this week. BTW, this feature will introduce extra dependencies such as transformers. How would you expect those to be handled in the experiments dir?
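
For what it's worth, a minimal sketch of that conversion, assuming the model is a transformers `PreTrainedModel` whose parameters are DTensors after parallelization; `save_as_hf_checkpoint` is an illustrative name, not a finalized API:

```python
# Hypothetical sketch of feature 3: gather full tensors from the sharded
# model, then reuse transformers' PreTrainedModel.save_pretrained with an
# explicit state_dict so the checkpoint lands in HF format.
import torch
from torch.distributed.tensor import DTensor


def save_as_hf_checkpoint(model, out_dir: str) -> None:
    state_dict = {}
    for key, value in model.state_dict().items():
        if isinstance(value, DTensor):
            # Materialize the full tensor on every rank (fine for modest
            # models; very large models would want rank-0-only or
            # per-shard streaming saves instead).
            value = value.full_tensor()
        state_dict[key] = value.cpu()
    # `model` is assumed to be a transformers PreTrainedModel here.
    model.save_pretrained(out_dir, state_dict=state_dict, safe_serialization=True)
```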
