Model init with HuggingFace model #743

neeldani · 2024-12-16T05:45:04Z

I am writing a simple script to run FSDP2 (fully_shard) on the pythia-1b model available on HuggingFace. I am currently running the model on 1 node with 2 devices. I was following the meta-device initialisation from the FSDP2 docs. However, I think there is something wrong with my implementation, since the peak memory usage with FSDP is the same as without FSDP (~1 GB). Further, I get an OOM on my device when I try the pythia-2.8b model. The following snippet shows how I am initialising the model on a meta device using HuggingFace APIs:

model_name = "EleutherAI/pythia-14m"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

for module in model.modules():
    if isinstance(module, GPTNeoXLayer):
        fully_shard(module)

model = fully_shard(model, reshard_after_forward=True)

model = load_checkpoint_and_dispatch(
    model, path_to_safe_tensors
)

This is not very straightforward, since the sharded parameters expect DTensors when the weights are loaded via load_checkpoint_and_dispatch. I am looking for suggestions on a good way to make FSDP2 work with HuggingFace models. I don't think accelerate supports FSDP2 yet.

awgu · 2024-12-16T17:10:29Z

cc: @weifengpy @mori360

neeldani · 2024-12-19T17:27:32Z

👋 Gentle bump on this - mainly to see if there is some workaround for the above issue 👀

mori360 · 2024-12-20T20:57:56Z

> However, I think there is something wrong with my implementation since the peak memory usage with FSDP is the same as without FSDP (~1 GB).

It depends on where the peak memory occurs. If it occurs in fully_shard, the full_state_dict is sharded into a local_state_dict while the full copy is still alive, which increases memory usage (full_state_dict + local_state_dict > full_state_dict).
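
To make the inequality above concrete, here is a rough back-of-the-envelope sketch. The function name, the even-split assumption, and the example figures (1,000,000 parameters, 4 bytes each, 2 ranks) are illustrative assumptions, not measurements from this issue. Activations, gradients, and allocator overhead are ignored.

```python
def peak_param_bytes(num_params: int, bytes_per_param: int, world_size: int) -> dict:
    """Rough per-rank parameter memory for three loading strategies."""
    full = num_params * bytes_per_param   # full_state_dict held on one rank
    local = full // world_size            # this rank's shard (assumes an even split)
    return {
        # No sharding at all: every rank keeps the full copy.
        "full_only": full,
        # Full copy is still alive while the local shards are created.
        "shard_after_full_load": full + local,
        # Shards are materialised directly, e.g. after meta-device init.
        "sharded_from_the_start": local,
    }


# Hypothetical figures, chosen only to keep the arithmetic easy to follow.
est = peak_param_bytes(num_params=1_000_000, bytes_per_param=4, world_size=2)
print(est)
```

Under these assumptions, sharding only pays off in peak memory if the full copy never has to sit next to the shards on the same device.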

> I get an OOM on my device when I try with pythia-2.8b model

Could you give more details on the safe_tensors so that I can reproduce the high memory cost?
Also, could you describe the device flow so that I can follow up when you switch your devices to GPU?

neeldani · 2024-12-23T07:32:57Z

> It depends on where you have the peak memory. If it's on fully_shard, then the full_state_dict would shard to a local_state_dict, causing a greater memory. (full_state_dict + local_state_dict > full_state_dict)

I see. Ideally I am looking for an approach which allows me to load the sharded models on each GPU without loading the full_state_dict.

> Could you give more details on the safe_tensors as I could repro the huge memory cost.

I downloaded the model.safetensors for the pythia-1b model from here. These weights are not sharded.

> Also, could you give a device flow so that I could follow up when you switch your devices to gpu.

I am trying to mimic TorchTitan's implementation, but with a HuggingFace model:

1. Load the empty model on the meta device.
2. Apply FSDP, move the sharded weights to the respective GPUs, and materialise the weights.
3. Re-initialise the sharded weights on each GPU.

This is a simple repro of my implementation which can be run using:

torchrun --nnodes=1 --nproc_per_node=2 reproduce.py

import os

import torch
from torch.distributed import init_process_group, destroy_process_group
from torch.distributed._composable.fsdp import fully_shard
from transformers import AutoConfig, AutoModelForCausalLM
from transformers.models.gpt_neox.modeling_gpt_neox import GPTNeoXLayer
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

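def get_num_params(model: torch.nn.Module, exclude_embedding: bool = False) -> int:
    # Count all parameters; optionally subtract the input embedding table.
    num_params = sum(p.numel() for p in model.parameters())
    if exclude_embedding:
        # HF models expose the input embedding via get_input_embeddings();
        # tok_embeddings is TorchTitan's naming and does not exist on GPTNeoX.
        num_params -= model.get_input_embeddings().weight.numel()
    return num_params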

def setup(local_rank, world_size):
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)
    init_process_group("nccl", rank=local_rank, world_size=world_size)

def load():
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    setup(local_rank, world_size)

    model_name = "EleutherAI/pythia-2.8b"
    config = AutoConfig.from_pretrained(model_name)
    
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(config)
    
    if local_rank == 0:
        print("Load models with empty weights")
        print("Device: ", model.device)
        print("Params: ", get_num_params(model))
        print("Peak mem: ", torch.cuda.max_memory_allocated() / (1024 ** 3))

    for module in model.modules():
        if isinstance(module, GPTNeoXLayer):
            fully_shard(module)
    
    model = fully_shard(model, reshard_after_forward=True)

    if local_rank == 0:
        print("Applied FSDP to the model")
        print("Device: ", model.device)
        print("Peak mem: ", torch.cuda.max_memory_allocated() / (1024 ** 3))
        print("# of params: ", get_num_params(model))

    model.to_empty(device='cuda')

    if local_rank == 0:
        print("Materialized the sharded tensors")
        print("Device: ", model.device)
        print("Peak mem: ", torch.cuda.max_memory_allocated() / (1024 ** 3))
        print("# of params: ", get_num_params(model))

    model = load_checkpoint_and_dispatch(model, "model.safetensors", device_map="auto", no_split_module_classes="GPTNeoXLayer")

if __name__ == "__main__":
    load()

The flow is very similar to TorchTitan's, except that TorchTitan makes an explicit call to re-initialise the weights after materialising them. Since I want to load weights from a pretrained HF model, it's a bit challenging. The above code throws an error at the load_checkpoint_and_dispatch call, since the model expects DTensors as inputs.
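
As a rough illustration of why a full tensor from the checkpoint does not fit a parameter that fully_shard has already sharded: by default FSDP2 splits each parameter along dim 0, so each rank's local tensor holds only a slice of the rows. The helper below is a hypothetical pure-Python sketch of that row split (chunk-style, padding ignored). It is not code from FSDP2 or accelerate.

```python
import math


def dim0_shard_bounds(num_rows: int, world_size: int, rank: int) -> tuple[int, int]:
    """Return the [start, end) row range a rank holds under a chunk-style dim-0 split."""
    chunk = math.ceil(num_rows / world_size)  # rows per rank; trailing ranks may get fewer
    start = min(rank * chunk, num_rows)
    end = min(start + chunk, num_rows)
    return start, end


# Hypothetical example: a weight with 5 rows split over 2 ranks.
bounds = [dim0_shard_bounds(5, 2, r) for r in range(2)]
print(bounds)
```

A loader that tries to copy the full 5-row tensor into one rank's local slice sees a mismatch, which is why the checkpoint tensors have to be distributed as DTensors before they can be assigned.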

mori360 · 2024-12-27T04:24:00Z

> Ideally I am looking for an approach which allows me to load the sharded models on each GPU without loading the full_state_dict

torch.distributed.checkpoint.state_dict.set_model_state_dict can load the sharded model without loading the full_state_dict all at once: it loads parameter by parameter, which avoids the memory peak and helps avoid the OOM.
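
A minimal sketch of what such a call might look like is below. The function name is hypothetical. It assumes the process group is already initialised, the model has been wrapped with fully_shard and materialised (for example with to_empty, as in the repro above), the safetensors package is installed, and the checkpoint's keys match model.state_dict(). None of this has been verified against the pythia checkpoints in this issue.

```python
def load_full_weights_into_sharded_model(model, safetensors_path):
    """Load an unsharded safetensors file into a model already wrapped by fully_shard."""
    # Imports are kept local so that defining this helper has no side effects.
    import torch.distributed as dist
    from safetensors.torch import load_file
    from torch.distributed.checkpoint.state_dict import (
        StateDictOptions,
        set_model_state_dict,
    )

    # Only rank 0 reads the full checkpoint into CPU memory; other ranks pass an empty dict.
    full_sd = load_file(safetensors_path, device="cpu") if dist.get_rank() == 0 else {}

    # Rank 0 broadcasts tensors one by one; each rank keeps only its own shard.
    options = StateDictOptions(full_state_dict=True, broadcast_from_rank0=True)
    set_model_state_dict(model, full_sd, options=options)
```

Buffers that are not stored in the checkpoint (for example non-persistent ones) would still need to be re-initialised after to_empty; this sketch does not handle that.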

However, accelerate.load_checkpoint_and_dispatch does not support sharded models right now: it has no condition such as param_cls.__name__ in ["DTensor"] under which it would distribute the loaded tensor.

@fegin Please correct me if I'm wrong. Also, shall we change the torchtitan flow from model.init_weight() to checkpoint.load(), so that weights are initialised param by param?

fegin · 2025-01-08T05:14:33Z

Yes, @mori360, since you have implemented this feature, the OOM should be avoidable with set_model_state_dict. But the state_dict will need to be loaded with DCP and set_model_state_dict.

Hannibal046 · 2025-02-01T11:47:05Z

Hi, any progress here? What is the best practice to continue pretraining a HF model with torchtitan?

yzhangcs · 2025-02-18T19:09:52Z

@neeldani Regarding your original issue, for now, the easiest approach would be to:

1. Convert your HF model weights to the DCP format and save them in <path>/checkpoint/step-0, replacing <path> with your desired save location. You can follow the instructions in this guide: How to Convert a LLaMA 3 Checkpoint for Use in TorchTitan.
2. Once the weights are converted, resume training directly by setting --training.load_step 0, as you would with a seed checkpoint.
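
The conversion step above could be sketched roughly as follows, for models small enough to load in a single process. The function names are hypothetical, and wrapping the weights under a top-level "model" key is an assumption about what the training code expects. The parameter names in the saved state dict must also match the model definition used for training, which this sketch does not check or remap.

```python
from pathlib import Path


def step_checkpoint_dir(base: str, step: int = 0) -> Path:
    """Build <path>/checkpoint/step-<N>, the layout described in the steps above."""
    return Path(base) / "checkpoint" / f"step-{step}"


def convert_hf_to_dcp(model_name: str, base: str) -> Path:
    """Save a HuggingFace model's weights as a DCP checkpoint (single process)."""
    # Heavy imports are local so the path helper can be used on its own.
    import torch.distributed.checkpoint as dcp
    from transformers import AutoModelForCausalLM

    out_dir = step_checkpoint_dir(base, 0)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Loads the full model into host memory, so this suits small/medium models only.
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Assumption: the trainer looks for the weights under a top-level "model" key.
    dcp.save({"model": model.state_dict()}, checkpoint_id=str(out_dir))
    return out_dir
```

A call such as convert_hf_to_dcp("EleutherAI/pythia-1b", "<path>") would then leave the checkpoint where --training.load_step 0 expects to find it, under the assumptions stated above.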

Does this make sense? @mori360 @fegin @tianyu-l @huyiwen, please correct me if I missed anything.

tianyu-l · 2025-02-19T04:47:40Z

Thanks @yzhangcs

> What is the best practice to continue pretraining a HF model with torchtitan?

I think the key thing is to convert the HF checkpoint into a DCP checkpoint, as this script does: #305 (comment)

I heard that DCP is going to support the HF checkpointing format, but it may take some time to happen.
Related PR for the non-distributed use case: pytorch/pytorch#146352
cc: @fegin @kwen2501 to confirm

yzhangcs · 2025-02-19T06:09:45Z

@tianyu-l I just wrote one for medium/small-sized models, https://github.com/fla-org/flame/blob/main/convert_hf_to_dcp.py,
similar to https://github.com/pytorch/torchtitan/blob/main/scripts/convert_llama_to_dcp.py.
I’m using the converted DCPs to finetune the Qwen model on fineweb-edu, and everything appears to be working as expected so far.

weifengpy assigned mori360 Dec 18, 2024

tianyu-l added the question and bug labels Jan 7, 2025

tianyu-l added the module: checkpoint label Feb 19, 2025

tianyu-l mentioned this issue Mar 2, 2025

[Possible PR discuss] Will a PR of training HF model be welcomed? #903

Open

tianyu-l added the huggingface integration label Mar 2, 2025
