
reload existing llama checkpoints #305

Closed
tianyu-l opened this issue May 3, 2024 · 19 comments · Fixed by #634
Labels: enhancement (New feature or request), release_blocking (Issues that are blocking the milestone / release completion)

Comments

tianyu-l (Contributor) commented May 3, 2024

No description provided.

tianyu-l added the enhancement label on May 3, 2024
Lauler commented May 11, 2024

Is this issue related to loading pretrained Llama2/Llama3 weights and using them as a checkpoint?

I was going to start a separate issue asking for some docs that explain how to convert pretrained weights from HF to torchtitan in order to do continued pretraining. Is that already possible or on the roadmap?

fegin (Contributor) commented May 13, 2024

DCP has format utils to help with the conversion. However, HF conversion should not live in the PyTorch code base.
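
For reference, these format utilities live in torch.distributed.checkpoint.format_utils in recent PyTorch releases. A minimal sketch with illustrative paths (note this only converts the serialization format; it does not rename HF parameter keys to torchtitan's naming):

# Hedged sketch: round-trip between a torch.save file and a DCP checkpoint directory.
# The paths are illustrative.
from torch.distributed.checkpoint.format_utils import (
    dcp_to_torch_save,
    torch_save_to_dcp,
)

# DCP checkpoint directory -> single torch.save file
dcp_to_torch_save("outputs/checkpoint/step-1000", "checkpoint.pt")

# single torch.save file -> DCP checkpoint directory
torch_save_to_dcp("checkpoint.pt", "outputs/checkpoint/step-0")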

tianyu-l (Contributor, Author) commented:

@lessw2020 will connect with HF to see if they can support weight conversion from HF to PyTorch. After that, we may import that into the code or update the tutorial.

rlrs (Contributor) commented May 21, 2024

I have a straightforward script for converting from HF to a DCP checkpoint, if that helps. Most of the script already exists in gpt-fast.

tianyu-l (Contributor, Author) commented:

@rlrs Thanks, please feel free to share it here!

As far as we know, HF is also working on such a script to convert from HF to DCP. As discussed in #335, we should include a script to convert from llama raw weights into DCP (similar to the one here), and it probably should still sit in pytorch/pytorch.

rlrs (Contributor) commented May 24, 2024

Alright, so this is the script I'm using for HF -> DCP. It uses the safetensors weights (but can easily be adapted to load a torch.save instead), which only exist at the root of https://huggingface.co/meta-llama/Meta-Llama-3-8B/tree/main and not under original/. So, as we discussed in #335, some of the weights are permuted compared to the original.
I've been using it just to create a step-0 checkpoint that torchtitan is already set up to start from.

import argparse
import json
import re
from pathlib import Path

import torch
import torch.distributed.checkpoint as DCP
from safetensors import safe_open


@torch.inference_mode()
def convert_hf_checkpoint(
    *,
    checkpoint_dir: Path,
    output_dir: Path,
) -> None:
    # Load the index json that maps each weight name to its safetensors shard.
    model_map_json = checkpoint_dir / "model.safetensors.index.json"
    assert model_map_json.is_file()

    with open(model_map_json, "r") as json_map:
        bin_index = json.load(json_map)

    # Mapping from HF parameter names to torchtitan parameter names.
    # A value of None means the weight is dropped (rotary inv_freq is recomputed at runtime).
    weight_map = {
        "model.embed_tokens.weight": "tok_embeddings.weight",
        "model.layers.{}.self_attn.q_proj.weight": "layers.{}.attention.wq.weight",
        "model.layers.{}.self_attn.k_proj.weight": "layers.{}.attention.wk.weight",
        "model.layers.{}.self_attn.v_proj.weight": "layers.{}.attention.wv.weight",
        "model.layers.{}.self_attn.o_proj.weight": "layers.{}.attention.wo.weight",
        "model.layers.{}.self_attn.rotary_emb.inv_freq": None,
        "model.layers.{}.mlp.gate_proj.weight": "layers.{}.feed_forward.w1.weight",
        "model.layers.{}.mlp.up_proj.weight": "layers.{}.feed_forward.w3.weight",
        "model.layers.{}.mlp.down_proj.weight": "layers.{}.feed_forward.w2.weight",
        "model.layers.{}.input_layernorm.weight": "layers.{}.attention_norm.weight",
        "model.layers.{}.post_attention_layernorm.weight": "layers.{}.ffn_norm.weight",
        "model.norm.weight": "norm.weight",
        "lm_head.weight": "output.weight",
    }
    bin_files = {checkpoint_dir / name for name in bin_index["weight_map"].values()}

    # Read every shard into a single CPU state dict.
    merged_result = {}
    for file in sorted(bin_files):
        with safe_open(file, framework="pt", device="cpu") as f:
            for k in f.keys():
                merged_result[k] = f.get_tensor(k)

    # Rename HF keys to torchtitan keys, substituting the layer number back in.
    final_result = {}
    for key, value in merged_result.items():
        if "layers" in key:
            abstract_key = re.sub(r"(\d+)", "{}", key)
            layer_num = re.search(r"\d+", key).group(0)
            new_key = weight_map[abstract_key]
            if new_key is None:
                continue
            new_key = new_key.format(layer_num)
        else:
            new_key = weight_map[key]

        final_result[new_key] = value

    # Write the renamed state dict as a DCP checkpoint under the top-level "model" key.
    output_dir.mkdir(parents=True, exist_ok=True)
    storage_writer = DCP.filesystem.FileSystemWriter(output_dir)
    DCP.save({"model": final_result}, storage_writer=storage_writer)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Convert a HuggingFace llama checkpoint to DCP format.")
    parser.add_argument("--checkpoint", type=Path, required=True)
    parser.add_argument("--output", type=Path, required=True)

    args = parser.parse_args()
    convert_hf_checkpoint(
        checkpoint_dir=args.checkpoint,
        output_dir=args.output,
    )
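
For anyone trying this, a hypothetical invocation (script name and paths are illustrative, not part of torchtitan): save it as convert_hf_to_dcp.py and run `python convert_hf_to_dcp.py --checkpoint <local Meta-Llama-3-8B snapshot dir> --output <job dump_folder>/checkpoint/step-0`, so that the trainer finds the converted weights as its latest checkpoint.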

kxgong commented Jun 9, 2024

> Alright, so this is the script I'm using for HF -> DCP. [...]

Thanks for sharing.

bkchang commented Jun 20, 2024

Is there a conversion in the other direction, i.e. converting a DCP checkpoint to an HF model? I found the util dcp_to_torch_save but am not sure how to go from there to an HF model.

tianyu-l (Contributor, Author) commented:

@bkchang From the HF website, there's a script to convert llama weights to HF format.

bkchang commented Jun 24, 2024

@tianyu-l Thanks for the comment. Unfortunately, that script is for converting a llama model in the format in which it was first uploaded by the llama team. The script thus requires input files like params.json and tokenizer.model, which torchtitan doesn't generate. What I would like to know is how to go from torchtitan output weights to an HF model. Thank you.
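
For reference, a rough, untested sketch of the reverse path: dump the DCP checkpoint to a torch.save file, rename keys back to HF names by inverting the weight_map from the conversion script above, and write a safetensors file. The config.json and tokenizer files still have to come from the base HF repo, and the rope-permutation caveat from #335 applies depending on which layout the weights are in. Paths and filenames below are illustrative:

# Hedged, illustrative sketch (not an official torchtitan utility): DCP -> HF-style safetensors.
# Assumes the "model" entry of the checkpoint holds torchtitan-named weights.
import re

import torch
from safetensors.torch import save_file
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

# Inverse of the HF -> torchtitan name mapping used in the conversion script above.
to_hf_map = {
    "tok_embeddings.weight": "model.embed_tokens.weight",
    "layers.{}.attention.wq.weight": "model.layers.{}.self_attn.q_proj.weight",
    "layers.{}.attention.wk.weight": "model.layers.{}.self_attn.k_proj.weight",
    "layers.{}.attention.wv.weight": "model.layers.{}.self_attn.v_proj.weight",
    "layers.{}.attention.wo.weight": "model.layers.{}.self_attn.o_proj.weight",
    "layers.{}.feed_forward.w1.weight": "model.layers.{}.mlp.gate_proj.weight",
    "layers.{}.feed_forward.w3.weight": "model.layers.{}.mlp.up_proj.weight",
    "layers.{}.feed_forward.w2.weight": "model.layers.{}.mlp.down_proj.weight",
    "layers.{}.attention_norm.weight": "model.layers.{}.input_layernorm.weight",
    "layers.{}.ffn_norm.weight": "model.layers.{}.post_attention_layernorm.weight",
    "norm.weight": "model.norm.weight",
    "output.weight": "lm_head.weight",
}

# Collapse the DCP checkpoint directory into a single torch.save file, then load it.
dcp_to_torch_save("outputs/checkpoint/step-1000", "titan_checkpoint.pt")
state = torch.load("titan_checkpoint.pt", map_location="cpu", weights_only=False)["model"]

hf_state = {}
for key, value in state.items():
    if "layers" in key:
        abstract_key = re.sub(r"(\d+)", "{}", key)
        layer_num = re.search(r"\d+", key).group(0)
        if abstract_key not in to_hf_map:  # skip non-parameter entries
            continue
        new_key = to_hf_map[abstract_key].format(layer_num)
    else:
        if key not in to_hf_map:  # e.g. a precomputed rope buffer
            continue
        new_key = to_hf_map[key]
    hf_state[new_key] = value.contiguous()

# Single-file output; larger models would need sharding plus an index json.
# Weights only: config.json / tokenizer files must still come from the base HF repo.
save_file(hf_state, "model.safetensors", metadata={"format": "pt"})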

casper-hansen (Contributor) commented:

An example of how to reload the pretrained weights would be nice once we have the weights in DCP format (e.g. for continued pretraining).

tianyu-l (Contributor, Author) commented:

> An example of how to reload the pretrained weights would be nice once we have the weights in DCP format (e.g. for continued pretraining).

cc: @wz337

rlrs (Contributor) commented Oct 17, 2024

> An example of how to reload the pretrained weights would be nice once we have the weights in DCP format (e.g. for continued pretraining).

Save the DCP checkpoint as step-0 and it will be loaded at the beginning of training.
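
Concretely (hedging on exact config defaults): torchtitan looks for checkpoints under <job.dump_folder>/<checkpoint.folder>/step-<N> when checkpointing is enabled in the training config, so a converted checkpoint written into a step-0 directory there is treated as the latest checkpoint and loaded before training starts.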

soumik-kanad commented Oct 17, 2024

@rlrs Thank you so much for the script for conversion. But I'm slightly confused about one thing, i.e. the need for permutation on my side:

  1. When using your script here (which uses the root weights and not original/), do we still need to update apply_rotary_emb() according to this post or not?
  2. Or do we need to download the original llama weights and use the default apply_rotary_emb() function as written in this repo?

casper-hansen (Contributor) commented:

@soumik-kanad you will have to permute the weights to their original format if you want to use the current implementation.

I would appreciate it if the torchtitan team could show what they think is the best way to do continued pretraining that's not hacky. Ideally, you should just be able to load in the original llama torch weights.

@tianyu-l
Copy link
Contributor Author

@casper-hansen

> Ideally, you should just be able to load in the original llama torch weights.

This is definitely something we should support. There have been a lot of asks for it, but we are still trying to find the bandwidth to work on it. Alternatively, please feel free to make PRs for it; we can help review.

cc: @wz337 @fegin @wconstab @gnadathur

rlrs (Contributor) commented Oct 17, 2024

> @rlrs Thank you so much for the script for conversion. But I'm slightly confused about one thing, i.e. the need for permutation on my side:
>
> 1. When using your [script here](/~https://github.com/pytorch/torchtitan/issues/305#issuecomment-2129251951) (which uses the root and not the `original`) do we still need to update the `apply_rotary_emb()` according to [this post](/~https://github.com/pytorch/torchtitan/issues/335#issue-2298324053) or not?
> 2. Or do we need to download the original llama weights and use the default `apply_rotary_emb()` function as written in this repo?

You either use the original weights with the rope implementation as it's implemented in the original llama code (and here in torchtitan), or you use the converted HF weights and the HF rope implementation that you also link to.

It should be relatively easy to change the script I posted to load from the original llama weights. I can do it soon if no one else manages to before I get around to it. @tianyu-l or @casper-hansen, how would you want this supported in the repo? To me it already seems pretty straightforward; unsure what else is needed.
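
For anyone adapting the script in the meantime, here is a hedged sketch of undoing the q/k permutation applied by transformers' convert_llama_weights_to_hf.py, so that HF-format weights line up with the original llama / torchtitan rope implementation. The function name is made up for illustration, and n_heads / n_kv_heads are assumed to come from the model config:

import torch

def hf_to_llama_permute(w: torch.Tensor, n_heads: int) -> torch.Tensor:
    # Inverse of the permutation used when Meta weights were converted to HF format:
    # de-interleave the two rotary halves back into the original ordering.
    dim1, dim2 = w.shape
    return (
        w.view(n_heads, 2, dim1 // n_heads // 2, dim2)
        .transpose(1, 2)
        .reshape(dim1, dim2)
    )

# Illustrative use inside the renaming loop of the conversion script above:
# if "wq" in new_key:
#     value = hf_to_llama_permute(value, n_heads)      # e.g. 32 for Llama 3 8B
# elif "wk" in new_key:
#     value = hf_to_llama_permute(value, n_kv_heads)   # e.g. 8 for Llama 3 8B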

tianyu-l (Contributor, Author) commented:

@rlrs It would be great if you could help contribute such a script, to convert checkpoints from the original llama weights to DCP format.

I think we can put it under a scripts folder, with input / output directories as args. It might be fine to also support converting from the HF Transformers format if the majority of the code can be shared; if that's the case we can add a configurable option to choose between original and HF.

Besides, we need to create a (short) tutorial (maybe just in /~https://github.com/pytorch/torchtitan/blob/main/docs/checkpoint.md) to illustrate how to convert and load, and possibly add unit tests under the test folder.

tianyu-l added this to the torchtitan release 1.0 milestone on Oct 18, 2024
jaysonfrancis (Contributor) commented:

Happy to help on this if needed. Also confirming small deltas on the K/Q weights between original <-> HF.

gnadathur added the release_blocking label on Oct 22, 2024
mori360 pushed a commit to mori360/torchtitan that referenced this issue Nov 26, 2024
Closes pytorch#305.

Just wanted to get this out here quickly. The script is very simple since the weights are already in the completely correct format, names and everything. All of the complexity is avoided by not using HF, so I believe that any functionality relating to HF should live on their side. However, I do have a DCP -> HF export script which might be useful for some people, in case HF does not have/add one.

I'll be happy to add any needed documentation or tests.