Convert Orbax ckpt to HuggingFace #581
base: main
Conversation
Change LGTM, but without a test I'm skeptical. I don't think this needs to be tested exhaustively, but how could we test it somewhat?
@rwitten I have tested locally, but were you thinking of running this at a nightly cadence?
@A9isha Hello, do you by any chance have a script that does the opposite, converting HF to Orbax?
We have the script llama_or_mistral_ckpt.py to convert the original PyTorch Llama2 checkpoint that Meta provides into a MaxText checkpoint. You can see the usage here for Llama2-7b, for example.
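For reference, a hedged sketch of how that script is typically invoked (the flag names and paths below are assumptions drawn from the MaxText README, not from this thread, so check the script's argument parser before relying on them):

```shell
# Convert Meta's original Llama2-7b PyTorch checkpoint into a MaxText (Orbax) checkpoint.
# NOTE: --base-model-path, --maxtext-model-path, and --model-size are assumed flag
# names; the paths are placeholders.
python3 MaxText/llama_or_mistral_ckpt.py \
  --base-model-path /path/to/meta-llama2-7b \
  --maxtext-model-path gs://your-bucket/llama2-7b-maxtext \
  --model-size llama2-7b
```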
Hi @A9isha, I found two bugs in your conversion code; I have fixed them and validated the weights converted from the MaxText version of Llama3-8B against the HF ones. The first one is the
The second bug is related to Q and K. I understand it's easy to make mistakes here because the original LLaMA, LLaMA-HF, and MaxText all store these tensors differently; the correct approach is to first reverse to the original LLaMA weight layout and then convert to the HF layout:
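A hedged NumPy sketch of that Q/K re-layout (the function names and shapes are illustrative, not taken from the PR; HF's convert_llama_weights_to_hf.py applies an equivalent permutation of the rotary halves):

```python
import numpy as np

def permute_to_hf(w: np.ndarray, n_heads: int) -> np.ndarray:
    """Re-interleave the rotary halves of a Q/K weight from the original
    LLaMA layout to the HF layout. w has shape (dim1, dim2)."""
    dim1, dim2 = w.shape
    return (w.reshape(n_heads, dim1 // n_heads // 2, 2, dim2)
             .transpose(0, 2, 1, 3)
             .reshape(dim1, dim2))

def unpermute_from_hf(w: np.ndarray, n_heads: int) -> np.ndarray:
    """Inverse of permute_to_hf: recover the original LLaMA layout
    from an HF-style Q/K weight."""
    dim1, dim2 = w.shape
    return (w.reshape(n_heads, 2, dim1 // n_heads // 2, dim2)
             .transpose(0, 2, 1, 3)
             .reshape(dim1, dim2))
```

The round trip is the identity, which makes an easy sanity check when validating converted weights against a known-good HF checkpoint.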
I think this script is fine, and I have been using it quite a lot. It should be updated for Llama3.1 though (whenever that is merged). And maybe also the 70B models?
Any chance this can be merged, @A9isha?
No description provided.