Add OLMo model family #29890

2015aroras · 2024-03-27T02:11:12Z

What does this PR do?

This PR adds the OLMo model family to transformers. A base OLMoModel and a casual LM OLMoForCausalLM are implemented. The models are already present in HF Hub (e.g. allenai/OLMo-7B), and this implementation is compatible with the checkpoints in the Hub.

UPDATE: The current version of the PR is not compatible with the old OLMo models in HF Hub. New models have been uploaded to HF Hub (e.g. allenai/OLMo-7B-hf) to support this PR.

Fixes #29885

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

…ormat

2015aroras · 2024-03-27T03:00:33Z

@ArthurZucker and @younesbelkada for general guidance and/or review, since this is a PyTorch text model.

ArthurZucker

Hey! Thanks for submitting the PR!
The code seems in a very research state, let's try to get it to transformers philosophy.
My main query would be: what are the main differences with MixtralMoe ? (Llama + MoE) ? Only include differences for trained checkpoints (I see both alibi and rotary, all of them use both? Or only one was kept?)

The fast tokenizer also looks very similar to GptNeoX / Bloom one, any reason to have a new one?

2015aroras · 2024-03-27T17:45:56Z

Hey! Thanks for submitting the PR! The code seems in a very research state, let's try to get it to transformers philosophy. My main query would be: what are the main differences with MixtralMoe ? (Llama + MoE) ? Only include differences for trained checkpoints (I see both alibi and rotary, all of them use both? Or only one was kept?)

The fast tokenizer also looks very similar to GptNeoX / Bloom one, any reason to have a new one?

Thanks for having a look!

In a general sense, I tried to stay as faithful to OLMo as I could while following general transformers principles (all model code in 1 file, use "Copied from" with re-usable logic, etc.).

Regarding tokenization, OLMo adds an eos token to the end of inputs that don't have them. The GptNeoX and Bloom ones don't. Llama's tokenizer does, but it also has a slow tokenizer and I couldn't figure out how to convert the fast OLMo tokenizer to a slow sentencepiece one. My current implementation copies a lot from the Llama fast tokenizer.

In terms of the model, OLMo doesn't have MoE whereas MixtralMoe does. I would have to figure out if I could use MixtralMoe while avoiding the MoE functionality. OLMo is more directly map-able to Llama; the model architectures mostly map 1 to 1, and the main discrepancies in output come from the 2 models using different precision for some operations.

The official OLMo models use rope and at most 1 of rope and alibi is used in a run (i.e. no alibi), but we hope to release checkpoints from our experiments at some point. Some of those would use alibi. I think it would be nice if transformers could support those checkpoints too.

@dirkgr FYI

ArthurZucker · 2024-03-28T02:01:09Z

Hey! Sorry I might have misunderstood (thought I saw some MoE things here).
Alright when I mean research I mean there are a lot of things that we never do in transformers.

Tokenizer, we can update bloom / Gpneox to support this option as IMO it makes a lot of sense! No need for a slow one 😉
The modeling code does not seem to include a lot of # Copied from, would recommend you to look at Gemma modeling for a good example! Basically we want to avoid all the autocasting code when it is not necessary. we do checkpointing a bit differently and we want to have the least number of code paths as possible.
For the config, all of the different enums for layer norms and act fn, we already have the from ...activations import ACT2FN and the ALL_LAYER_NORMS which should be used.
BufferCache is not really something we use, and I am not sure we had issue with the community after the last Llama fixes!
k_norm and q_norm are used in Persimmon so a good place to copy from as well!
_scaled_dot_product_attention should follow the classing we use! So xxxxSdpaAttention
if clip_qkv is always true let's not have a if else!
attention mask logic should use the _update_causal_mask from Llama and have a similar logic
We don't use reset_parameters or device let's try to get closer to Llama!
We don't incude get_fsdp_wrap_policy and fsdp specifics in transformers
Let's add some generation tests!

2015aroras · 2024-03-28T06:57:30Z

Just want to clarify my thoughts before I try making changes, so I'm expressing my understanding of what I should do below. Please correct me where I am wrong.

Regarding the tokenizer, if possible I should update one of the existing tokenizers and then use that tokenizer directly (similar to how mamba appears to use GPTNeoXTokenizerFast).

The comments regarding modeling code suggest that I should aim to re-use existing code using #Copied from and any existing functions. More generally, I should aim to be more "transformers-like" than faithful to the original code. I should try to make the code run like OLMo (produce the same output) but look like transformers.

Assuming it were possible, the ideal solution would be for me to use some existing model and convert OLMo checkpoints to work with that model.

Regarding copying existing code vs being faithful, what should I do when some operation (say, RoPE) exists in transformers but is performed in a different precision (e.g. float32 vs bfloat16) in OLMo compared to the transformer code? Do I use the transformers code and disregard the "small" error, or write separate code that matches the original OLMo behavior?

ArthurZucker · 2024-03-28T07:50:22Z

Assuming it were possible, the ideal solution would be for me to use some existing model and convert OLMo checkpoints to work with that model.

Yes! This way it's easier for all the community to use the model, and keeps the diff very clear and clean!

For the tokenizer, if GPTNeoXTokenizerFast does not have add_eos or add_bos, we can add it but in a backward compatible way!

The comments regarding modeling code suggest that I should aim to re-use existing code using #Copied from and any existing functions. More generally, I should aim to be more "transformers-like" than faithful to the original code. I should try to make the code run like OLMo (produce the same output) but look like transformers.

Exactly 💯

For ROPE, we should always compute it in float32 that is from experiences and long debugging.
Otherwise you can use the code to match original Olmo behavior where it is needed of course! We aim for 1e-4 logits equivalence!

molbap · 2024-04-13T07:12:59Z

src/transformers/models/olmo/configuration_olmo.py

+        initializer_range=0.02,
+        use_cache=True,
+        pad_token_id=1,
+        bos_token_id=None,


is it wanted that default config do not set the bos_token_id?

OLMo's config class (in the original Github) doesn't have a bos token and so no OLMo checkpoints use bos tokens to my knowledge. Thus setting no bos_token_id is intentional.

molbap · 2024-04-13T07:20:09Z

I think the PR is in a good state. Thanks for adding this! pinging @ArthurZucker for a final review :)

This reverts commit 4df56a4.

This reverts commit 9ff65a4.

…g end quotes

2015aroras · 2024-04-15T23:38:51Z

@molbap @ArthurZucker All checks are passing! Now it's just a matter of reviews

ArthurZucker

Good state!

camel casing needs to be used on a ll classes!
the main diff is clamping, it should not be an if else, because else it's a llama model 🤗

ArthurZucker · 2024-04-16T20:12:06Z

src/transformers/models/auto/configuration_auto.py

@@ -174,6 +174,7 @@
        ("nllb-moe", "NllbMoeConfig"),
        ("nougat", "VisionEncoderDecoderConfig"),
        ("nystromformer", "NystromformerConfig"),
+        ("olmo", "OLMoConfig"),


Suggested change

("olmo", "OLMoConfig"),

("olmo", "OlmoConfig"),

we need camel casing on all classes

src/transformers/models/gpt_neox/tokenization_gpt_neox_fast.py

ArthurZucker · 2024-04-16T20:14:20Z

src/transformers/models/llama/modeling_llama.py

let's not modify an unrelated file as important as Llama! Remove the copied from and for the forward only! / use # Ignore copy

ArthurZucker · 2024-04-16T20:14:49Z

src/transformers/models/olmo/configuration_olmo.py

+        pretraining_tp (`int`, *optional*, defaults to 1):
+            Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
+            document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism) to understand more about it. This value is
+            necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
+            issue](https://github.com/pytorch/pytorch/issues/76232).


Let's get rid of that no?