
Commit 04cebe6

Merge remote-tracking branch 'refs/remotes/origin/shared-experts' into shared-experts
2 parents 7a0e3c2 + 90d035f commit 04cebe6

40 files changed: +4787 -446 lines

docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile

Lines changed: 5 additions & 5 deletions
@@ -1,11 +1,11 @@
-FROM rocm/dev-ubuntu-22.04:5.6
+FROM rocm/dev-ubuntu-22.04:6.3
 LABEL maintainer="Hugging Face"
 
 ARG DEBIAN_FRONTEND=noninteractive
-ARG PYTORCH='2.1.1'
-ARG TORCH_VISION='0.16.1'
-ARG TORCH_AUDIO='2.1.1'
-ARG ROCM='5.6'
+ARG PYTORCH='2.5.1'
+ARG TORCH_VISION='0.20.0'
+ARG TORCH_AUDIO='2.5.0'
+ARG ROCM='6.3'
 
 RUN apt update && \
 apt install -y --no-install-recommends \

docs/source/en/agents_advanced.md

Lines changed: 1 addition & 1 deletion
@@ -162,7 +162,7 @@ agent.run(
 improved_prompt could be "A bright blue space suit wearing rabbit, on the surface of the moon, under a bright orange sunset, with the Earth visible in the background"
 
 Now that I have improved the prompt, I can use the image generator tool to generate an image based on this prompt.
->>> Agent is executing the code below:
+=== Agent is executing the code below:
 image = image_generator(prompt="A bright blue space suit wearing rabbit, on the surface of the moon, under a bright orange sunset, with the Earth visible in the background")
 final_answer(image)
 ```

docs/source/en/chat_templating.md

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@ Let's make this concrete with a quick example using the `mistralai/Mistral-7B-In
 ... ]
 
 >>> tokenizer.apply_chat_template(chat, tokenize=False)
-"<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]"
+"<s> [INST] Hello, how are you? [/INST] I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]"
 ```
 
 Notice how the tokenizer has added the control tokens [INST] and [/INST] to indicate the start and end of
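
For reference, the call whose output string changed in this hunk can be reproduced with a short snippet. This is a minimal sketch, assuming the truncated checkpoint name in the hunk header refers to a Mistral-7B-Instruct model ("mistralai/Mistral-7B-Instruct-v0.1" is used here as a stand-in) and that you have access to it on the Hub:

```python
from transformers import AutoTokenizer

# Assumption: stand-in checkpoint for the truncated name in the hunk header above.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

# tokenize=False returns the rendered template string (the before/after values in the diff)
# rather than token ids.
print(tokenizer.apply_chat_template(chat, tokenize=False))
```

The exact whitespace around the [INST] markers depends on the chat template shipped with the tokenizer, which is what this documentation update tracks.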

docs/source/en/generation_strategies.md

Lines changed: 10 additions & 11 deletions
@@ -231,7 +231,7 @@ to check if the text is machine-generated (outputs `True` for machine-generated
 >>> detector = WatermarkDetector(model_config=model.config, device="cpu", watermarking_config=watermarking_config)
 >>> detection_out = detector(out, return_dict=True)
 >>> detection_out.prediction
-array([True, True])
+array([ True, True])
 ```
 
 
@@ -269,7 +269,7 @@ dimension you can act upon, in addition to selecting a decoding strategy. Popula
 >>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
 >>> outputs = model.generate(**inputs)
 >>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
-['I look forward to seeing you all again!\n\n\n\n\n\n\n\n\n\n\n']
+['I look forward to seeing you all again!\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n']
 ```
 
 ### Contrastive search
@@ -445,7 +445,7 @@ To enable assisted decoding, set the `assistant_model` argument with a model.
 >>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
 >>> outputs = model.generate(**inputs, assistant_model=assistant_model)
 >>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
-['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']
+['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a glass of wine.']
 ```
 
 <Tip>
@@ -461,7 +461,7 @@ If you're using a `pipeline` object, all you need to do is to pass the assistant
 ... model="meta-llama/Llama-3.1-8B",
 ... assistant_model="meta-llama/Llama-3.2-1B", # This extra line is all that's needed, also works with UAD
 ... torch_dtype=torch.bfloat16
->>> )
+... )
 >>> pipe_output = pipe("Once upon a time, ", max_new_tokens=50, do_sample=False)
 >>> pipe_output[0]["generated_text"]
 'Once upon a time, 3D printing was a niche technology that was only'
@@ -488,7 +488,7 @@ just like in multinomial sampling. However, in assisted decoding, reducing the t
 >>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
 >>> outputs = model.generate(**inputs, assistant_model=assistant_model, do_sample=True, temperature=0.5)
 >>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
-['Alice and Bob, a couple of friends of mine, who are both in the same office as']
+['Alice and Bob are two people who are very different, but they are both very good at what they do. Alice']
 ```
 
 We recommend to install `scikit-learn` library to enhance the candidate generation strategy and achieve additional speedup.
@@ -518,7 +518,7 @@ to ensure the new tokens include the correct prompt suffix.
 >>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
 >>> outputs = model.generate(**inputs, assistant_model=assistant_model, tokenizer=tokenizer, assistant_tokenizer=assistant_tokenizer)
 >>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
-['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']
+['Alice and Bob are playing a game. Alice has a set of $n$ integers $a_1, a']
 ```
 
 #### Prompt Lookup
@@ -547,7 +547,7 @@ If the model you're using was trained to do early exit, you can pass
 >>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
 >>> outputs = model.generate(**inputs, assistant_early_exit=4, do_sample=False, max_new_tokens=20)
 >>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
-['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']
+['Alice and Bob are playing a game. Alice has a set of $n$ integers $a_1, a']
 ```
 
 ### DoLa Decoding
@@ -571,10 +571,9 @@ See the following examples for DoLa decoding with the 32-layer LLaMA-7B model.
 >>> import torch
 >>> from accelerate.test_utils.testing import get_backend
 
->>> tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
->>> model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b", torch_dtype=torch.float16)
 >>> device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
->>> model.to(device)
+>>> tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
+>>> model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b", torch_dtype=torch.float16).to(device)
 >>> set_seed(42)
 
 >>> text = "On what date was the Declaration of Independence officially signed?"
@@ -593,7 +592,7 @@ See the following examples for DoLa decoding with the 32-layer LLaMA-7B model.
 # DoLa decoding with contrasting specific layers (layers 28 and 30)
 >>> dola_custom_output = model.generate(**inputs, do_sample=False, max_new_tokens=50, dola_layers=[28,30], repetition_penalty=1.2)
 >>> tokenizer.batch_decode(dola_custom_output[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)
-['\nIt was officially signed on 2 August 1776, when 56 members of the Second Continental Congress, representing the original 13 American colonies, voted unanimously for the resolution for independence. The 2']
+['\nIn 1891, when he was 54 years old, John Jacob Astor founded his empire. He opened a one-man business and spent the next 27 years working 10-hour days. When']
 ```
 
 #### Understanding the `dola_layers` argument
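
Several of the hunks above update the expected outputs of assisted decoding. For context, a minimal end-to-end sketch of that call is shown below; the checkpoint names are taken from the pipeline example in this diff and are assumed to be accessible (both are gated on the Hub), and the prompt and token budget are illustrative only:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: reusing the main/assistant checkpoints from the pipeline hunk above.
checkpoint = "meta-llama/Llama-3.1-8B"
assistant_checkpoint = "meta-llama/Llama-3.2-1B"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Alice and Bob", return_tensors="pt").to(model.device)

# The small assistant model drafts candidate tokens and the main model verifies them;
# the exact continuation can shift between model and library releases, which is why
# these doctest strings are being refreshed.
outputs = model.generate(**inputs, assistant_model=assistant_model, max_new_tokens=20)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```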

docs/source/en/index.md

Lines changed: 1 addition & 0 deletions
@@ -385,6 +385,7 @@ Flax), PyTorch, and/or TensorFlow.
 | [YOLOS](model_doc/yolos) ||||
 | [YOSO](model_doc/yoso) ||||
 | [Zamba](model_doc/zamba) ||||
+| [Zamba2](model_doc/zamba2) ||||
 | [ZoeDepth](model_doc/zoedepth) ||||
 
 <!-- End table-->

docs/source/en/kv_cache.md

Lines changed: 35 additions & 33 deletions
@@ -56,7 +56,7 @@ More concretely, key-value cache acts as a memory bank for these generative mode
 >>> import torch
 >>> from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache
 
->>> model_id = "meta-llama/Llama-2-7b-chat-hf"
+>>> model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
 >>> model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda:0")
 >>> tokenizer = AutoTokenizer.from_pretrained(model_id)
 
@@ -82,7 +82,13 @@ More concretely, key-value cache acts as a memory bank for these generative mode
 ... cache_position = cache_position[-1:] + 1 # add one more position for the next token
 
 >>> print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
-"[INST] Hello, what's your name. [/INST] Hello! My name is LLaMA,"
+```
+```txt
+<|user|>
+Hello, what's your name.
+<|assistant|>
+My name is Sarah.
+<|
 ```
 
 </details>
@@ -132,17 +138,13 @@ Cache quantization can be detrimental in terms of latency if the context length
 >>> import torch
 >>> from transformers import AutoTokenizer, AutoModelForCausalLM
 
->>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
->>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16).to("cuda:0")
+>>> tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
+>>> model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16).to("cuda:0")
 >>> inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)
 
 >>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="quantized", cache_config={"nbits": 4, "backend": "quanto"})
 >>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
-I like rock music because it's loud and energetic. It's a great way to express myself and rel
-
->>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20)
->>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
-I like rock music because it's loud and energetic. I like to listen to it when I'm feeling
+I like rock music because it's a great way to express myself. I like the way it makes me feel, the
 ```
 
 ### Offloaded Cache
@@ -231,14 +233,14 @@ For more examples with Static Cache and JIT compilation, take a look at [StaticC
 >>> import torch
 >>> from transformers import AutoTokenizer, AutoModelForCausalLM
 
->>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
->>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto")
+>>> tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
+>>> model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16, device_map="auto")
 >>> inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
 
 >>> # simply pass the cache implementation="static"
 >>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="static")
 >>> tokenizer.batch_decode(out, skip_special_tokens=True)[0]
-"Hello, my name is [Your Name], and I am a [Your Profession] with [Number of Years] of"
+"Hello, my name is [Your Name] and I am a [Your Position] at [Your Company]. I am writing"
 ```
 
 
@@ -256,7 +258,7 @@ This will use the [`~OffloadedStaticCache`] implementation instead.
 >>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto")
 >>> inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
 
->>> # simply pass the cache implementation="static"
+>>> # simply pass the cache implementation="offloaded_static"
 >>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="offloaded_static")
 >>> tokenizer.batch_decode(out, skip_special_tokens=True)[0]
 "Hello, my name is [Your Name], and I am a [Your Profession] with [Number of Years] of"
@@ -275,14 +277,14 @@ Note that you can use this cache only for models that support sliding window, e.
 >>> import torch
 >>> from transformers import AutoTokenizer, AutoModelForCausalLM, SinkCache
 
->>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
->>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16).to("cuda:0")
+>>> tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")
+>>> model = AutoModelForCausalLM.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B", torch_dtype=torch.float16).to("cuda:0")
 >>> inputs = tokenizer("Yesterday I was on a rock concert and.", return_tensors="pt").to(model.device)
 
 >>> # can be used by passing in cache implementation
 >>> out = model.generate(**inputs, do_sample=False, max_new_tokens=30, cache_implementation="sliding_window")
 >>> tokenizer.batch_decode(out, skip_special_tokens=True)[0]
-"Yesterday I was on a rock concert and. I was so excited to see my favorite band. I was so excited that I was jumping up and down and screaming. I was so excited that I"
+"Yesterday I was on a rock concert and. I was so excited to see my favorite band perform live. I was so happy that I could hardly contain myself. I was jumping up and down and"
 ```
 
 ### Sink Cache
@@ -295,16 +297,16 @@ Unlike other cache classes, this one can't be used directly by indicating a `cac
 >>> import torch
 >>> from transformers import AutoTokenizer, AutoModelForCausalLM, SinkCache
 
->>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
->>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16).to("cuda:0")
+>>> tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
+>>> model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16).to("cuda:0")
 >>> inputs = tokenizer("This is a long story about unicorns, fairies and magic.", return_tensors="pt").to(model.device)
 
 >>> # get our cache, specify number of sink tokens and window size
 >>> # Note that window size already includes sink tokens, so has to be larger
 >>> past_key_values = SinkCache(window_length=256, num_sink_tokens=4)
 >>> out = model.generate(**inputs, do_sample=False, max_new_tokens=30, past_key_values=past_key_values)
 >>> tokenizer.batch_decode(out, skip_special_tokens=True)[0]
-"This is a long story about unicorns, fairies and magic. It is a fantasy world where unicorns and fairies live together in harmony. The story follows a young girl named Lily"
+"This is a long story about unicorns, fairies and magic. It is a story about a young girl named Lily who discovers that she has the power to control the elements. She learns that she can"
 ```
 
 ### Encoder-Decoder Cache
@@ -332,15 +334,15 @@ In case you are using Sink Cache, you have to crop your inputs to that maximum l
 >>> import torch
 >>> from transformers import AutoTokenizer,AutoModelForCausalLM
 >>> from transformers.cache_utils import (
->>> DynamicCache,
->>> SinkCache,
->>> StaticCache,
->>> SlidingWindowCache,
->>> QuantoQuantizedCache,
->>> QuantizedCacheConfig,
->>> )
-
->>> model_id = "meta-llama/Llama-2-7b-chat-hf"
+... DynamicCache,
+... SinkCache,
+... StaticCache,
+... SlidingWindowCache,
+... QuantoQuantizedCache,
+... QuantizedCacheConfig,
+... )
+
+>>> model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
 >>> model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map='auto')
 >>> tokenizer = AutoTokenizer.from_pretrained(model_id)
 
@@ -363,7 +365,7 @@ In case you are using Sink Cache, you have to crop your inputs to that maximum l
 ... messages.append({"role": "assistant", "content": completion})
 
 print(messages)
-[{'role': 'user', 'content': "Hello, what's your name?"}, {'role': 'assistant', 'content': " Hello! My name is LLaMA, I'm a large language model trained by a team of researcher at Meta AI. 😊"}, {'role': 'user', 'content': 'Btw, yesterday I was on a rock concert.'}, {'role': 'assistant', 'content': ' Oh, cool! That sounds like a lot of fun! 🎉 Did you enjoy the concert? What was the band like? 🤔'}]
+[{'role': 'user', 'content': "Hello, what's your name?"}, {'role': 'assistant', 'content': "Hello, I'm AI."}, {'role': 'user', 'content': 'Btw, yesterday I was on a rock concert.'}, {'role': 'assistant', 'content': "I'm sorry to hear that you were on a rock concert yesterday. It sounds like a fun experience, but I'm not capable of experiencing music or concerts. However, I can provide you with some information about rock music and its history. Rock music emerged in the 1950s and 1960s in the United States and Britain, and it quickly gained popularity around the world. Some of the most famous rock bands of all time include The Beatles, The Rolling Stones, Led Zeppelin, and Pink Floyd. Rock music has a distinct sound and style, with elements of blues, country, and folk music. It often features guitar solos, heavy bass lines, and drums. Rock music has had a significant impact on popular culture, influencing genres such as punk rock, heavy metal, and alternative rock."}]
 ```
 
 
@@ -376,7 +378,7 @@ Sometimes you would want to first fill-in cache object with key/values for certa
 >>> import torch
 >>> from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache, StaticCache
 
->>> model_id = "meta-llama/Llama-2-7b-chat-hf"
+>>> model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
 >>> model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda")
 >>> tokenizer = AutoTokenizer.from_pretrained(model_id)
 
@@ -400,7 +402,7 @@ Sometimes you would want to first fill-in cache object with key/values for certa
 ... responses.append(response)
 
 >>> print(responses)
-['<s> You are a helpful assistant. Help me to write a blogpost about travelling.\n\nTitle: The Ultimate Guide to Travelling: Tips, Tricks, and', '<s> You are a helpful assistant. What is the capital of France?\n\nYes, the capital of France is Paris.</s>']
+['<s> You are a helpful assistant. Help me to write a blogpost about travelling. I am excited to share my experiences with you. I have been traveling for the past', '<s> You are a helpful assistant. What is the capital of France? \n\nAnswer: Paris is the capital of France.</s>']
 ```
 
 
@@ -414,8 +416,8 @@ this legacy format, you can seamlessly convert it to a `DynamicCache` and back.
 >>> import torch
 >>> from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache
 
->>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
->>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto")
+>>> tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
+>>> model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16, device_map="auto")
 >>> inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
 
 >>> # `return_dict_in_generate=True` is required to return the cache. `return_legacy_cache` forces the returned cache
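
The last hunk touches the snippet that converts between the legacy cache format and `DynamicCache`. For context, a minimal sketch of that round trip, assuming the TinyLlama checkpoint used throughout this diff and a single accelerator device, looks like this:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)

# `return_dict_in_generate=True` is required to get the cache back;
# `return_legacy_cache=True` forces it into the legacy tuple-of-tuples format.
outputs = model.generate(**inputs, return_dict_in_generate=True, return_legacy_cache=True, max_new_tokens=5)

# Convert the legacy tuple-of-tuples into a DynamicCache and back again.
cache = DynamicCache.from_legacy_cache(outputs.past_key_values)
legacy_cache = cache.to_legacy_cache()
```

The checkpoint swap from Llama-2-7B to TinyLlama in this file only changes the doctest outputs; the cache APIs themselves are untouched.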

docs/source/en/model_doc/glm.md

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ In the following, we demonstrate how to use `glm-4-9b-chat` for the inference. N
 >>> from transformers import AutoModelForCausalLM, AutoTokenizer
 >>> device = "cuda" # the device to load the model onto
 
->>> model = AutoModelForCausalLM.from_pretrained("THUDM/glm-4-9b-chat", device_map="auto")
+>>> model = AutoModelForCausalLM.from_pretrained("THUDM/glm-4-9b-chat", device_map="auto", trust_remote_code=True)
 >>> tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat")
 
 >>> prompt = "Give me a short introduction to large language model."
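
For context, a runnable sketch of the loading pattern this hunk fixes is below. The addition of `trust_remote_code=True` suggests the `THUDM/glm-4-9b-chat` checkpoint relies on custom code hosted on the Hub; the prompt comes from the surrounding doc, while the generation settings are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat", device_map="auto", trust_remote_code=True
)
# Depending on the checkpoint revision, the tokenizer may also need trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat")

prompt = "Give me a short introduction to large language model."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=100)  # illustrative length
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```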
