
Commit

Merge pull request #1 from huggingface/main
merge hf main
rishiraj authored Sep 6, 2024
2 parents e30446f + 6ff6069 commit 92192e8
Showing 132 changed files with 6,345 additions and 590 deletions.
3 changes: 1 addition & 2 deletions SECURITY.md
@@ -36,5 +36,4 @@ Please inspect the code of the tools before passing them to the Agent to protect

## Reporting a Vulnerability

-🤗 Please feel free to submit vulnerability reports to our private bug bounty program at https://hackerone.com/hugging_face. You'll need to request access to the program by emailing security@huggingface.co.
-Note that you'll need to be invited to our program, so send us a quick email at security@huggingface.co if you've found a vulnerability.
+Feel free to submit vulnerability reports to [security@huggingface.co](mailto:security@huggingface.co), where someone from the HF security team will review and recommend next steps. If reporting a vulnerability specific to open source, please note [Huntr](https://huntr.com) is a vulnerability disclosure program for open source software.
2 changes: 1 addition & 1 deletion docker/consistency.dockerfile
@@ -13,4 +13,4 @@ RUN uv pip install --no-cache-dir "git+https://github.com/huggingface/transforme
RUN git lfs install

RUN pip uninstall -y transformers
-RUN apt-get clean && rm -rf /var/lib/apt/lists/* && apt-get autoremove && apt-get autoclean
+RUN apt-get clean && rm -rf /var/lib/apt/lists/* && apt-get autoremove && apt-get autoclean
4 changes: 2 additions & 2 deletions docker/torch-light.dockerfile
@@ -6,6 +6,6 @@ RUN apt-get update && apt-get install -y --no-install-recommends libsndfile1-de
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools
RUN pip install --no-cache-dir 'torch' 'torchvision' 'torchaudio' --index-url https://download.pytorch.org/whl/cpu
-RUN uv pip install --no-deps timm accelerate --extra-index-url https://download.pytorch.org/whl/cpu
-RUN uv pip install --no-cache-dir librosa "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[sklearn,sentencepiece,vision,testing]"
+RUN uv pip install --no-deps timm accelerate --extra-index-url https://download.pytorch.org/whl/cpu
+RUN uv pip install --no-cache-dir librosa "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[sklearn,sentencepiece,vision,testing,tiktoken]"
RUN pip uninstall -y transformers
4 changes: 4 additions & 0 deletions docs/source/en/_toctree.yml
@@ -145,6 +145,8 @@
title: Troubleshoot
- local: gguf
title: Interoperability with GGUF files
- local: tiktoken
title: Interoperability with TikToken files
title: Developer guides
- sections:
- local: quantization/overview
@@ -836,6 +838,8 @@
title: LLaVA-NeXT
- local: model_doc/llava_next_video
title: LLaVa-NeXT-Video
- local: model_doc/llava_onevision
title: LLaVA-Onevision
- local: model_doc/lxmert
title: LXMERT
- local: model_doc/matcha
1 change: 1 addition & 0 deletions docs/source/en/community.md
@@ -67,3 +67,4 @@ This page regroups resources around 🤗 Transformers developed by the community
| [Detect objects in an image with DETR](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DETR/DETR_minimal_example_(with_DetrFeatureExtractor).ipynb) | How to use a trained *DetrForObjectDetection* model to detect objects in an image and visualize attention | [Niels Rogge](https://github.com/NielsRogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/DETR/DETR_minimal_example_(with_DetrFeatureExtractor).ipynb) |
| [Fine-tune DETR on a custom object detection dataset](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DETR/Fine_tuning_DetrForObjectDetection_on_custom_dataset_(balloon).ipynb) | How to fine-tune *DetrForObjectDetection* on a custom object detection dataset | [Niels Rogge](https://github.com/NielsRogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/DETR/Fine_tuning_DetrForObjectDetection_on_custom_dataset_(balloon).ipynb) |
| [Finetune T5 for Named Entity Recognition](https://github.com/ToluClassics/Notebooks/blob/main/T5_Ner_Finetuning.ipynb) | How to fine-tune *T5* on a Named Entity Recognition Task | [Ogundepo Odunayo](https://github.com/ToluClassics) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1obr78FY_cBmWY5ODViCmzdY6O1KB65Vc?usp=sharing) |
| [Fine-Tuning Open-Source LLM using QLoRA with MLflow and PEFT](https://github.com/mlflow/mlflow/blob/master/docs/source/llms/transformers/tutorials/fine-tuning/transformers-peft.ipynb) | How to use [QLoRA](https://github.com/artidoro/qlora) and [PEFT](https://huggingface.co/docs/peft/en/index) to fine-tune an LLM in a memory-efficient way, while using [MLflow](https://mlflow.org/docs/latest/llms/transformers/index.html) to manage experiment tracking | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mlflow/mlflow/blob/master/docs/source/llms/transformers/tutorials/fine-tuning/transformers-peft.ipynb) |
1 change: 1 addition & 0 deletions docs/source/en/gguf.md
@@ -78,6 +78,7 @@ For now the supported model architectures are the architectures that have been v
- LLaMa
- Mistral
- Qwen2
- Qwen2Moe

## Example usage

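The usage section of this file is collapsed in the diff above. As a rough, hedged illustration of what loading a GGUF checkpoint looks like, here is a minimal sketch using the `gguf_file` argument of `from_pretrained`; the repository id and filename are assumptions for illustration, not taken from this commit.

```python
# Minimal sketch: load a GGUF quantized checkpoint with transformers.
# The repo id and gguf filename below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"   # assumed repo
gguf_file = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"    # assumed filename

# GGUF tensors are dequantized into regular torch tensors at load time.
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```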
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -189,6 +189,7 @@ Flax), PyTorch, and/or TensorFlow.
| [LLaVa](model_doc/llava) ||||
| [LLaVA-NeXT](model_doc/llava_next) ||||
| [LLaVa-NeXT-Video](model_doc/llava_next_video) ||||
| [LLaVA-Onevision](model_doc/llava_onevision) ||||
| [Longformer](model_doc/longformer) ||||
| [LongT5](model_doc/longt5) ||||
| [LUKE](model_doc/luke) ||||
20 changes: 10 additions & 10 deletions docs/source/en/kv_cache.md
@@ -51,11 +51,11 @@ More concretely, key-value cache acts as a memory bank for these generative mode


See an example below for how to implement your own generation loop.

```python
>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache

>>> model_id = "meta-llama/Llama-2-7b-chat-hf"
>>> model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda:0")
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
@@ -69,10 +69,10 @@ More concretely, key-value cache acts as a memory bank for these generative mode
>>> max_new_tokens = 10

>>> for _ in range(max_new_tokens):
-... outputs = model(**inputs, cache_position=cache_position, past_key_values=past_key_values, use_cache=True)
+... outputs = model(**inputs, cache_position=cache_position, past_key_values=past_key_values, use_cache=True)
... # Greedily sample one next token
... next_token_ids = outputs.logits[:, -1:].argmax(-1)
-... generated_ids = torch.cat([generated_ids, next_token_ids], dim=-1)
+... generated_ids = torch.cat([generated_ids, next_token_ids], dim=-1)
...
... # Prepare inputs for the next generation step by leaving unprocessed tokens, in our case we have only one new token
... # and expanding attn mask for the new token, as explained above
@@ -222,7 +222,7 @@ before successfully generating 40 beams.

### Static Cache

-Since the "DynamicCache" dynamically grows with each generation step, it prevents you from taking advantage of JIT optimizations. The [`~StaticCache`] pre-allocates
+Since the "DynamicCache" dynamically grows with each generation step, it prevents you from taking advantage of JIT optimizations. The [`~StaticCache`] pre-allocates
a specific maximum size for the keys and values, allowing you to generate up to the maximum length without having to modify cache size. Check the below usage example.

For more examples with Static Cache and JIT compilation, take a look at [StaticCache & torchcompile](./llm_optims#static-kv-cache-and-torchcompile)
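The usage example referred to above is collapsed in this diff. As a minimal, hedged sketch of the pattern, `generate` can be asked for the pre-allocated cache via `cache_implementation="static"`; the checkpoint is reused from the earlier example in this file and the prompt is illustrative.

```python
# Minimal sketch: generate with a pre-allocated static KV cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # reused from the example above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("The theory of special relativity states ", return_tensors="pt").to(model.device)

# "static" pre-allocates the cache up to the prompt length plus max_new_tokens,
# keeping tensor shapes fixed and therefore JIT/compile friendly.
out = model.generate(**inputs, max_new_tokens=20, cache_implementation="static")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```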
@@ -267,7 +267,7 @@ This will use the [`~OffloadedStaticCache`] implementation instead.

As the name suggests, this cache type implements a sliding window over previous keys and values, retaining only the last `sliding_window` tokens. It should be used with models like Mistral that support sliding window attention. Additionally, similar to Static Cache, this one is JIT-friendly and can be used with the same compile techniques as Static Cache.

-Note that you can use this cache only for models that support sliding window, e.g. Mistral models.
+Note that you can use this cache only for models that support sliding window, e.g. Mistral models.


```python
@@ -324,7 +324,7 @@ We have seen how to use each of the cache types when generating. What if you wan

The general format when doing iterative generation is as below. First you have to initialize an empty cache of the type you want, and you can start feeding in new prompts iteratively. Keeping track of dialogue history and formatting can be done with chat templates; read more on that in [chat_templating](./chat_templating)

-In case you are using Sink Cache, you have to crop your inputs to that maximum length because Sink Cache can generate text longer than its maximum window size, but it expects the first input to not exceed the maximum cache length.
+In case you are using Sink Cache, you have to crop your inputs to that maximum length because Sink Cache can generate text longer than its maximum window size, but it expects the first input to not exceed the maximum cache length.


```python
@@ -354,9 +354,9 @@ In case you are using Sink Cache, you have to crop your inputs to that maximum l
... inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
... if isinstance(past_key_values, SinkCache):
... inputs = {k: v[:, -max_cache_length:] for k, v in inputs.items()}
...
...
... input_length = inputs["input_ids"].shape[1]
...
...
... outputs = model.generate(**inputs, do_sample=False, max_new_tokens=256, past_key_values=past_key_values)
... completion = tokenizer.decode(outputs[0, input_length: ], skip_special_tokens=True)
... messages.append({"role": "assistant", "content": completion})
@@ -400,4 +400,4 @@ Sometimes you would want to first fill-in cache object with key/values for certa

>>> print(responses)
['<s> You are a helpful assistant. Help me to write a blogpost about travelling.\n\nTitle: The Ultimate Guide to Travelling: Tips, Tricks, and', '<s> You are a helpful assistant. What is the capital of France?\n\nYes, the capital of France is Paris.</s>']
-```
+```
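The listing for this prefix-caching example is collapsed above; only its printed output is visible. The following is a hedged sketch of the general pattern rather than the original listing, reusing the Llama 2 chat checkpoint and the two prompts implied by the output.

```python
# Hedged sketch: pre-fill a cache with a shared prefix, then reuse it
# for several prompts so the prefix is only processed once.
import copy

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "meta-llama/Llama-2-7b-chat-hf"  # reused from the examples above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prefix = "You are a helpful assistant. "
prefix_inputs = tokenizer(prefix, return_tensors="pt").to(model.device)

# Run the shared prefix once to populate the cache.
prompt_cache = DynamicCache()
with torch.no_grad():
    prompt_cache = model(**prefix_inputs, past_key_values=prompt_cache).past_key_values

responses = []
for prompt in ["Help me to write a blogpost about travelling.", "What is the capital of France?"]:
    new_inputs = tokenizer(prefix + prompt, return_tensors="pt").to(model.device)
    # Deep-copy so every continuation starts from the same pre-filled prefix.
    past_key_values = copy.deepcopy(prompt_cache)
    outputs = model.generate(**new_inputs, past_key_values=past_key_values, max_new_tokens=20)
    responses.append(tokenizer.decode(outputs[0], skip_special_tokens=True))

print(responses)
```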
2 changes: 1 addition & 1 deletion docs/source/en/llm_optims.md
@@ -24,7 +24,7 @@ This guide will show you how to use the optimization techniques available in Tra

During decoding, an LLM computes the key-value (kv) values for each input token and since it is autoregressive, it computes the same kv values each time because the generated output becomes part of the input now. This is not very efficient because you're recomputing the same kv values each time.

-To optimize this, you can use a kv-cache to store the past keys and values instead of recomputing them each time. However, since the kv-cache grows with each generation step and is dynamic, it prevents you from taking advantage of [`torch.compile`](./perf_torch_compile), a powerful optimization tool that fuses PyTorch code into fast and optimized kernels.
+To optimize this, you can use a kv-cache to store the past keys and values instead of recomputing them each time. However, since the kv-cache grows with each generation step and is dynamic, it prevents you from taking advantage of [`torch.compile`](./perf_torch_compile), a powerful optimization tool that fuses PyTorch code into fast and optimized kernels. We have an entire guide dedicated to kv-caches [here](./kv_cache).

The *static kv-cache* solves this issue by pre-allocating the kv-cache size to a maximum value which allows you to combine it with `torch.compile` for up to a 4x speed up. Your speed up may vary depending on the model size (larger models have a smaller speed up) and hardware.
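As a hedged sketch of how the static kv-cache combines with `torch.compile`, the snippet below sets `cache_implementation` on the generation config and compiles the forward pass; the checkpoint and generation settings are illustrative assumptions.

```python
# Hedged sketch: static kv-cache plus torch.compile.
# The checkpoint and settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Pre-allocate the cache so tensor shapes stay fixed across decoding steps...
model.generation_config.cache_implementation = "static"

# ...then compile the forward pass into fused kernels.
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("The theory of special relativity states ", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```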

2 changes: 1 addition & 1 deletion docs/source/en/llm_tutorial_optimization.md
@@ -662,7 +662,7 @@ Using the key-value cache has two advantages:
- Significant increase in computational efficiency as fewer computations are performed compared to computing the full \\( \mathbf{QK}^T \\) matrix. This leads to an increase in inference speed
- The maximum required memory is not increased quadratically with the number of generated tokens, but only increases linearly.

-> One should *always* make use of the key-value cache as it leads to identical results and a significant speed-up for longer input sequences. Transformers has the key-value cache enabled by default when making use of the text pipeline or the [`generate` method](https://huggingface.co/docs/transformers/main_classes/text_generation).
+> One should *always* make use of the key-value cache as it leads to identical results and a significant speed-up for longer input sequences. Transformers has the key-value cache enabled by default when making use of the text pipeline or the [`generate` method](https://huggingface.co/docs/transformers/main_classes/text_generation). We have an entire guide dedicated to caches [here](./kv_cache).
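A minimal, hedged sketch of the default caching behaviour described above, using a small GPT-2 checkpoint for brevity (an assumption, not part of this diff): the cache returned as `past_key_values` is fed back into the next forward pass so earlier keys and values are not recomputed.

```python
# Minimal sketch: reuse the key-value cache across forward passes.
# The checkpoint is an illustrative assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The key-value cache", return_tensors="pt")

with torch.no_grad():
    # First pass processes the whole prompt and returns the cache.
    out = model(**inputs, use_cache=True)
    past_key_values = out.past_key_values

    # The next pass only needs the newly sampled token; keys/values for the
    # prompt come from the cache instead of being recomputed.
    next_token = out.logits[:, -1:].argmax(-1)
    out = model(input_ids=next_token, past_key_values=past_key_values, use_cache=True)
```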
<Tip warning={true}>
