20 changes: 7 additions & 13 deletions docs/source/en/cache_explanation.md
@@ -82,24 +82,18 @@ When you use Transformers' [`Cache`] class, the self-attention module performs s

## Cache storage implementation

-The actual storage of key-value pairs varies between cache implementations. As an example, consider the [`DynamicCache`].
+Caches are structured as a list of layers, where each layer contains a key and value cache. The key and value caches are tensors with the shape `[batch_size, num_heads, seq_len, head_dim]`.

+Layers can be of different types (e.g. `DynamicLayer`, `StaticLayer`, `SlidingWindowLayer`), which mostly changes how sequence length is handled and how the cache is updated.

-In [`DynamicCache`], the key-value pairs are stored as two lists of tensors. Each tensor in the lists have the shape `[batch_size, num_heads, seq_len, head_dim]`.
-- `key_cache`: A list of tensors, one for each layer.
-- `value_cache`: A list of tensors, one for each layer.
+The simplest is a `DynamicLayer` that grows as more tokens are processed. The sequence length dimension (`seq_len`) increases with each new token:

-When new tokens are processed:
-
-1. For each layer, the new key and value states are concatenated with the existing cache.
```py
-self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
-self.value_cache[layer_idx] = torch.cat([self.value_cache[layer_idx], value_states], dim=-2)
+cache.layers[idx].keys = torch.cat([cache.layers[idx].keys, key_states], dim=-2)
+cache.layers[idx].values = torch.cat([cache.layers[idx].values, value_states], dim=-2)
```

-2. The cache grows dynamically as more tokens are processed. The sequence length dimension (`seq_len`) increases with each new token.
-
-3. The cache maintains a count of seen tokens through `self._seen_tokens`. This is updated when the first layer processes a new token.
+Other layers like `StaticLayer` and `SlidingWindowLayer` have a fixed sequence length that is set when the cache is created. This makes them compatible with `torch.compile`. In the case of `SlidingWindowLayer`, existing tokens are shifted out of the cache when a new token is added.
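For fixed-size layers, the easiest way to try this out is the `cache_implementation` argument of `generate()`. A minimal sketch, assuming a recent Transformers release and using `gpt2` as an arbitrary checkpoint:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # arbitrary small model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Static caches preallocate a fixed number of slots", return_tensors="pt")

# "static" preallocates a fixed-length cache, which keeps tensor shapes stable
# and is what makes the cache usable with torch.compile. Models with
# sliding-window attention can request "sliding_window" instead.
outputs = model.generate(**inputs, max_new_tokens=20, cache_implementation="static")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```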

The example below demonstrates how to create a generation loop with [`DynamicCache`]. As discussed, the attention mask is a concatenation of past and current token values and `1` is added to the cache position for the next token.
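A condensed sketch of such a loop, assuming a recent Transformers release where [`DynamicCache`] exposes its per-layer tensors as `cache.layers[idx].keys`/`.values` (as shown above) and using `gpt2` as an arbitrary checkpoint:

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "gpt2"  # arbitrary small model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The key-value cache", return_tensors="pt")
past_key_values = DynamicCache()

input_ids = inputs.input_ids
attention_mask = inputs.attention_mask
cache_position = torch.arange(input_ids.shape[1], dtype=torch.int64)
generated_ids = input_ids

for _ in range(10):
    outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        cache_position=cache_position,
        past_key_values=past_key_values,
        use_cache=True,
    )
    # Greedily pick the next token and append it to the running sequence.
    next_token = outputs.logits[:, -1:].argmax(dim=-1)
    generated_ids = torch.cat([generated_ids, next_token], dim=-1)

    # Extend the attention mask with a 1 for the new token and advance the
    # cache position by one, as described above.
    attention_mask = torch.cat([attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1)
    cache_position = cache_position[-1:] + 1
    input_ids = next_token

# The cached keys grew along `seq_len` by one entry per generated token.
print(past_key_values.layers[0].keys.shape)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```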

@@ -143,7 +137,7 @@ The legacy format is essentially the same data structure but organized different
- The tensors have the same shape `[batch_size, num_heads, seq_len, head_dim]`.
- The format is less flexible and doesn't support features like quantization or offloading.
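As a rough illustration of that layout, the snippet below builds a legacy-style object with made-up sizes (the dimensions are placeholders, not tied to any real model):

```py
import torch

# Placeholder sizes for illustration only.
num_layers, batch_size, num_heads, seq_len, head_dim = 2, 1, 4, 5, 8

# One (key, value) tuple per layer, each tensor of shape
# [batch_size, num_heads, seq_len, head_dim].
legacy_cache = tuple(
    (
        torch.zeros(batch_size, num_heads, seq_len, head_dim),
        torch.zeros(batch_size, num_heads, seq_len, head_dim),
    )
    for _ in range(num_layers)
)
```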

-If your project depends on this legacy format, you can convert between [`DynamicCache`] and a tuple of tuples as shown below with the [`~DynamicCache.from_legacy_cache`] and [`DynamicCache.to_legacy_cache`] functions. This is helpful if you have custom logic for manipulating a cache in a specific format.
+If your project depends on this legacy format, you can convert between [`DynamicCache`] and a tuple of tuples as shown below with the [`~Cache.from_legacy_cache`] and [`Cache.to_legacy_cache`] functions. This is helpful if you have custom logic for manipulating a cache in a specific format.

```py
import torch
21 changes: 0 additions & 21 deletions docs/source/en/internal/generation_utils.md
@@ -366,43 +366,22 @@ A [`Constraint`] can be used to force the generation to include specific tokens
    - validate

[[autodoc]] DynamicCache
-    - update
-    - get_seq_length
-    - reorder_cache
-    - to_legacy_cache
-    - from_legacy_cache

[[autodoc]] QuantizedCache
-    - update
-    - get_seq_length

[[autodoc]] QuantoQuantizedCache

[[autodoc]] HQQQuantizedCache

[[autodoc]] OffloadedCache
-    - update
-    - prefetch_layer
-    - evict_previous_layer

[[autodoc]] StaticCache
-    - update
-    - get_seq_length
-    - reset

[[autodoc]] OffloadedStaticCache
-    - update
-    - get_seq_length
-    - reset

[[autodoc]] HybridCache
-    - update
-    - get_seq_length
-    - reset

[[autodoc]] SlidingWindowCache
-    - update
-    - reset

[[autodoc]] EncoderDecoderCache
    - get_seq_length
7 changes: 7 additions & 0 deletions docs/source/en/model_doc/falcon_mamba.md
@@ -110,6 +110,13 @@ outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

+## FalconMambaCache
+
+[[autodoc]] FalconMambaCache
+    - update_conv_state
+    - update_ssm_state
+    - reset
+
## FalconMambaConfig

[[autodoc]] FalconMambaConfig
7 changes: 7 additions & 0 deletions docs/source/en/model_doc/mamba.md
@@ -115,6 +115,13 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
trainer.train()
```

+## MambaCache
+
+[[autodoc]] MambaCache
+    - update_conv_state
+    - update_ssm_state
+    - reset
+
## MambaConfig

[[autodoc]] MambaConfig
2 changes: 0 additions & 2 deletions src/transformers/__init__.py
@@ -358,7 +358,6 @@
"EncoderDecoderCache",
"HQQQuantizedCache",
"HybridCache",
"MambaCache",
"OffloadedCache",
"OffloadedStaticCache",
"QuantizedCache",
@@ -839,7 +838,6 @@
EncoderDecoderCache,
HQQQuantizedCache,
HybridCache,
-MambaCache,
OffloadedCache,
OffloadedStaticCache,
QuantizedCache,