docs/source/en/cache_explanation.md (7 additions, 13 deletions)
@@ -82,24 +82,18 @@ When you use Transformers' [`Cache`] class, the self-attention module performs s
 
 ## Cache storage implementation
 
-The actual storage of key-value pairs varies between cache implementations. As an example, consider the [`DynamicCache`].
+Caches are structured as a list of layers, where each layer contains a key and value cache. The key and value caches are tensors with the shape `[batch_size, num_heads, seq_len, head_dim]`.
 
+Layers can be of different types (e.g. `DynamicLayer`, `StaticLayer`, `SlidingWindowLayer`), which mostly changes how sequence length is handled and how the cache is updated.
 
-In [`DynamicCache`], the key-value pairs are stored as two lists of tensors. Each tensor in the lists have the shape `[batch_size, num_heads, seq_len, head_dim]`.
-
-`key_cache`: A list of tensors, one for each layer.
-
-`value_cache`: A list of tensors, one for each layer.
+The simplest is a `DynamicLayer` that grows as more tokens are processed. The sequence length dimension (`seq_len`) increases with each new token:
 
-When new tokens are processed:
-
-1. For each layer, the new key and value states are concatenated with the existing cache.
-2. The cache grows dynamically as more tokens are processed. The sequence length dimension (`seq_len`) increases with each new token.
-
-3. The cache maintains a count of seen tokens through `self._seen_tokens`. This is updated when the first layer processes a new token.
+Other layers like `StaticLayer` and `SlidingWindowLayer` have a fixed sequence length that is set when the cache is created. This makes them compatible with `torch.compile`. In the case of `SlidingWindowLayer`, existing tokens are shifted out of the cache when a new token is added.
 
 The example below demonstrates how to create a generation loop with [`DynamicCache`]. As discussed, the attention mask is a concatenation of past and current token values and `1` is added to the cache position for the next token.
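As a quick illustration of the growth behaviour described in the added lines, the minimal sketch below (assuming any small causal LM checkpoint; `openai-community/gpt2` is only a placeholder) prints the cache length after prefill and again after a single decode step:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

inputs = tokenizer("The key-value cache stores", return_tensors="pt")
cache = DynamicCache()

with torch.no_grad():
    # Prefill: after one forward pass the dynamic layers hold every prompt token.
    out = model(**inputs, past_key_values=cache, use_cache=True)
    print(cache.get_seq_length())  # number of prompt tokens

    # Decode one token: each dynamic layer grows by exactly one position along seq_len.
    next_token = out.logits[:, -1:].argmax(-1)
    attention_mask = torch.cat([inputs["attention_mask"], torch.ones_like(next_token)], dim=-1)
    model(input_ids=next_token, attention_mask=attention_mask, past_key_values=cache, use_cache=True)
    print(cache.get_seq_length())  # number of prompt tokens + 1
```

Fixed-length caches such as the static cache are typically requested through `generate`, for example `model.generate(**inputs, cache_implementation="static")`, which is one common way to get the `torch.compile`-friendly behaviour mentioned above.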
@@ -143,7 +137,7 @@ The legacy format is essentially the same data structure but organized different
 - The tensors have the same shape `[batch_size, num_heads, seq_len, head_dim]`.
 - The format is less flexible and doesn't support features like quantization or offloading.
 
-If your project depends on this legacy format, you can convert between [`DynamicCache`] and a tuple of tuples as shown below with the [`~DynamicCache.from_legacy_cache`] and [`DynamicCache.to_legacy_cache`] functions. This is helpful if you have custom logic for manipulating a cache in a specific format.
+If your project depends on this legacy format, you can convert between [`DynamicCache`] and a tuple of tuples as shown below with the [`~Cache.from_legacy_cache`] and [`Cache.to_legacy_cache`] functions. This is helpful if you have custom logic for manipulating a cache in a specific format.
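A minimal sketch of that round trip, with made-up tensor sizes purely for illustration:

```python
import torch
from transformers import DynamicCache

# Legacy format: one (key, value) pair per layer, each tensor shaped
# [batch_size, num_heads, seq_len, head_dim]. The sizes below are arbitrary.
batch_size, num_heads, seq_len, head_dim = 1, 4, 5, 8
legacy = tuple(
    (
        torch.randn(batch_size, num_heads, seq_len, head_dim),
        torch.randn(batch_size, num_heads, seq_len, head_dim),
    )
    for _ in range(2)  # two layers
)

cache = DynamicCache.from_legacy_cache(legacy)  # tuple of tuples -> Cache object
print(cache.get_seq_length())                   # 5

round_trip = cache.to_legacy_cache()            # Cache object -> tuple of tuples
print(torch.equal(round_trip[0][0], legacy[0][0]))  # True
```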