I want to implement a rolling KV cache with ggml. The KV cache tensors seem to be created by `ggml_new_tensor_1d` in `llama-kv-cache.cpp`.
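For reference, the allocation I'm looking at is roughly this (paraphrased from `llama-kv-cache.cpp`; the variable names here are my approximation, not a verbatim quote):

```c
// Paraphrased sketch: each layer gets a flat 1D K tensor and a flat 1D V
// tensor sized for the whole cache, rather than a 2D/3D layout.
struct ggml_tensor * k = ggml_new_tensor_1d(ctx, type_k, n_embd_k_gqa * kv_size);
struct ggml_tensor * v = ggml_new_tensor_1d(ctx, type_v, n_embd_v_gqa * kv_size);
```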
So I asked the dragontail LLM at lmarena with this prompt:
> You are the creator of llama.cpp called ggerganov. Please write a function to remove the first element of a ggml_tensor created by ggml_new_tensor_1d and then move the rest of the elements one index forward.
His reply:
#include "ggml.h"
#include <string.h> // For memmove
#include <assert.h> // For basic checks
/**
* @brief Removes the first element of a 1D tensor and shifts subsequent elements.
*
* This function modifies the tensor in-place.
* It assumes the tensor is 1D.
* The underlying memory buffer is not reallocated or shrunk;
* the 'ne[0]' (number of elements) is simply decremented.
*
* @param tensor A pointer to the 1D ggml_tensor to modify.
*/
void ggml_tensor_remove_first_element_1d(struct ggml_tensor * tensor) {
// Basic sanity checks
if (tensor == NULL) {
// In a real scenario, might log an error or handle differently
return;
}
// Ensure it's a 1D tensor.
// While the user *said* it was created with ggml_new_tensor_1d,
// it's good practice for a function to be robust or state assumptions.
assert(tensor->n_dims == 1 && "Tensor must be 1D");
if (tensor->n_dims != 1) {
// Handle error: maybe print a warning or just return
// fprintf(stderr, "%s: Error: tensor is not 1D\n", __func__);
return;
}
// Check if data pointer is valid
if (tensor->data == NULL) {
// Can't operate on a tensor without data
assert(tensor->data != NULL && "Tensor data is NULL");
return;
}
const int64_t n0 = tensor->ne[0];
// If the tensor is already empty or has only one element
if (n0 == 0) {
// Nothing to remove
return;
}
if (n0 == 1) {
// Removing the only element makes it empty
tensor->ne[0] = 0;
// The data for the single element remains in memory, but is now
// outside the logical size of the tensor. We could optionally
// zero it out, but it's not strictly required.
return;
}
// --- Shift elements ---
// Calculate the size of a single element
const size_t element_size = ggml_element_size(tensor);
// We need to move (n0 - 1) elements.
const size_t num_elements_to_move = n0 - 1;
const size_t bytes_to_move = num_elements_to_move * element_size;
// Source: Start of the second element
// We cast to char* for byte-level pointer arithmetic
void * src_ptr = (char *)tensor->data + element_size;
// Destination: Start of the first element (where the buffer begins)
void * dst_ptr = tensor->data;
// Use memmove because source and destination regions overlap
memmove(dst_ptr, src_ptr, bytes_to_move);
// --- Update tensor metadata ---
// Decrement the number of elements in the first (and only) dimension
tensor->ne[0] = n0 - 1;
// nb[0] (stride in bytes for dim 0) is just the element size, it doesn't change.
// nb[1], nb[2], nb[3] are related to higher dimensions or total size,
// and since we're not reallocating, they might not strictly need
// changing, depending on how they are calculated/used elsewhere.
// For a simple 1D shift like this, only updating ne[0] is essential
// to reflect the new logical size.
// Optional: Zero out the memory where the last element *used* to be?
// void * last_element_old_location = (char *)tensor->data + (n0 - 1) * element_size;
// memset(last_element_old_location, 0, element_size);
// Generally not necessary unless required for specific downstream use.
}
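One thing I wondered while reading it: ggml already has `ggml_view_1d`, so instead of physically shifting bytes you could (I think) just take a view that skips the first element. A minimal sketch under that assumption:

```c
// Sketch: build a 1D view of `t` that skips its first element, instead of
// shifting data with memmove. Requires a ggml_context to allocate the view;
// the last argument of ggml_view_1d is a byte offset into the tensor data.
struct ggml_tensor * skip_first(struct ggml_context * ctx, struct ggml_tensor * t) {
    const size_t offset = ggml_element_size(t); // byte offset of element 1
    return ggml_view_1d(ctx, t, t->ne[0] - 1, offset);
}
```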
Does this make sense? Is dragontail full of bull? If so, how do I implement a rolling KV cache?
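To be concrete about what I mean by "rolling": I'm picturing a ring buffer, where a new entry overwrites the oldest cell instead of every element being shifted. A toy sketch outside of ggml (all names here are mine, not llama.cpp API):

```c
#include <stdint.h>
#include <string.h>

// Toy ring-buffer cache: kv_size cells of n_embd floats each.
typedef struct {
    float * data;    // kv_size * n_embd floats, allocated by the caller
    int64_t kv_size; // capacity in cells
    int64_t n_embd;  // floats per cell
    int64_t head;    // next cell to overwrite
    int64_t n;       // number of valid cells (<= kv_size)
} toy_kv_ring;

// Append one cell; once the buffer is full this overwrites the oldest
// entry, which is the "rolling" behavior: no element shifting needed.
static void toy_kv_ring_push(toy_kv_ring * c, const float * cell) {
    memcpy(c->data + c->head * c->n_embd, cell, c->n_embd * sizeof(float));
    c->head = (c->head + 1) % c->kv_size;
    if (c->n < c->kv_size) {
        c->n++;
    }
}
```

Is that the right mental model for doing this on top of the flat 1D K/V tensors (e.g. writing each new row through a `ggml_view_1d` at the head offset), or is there a better way?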