Collaborator

@saikat-royc saikat-royc commented Nov 5, 2025

Description

This PR introduces a bucketing strategy for the swap operations. An operation involving an arbitrary number of KV cache blocks is decomposed into smaller operations whose block counts come from a predefined set of bucket sizes. Because every bucket has a fixed input shape, the jitted functions hit JAX's JIT compilation cache, which avoids costly recompilation for varying input shapes in the serving hot path.
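As an illustration of the decomposition described above, here is a minimal sketch in plain Python. The bucket values and the name `decompose_into_buckets` are hypothetical and not taken from the PR; the greedy largest-first split is one simple way to do it, and it assumes a bucket of size 1 exists so that every block count can be represented.

```python
from typing import List, Sequence

# Hypothetical bucket sizes, measured in KV cache blocks.
# The values actually used by the PR are not shown in this description.
BLOCK_BUCKETS = (1, 2, 4, 8, 16)


def decompose_into_buckets(
    num_blocks: int, buckets: Sequence[int] = BLOCK_BUCKETS
) -> List[int]:
    """Greedily split num_blocks into bucket sizes that sum to num_blocks."""
    if num_blocks < 0:
        raise ValueError("num_blocks must be non-negative")
    if 1 not in buckets:
        # Without a size-1 bucket some block counts could not be covered exactly.
        raise ValueError("buckets must include 1")

    chunks: List[int] = []
    remaining = num_blocks
    # Take as many of the largest bucket as fit, then move to the next one.
    for size in sorted(buckets, reverse=True):
        count, remaining = divmod(remaining, size)
        chunks.extend([size] * count)
    return chunks
```

With these assumed buckets, a request for 11 blocks becomes three operations of 8, 2, and 1 blocks, each of which has a shape that can be compiled ahead of time.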

The change has two parts:

  1. Prepare precompile functions that cycle through the jitted load and save functions.
  2. Decompose the load and save utility functions so that they are aligned to block buckets. This ensures that compilation is not triggered in the hot path of serving.
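The two steps above can be sketched in JAX as follows. This is an illustrative sketch under stated assumptions, not the PR's code: `swap_blocks`, `precompile`, `swap_bucketed`, the bucket sizes, and the block shape are all hypothetical, and the "swap" is a placeholder multiply rather than a real host/device transfer. The `traced_shapes` list relies on the fact that Python side effects inside a jitted function run only while JAX traces it, so it records each compilation-triggering shape.

```python
import jax
import jax.numpy as jnp

BLOCK_BUCKETS = (1, 2, 4, 8)  # hypothetical bucket sizes, in blocks
BLOCK_SHAPE = (16, 8)         # hypothetical per-block shape

# Filled only when JAX traces swap_blocks, i.e. once per new input shape.
traced_shapes = []


@jax.jit
def swap_blocks(blocks):
    """Stand-in for a jitted load or save function (placeholder body)."""
    traced_shapes.append(blocks.shape)  # runs at trace time only
    return blocks * 1.0


def precompile():
    """Warm the JIT cache by calling swap_blocks once per bucket size."""
    for n in BLOCK_BUCKETS:
        dummy = jnp.zeros((n, *BLOCK_SHAPE), dtype=jnp.float32)
        swap_blocks(dummy).block_until_ready()


def swap_bucketed(blocks):
    """Process any number of blocks using only bucket-sized calls."""
    outputs = []
    start = 0
    remaining = blocks.shape[0]
    for size in sorted(BLOCK_BUCKETS, reverse=True):
        while remaining >= size:
            outputs.append(swap_blocks(blocks[start:start + size]))
            start += size
            remaining -= size
    if not outputs:  # zero blocks: nothing to do
        return blocks
    return jnp.concatenate(outputs, axis=0)
```

After `precompile()` has run, a call such as `swap_bucketed(jnp.ones((11, 16, 8), dtype=jnp.float32))` is served by the 8-, 2-, and 1-block variants that were already compiled, so `traced_shapes` does not grow during serving.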

Tests

  • existing unit tests
  • sglang benchmark tool: preliminary test cases show an x% improvement in tokens/s, along with improved TTFT, because no recompilation happens during the actual load and save operations

Checklist

Before submitting this PR, please make sure:

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have made or will make corresponding changes to any relevant documentation.

@saikat-royc saikat-royc force-pushed the cpu-offload branch 2 times, most recently from d1c5c0f to 5c6f31d Compare November 5, 2025 23:21
@saikat-royc saikat-royc changed the title [WIP] [TPU host offload] Setup precompile functions to TPU host offload [TPU host offload] Setup precompile functions to TPU host offload Nov 6, 2025
1. prepare precompile functions that cycle through the jitted load and save functions
2. decompose the load and save util functions so they are aligned to block buckets
3. add unit tests for the change

Signed-off-by: Saikat Roychowdhury <saikat.royc85@gmail.com>