Mooncake Store currently provides only single-key put()/get() interfaces, which is fine for saving and loading the KVs of a few tokens.
In vLLM and SGLang, however, one optimization is to transfer KVs layer by layer for better compute-communication overlap.
vLLM offloading
SGLang cache controller
However, when KVs are split by layer, the value size decreases sharply and the number of requests increases accordingly.
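As a rough illustration of that trade-off, the sketch below computes the request count and per-request size after a per-layer split. The helper name split_by_layer and the numbers are hypothetical and not measured; it also assumes every layer holds an equal share of the KV bytes.

```python
def split_by_layer(total_bytes: int, num_layers: int) -> tuple[int, int]:
    """Return (request_count, bytes_per_request) after a per-layer split.

    Assumes each layer holds an equal share of the KV bytes.
    """
    return num_layers, total_bytes // num_layers


# Illustrative numbers only: 8 MiB of KV for one token chunk, 32 layers.
requests, size = split_by_layer(8 * 1024 * 1024, 32)
print(requests, size)  # 32 262144
```

One 8 MiB put() becomes 32 puts of 256 KiB each, so any fixed per-request overhead is paid 32 times.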
To reduce the overhead of single-key put()/get(), we could provide batched Python interfaces such as:
def batch_put(key: list[str], value: list[bytes]):
    ...
def batch_get(key: list[str]) -> list[bytes]:
    ...
Adding the batch_put() and batch_get() interfaces requires some considerations:
- Should be callable asynchronously
- Should report a detailed status for each key/value
- Should update metrics correctly
- (Optional) Better utilization of storage and network bandwidth
- (Optional) Auto-batching that keeps the higher-level API unchanged
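To make the first three considerations concrete, here is a minimal sketch against an in-memory stand-in, not the actual Mooncake Store API. The class InMemoryStore, the Status enum, the metric names, and the *_async wrappers are all hypothetical; the async variants simply offload the blocking call with asyncio.to_thread, which is only one possible design.

```python
import asyncio
from enum import Enum
from typing import Optional


class Status(Enum):
    """Hypothetical per-key result codes."""
    OK = 0
    NOT_FOUND = 1
    INVALID = 2


class InMemoryStore:
    """Hypothetical stand-in for a store; not the Mooncake Store API."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}
        self.metrics = {"put_ok": 0, "put_fail": 0, "get_hit": 0, "get_miss": 0}

    def batch_put(self, key: list[str], value: list[bytes]) -> list[Status]:
        if len(key) != len(value):
            raise ValueError("key and value must have the same length")
        statuses = []
        for k, v in zip(key, value):
            if not k:  # reject empty keys, but keep processing the rest
                self.metrics["put_fail"] += 1
                statuses.append(Status.INVALID)
                continue
            self._data[k] = v
            self.metrics["put_ok"] += 1
            statuses.append(Status.OK)
        return statuses

    def batch_get(self, key: list[str]) -> list[tuple[Status, Optional[bytes]]]:
        results = []
        for k in key:
            v = self._data.get(k)
            if v is None:
                self.metrics["get_miss"] += 1
                results.append((Status.NOT_FOUND, None))
            else:
                self.metrics["get_hit"] += 1
                results.append((Status.OK, v))
        return results

    async def batch_put_async(self, key: list[str], value: list[bytes]) -> list[Status]:
        # Run the blocking call in a worker thread so the caller can overlap compute.
        return await asyncio.to_thread(self.batch_put, key, value)

    async def batch_get_async(self, key: list[str]) -> list[tuple[Status, Optional[bytes]]]:
        return await asyncio.to_thread(self.batch_get, key)
```

Returning one status per key lets a caller retry or recompute only the layers that failed, and counting metrics per key rather than per batch keeps hit/miss rates comparable with the single-key path.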
Co-design, reviews, contributions, and PRs are all welcome!