@SgtPepperr
Contributor

I am currently working on the KVcache SSD offload feature for the client side of the Mooncake project. The implementation works as follows:

  • Initialize: The storage path for files is specified during initialization. If the storage path is invalid, subsequent persistence operations will fail.
  • Put: For each successful put request, the client writes the data to the designated storage path using POSIX file write operations.
  • Get: For each failed get request, the client tries to locate the corresponding KVcache in the local storage path. If it is found, the data is read from the file and returned.
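
The three stages above can be sketched roughly as follows. This is only an illustration in Python, not the PR's C++ code: `PersistentClient`, its `remote` dict (standing in for the remote store), and the one-file-per-key layout are all assumptions made for the example.

```python
import os


class PersistentClient:
    """Hypothetical client; a dict stands in for the remote store."""

    def __init__(self, storage_root):
        # Initialize: record the storage path. If it is not a valid
        # directory, later persistence operations fail.
        self.storage_root = storage_root
        self.persist_ok = os.path.isdir(storage_root)
        self.remote = {}

    def _path(self, key):
        # Assumption: one file per key, with the key used as the file name.
        return os.path.join(self.storage_root, key)

    def put(self, key, value):
        self.remote[key] = value  # stands in for a successful remote put
        if not self.persist_ok:
            return False  # persistence failed: invalid storage path
        # POSIX-style write: open, write until all bytes are out, close.
        fd = os.open(self._path(key),
                     os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            written = 0
            while written < len(value):
                written += os.write(fd, value[written:])
        finally:
            os.close(fd)
        return True

    def get(self, key):
        if key in self.remote:
            return self.remote[key]
        # The remote get failed, so fall back to the local storage path.
        path = self._path(key)
        if self.persist_ok and os.path.isfile(path):
            with open(path, "rb") as f:
                return f.read()
        return None
```

In this sketch a key that was evicted from `remote` is still served from disk, while a client created with an invalid path reports a failed persistence on every put.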

The persistence functionality is enabled through the compile-time option USE_CLIENT_PERSISTENCE.

[TODO] File writes are currently still performed synchronously. The asynchronous thread pool interfaces are already implemented, and the writes will be switched to asynchronous operation in a subsequent commit once debugging is complete.
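
For illustration, moving file writes onto a thread pool could look roughly like the sketch below. It uses Python's standard `concurrent.futures` rather than the PR's own thread pool, and `AsyncWriter`, `submit_write`, and `write_file` are hypothetical names, not interfaces from this PR.

```python
from concurrent.futures import ThreadPoolExecutor


def write_file(path, data):
    """Blocking write; runs on a worker thread."""
    with open(path, "wb") as f:
        f.write(data)
    return len(data)


class AsyncWriter:
    """Hypothetical wrapper that queues file writes on a thread pool."""

    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def submit_write(self, path, data):
        # Returns a Future immediately; the put path does not block on disk.
        return self.pool.submit(write_file, path, data)

    def close(self):
        # Wait for all queued writes to finish before shutting down.
        self.pool.shutdown(wait=True)
```

The caller keeps the returned future only if it needs to know whether the write succeeded; otherwise the put can return as soon as the task is queued.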

The current code represents the initial implementation of KVcache persistence on the client side. Future work will focus on refining the existing implementation, including adding comments, removing redundant code, improving readability and extensibility, adding test code, and updating documentation.

liuxingyu added 6 commits May 30, 2025 12:45
…gen-style comments to all header files - Removed redundant code and outdated comments - Optimized function execution logic in LocalFile
… add related description in doc"

This reverts commit 159442d.

revert old high-level api test and doc modification
…roduce the storage path through environment variables.
- Refactor write operations to use thread pool for async file I/O
- Fix potential double-unlock bug by adding atomic is_locked_ flag
- Add corrupted file cleanup on write failure:
  - Auto-delete files with failed writes in destructor
  - Prevent subsequent reads of corrupted data
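
The cleanup described in the last commit above can be sketched as follows. This is an assumption-laden Python illustration: a context manager plays the role of the C++ destructor, and `GuardedFile` is a hypothetical name, not a class from this PR.

```python
import os


class GuardedFile:
    """Hypothetical file wrapper that deletes itself after a failed write."""

    def __init__(self, path):
        self.path = path
        self._f = open(path, "wb")

    def write(self, data):
        self._f.write(data)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self._f.close()
        if exc_type is not None:
            # The write failed part-way: remove the partial file so that
            # later reads cannot return corrupted data.
            os.remove(self.path)
        return False  # let the original error propagate
```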
@xiaguan
Collaborator

xiaguan commented Jun 5, 2025

Thanks a lot for the contribution! This PR is a bit on the large side. Would it be possible to break it up into smaller pieces?

BTW, we probably don't need USE_CLIENT_PERSISTENCE here, since this feature doesn't introduce any new dependencies.

@stmatengss stmatengss requested a review from Copilot June 5, 2025 12:01

cls.store = MooncakeDistributedStore()
get_client(cls.store)

@unittest.skipIf(os.getenv("MOONCAKE_STORAGE_ROOT_DIR"),
Collaborator

Is it for passing the CI test?

Contributor Author

Yes. With file storage enabled, Teardownall no longer removes the files, which does not match the behavior this test expects.
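
As a rough illustration of the pattern under discussion, the sketch below skips a test whenever `MOONCAKE_STORAGE_ROOT_DIR` is set to a non-empty value. The test class, its dict-based store, and the skip message are hypothetical; only the environment variable name and the `unittest.skipIf` call come from the diff above.

```python
import os
import unittest


class InMemoryStoreTest(unittest.TestCase):
    """Hypothetical test; a dict stands in for the store."""

    def setUp(self):
        self.store = {"k1": b"v1", "k2": b"v2"}

    # The condition is evaluated once, when the class body is executed.
    # A non-empty value means file persistence is on, so the test is skipped.
    @unittest.skipIf(os.getenv("MOONCAKE_STORAGE_ROOT_DIR"),
                     "persistence enabled: teardown does not delete files")
    def test_remove_all_leaves_nothing(self):
        self.store.clear()
        self.assertEqual(len(self.store), 0)
```

Because the decorator reads the variable at import time, changing the environment after the test module is loaded does not affect whether the test runs.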

@stmatengss stmatengss merged commit 77f5a7b into kvcache-ai:main Jul 2, 2025
10 checks passed
@SgtPepperr SgtPepperr deleted the client-ssd-persistence branch July 3, 2025 03:20
@soyail

soyail commented Jul 10, 2025

Can SGLang use this feature now? Looking forward to it!

201341 pushed a commit to 201341/Mooncake that referenced this pull request Jul 22, 2025
…#437)

* enable client ssd offload and storage persistence

* add storage_root_path in all tests setup() initialization and add related description in doc

* clean up headers and improve code readability - Added consistent Doxygen-style comments to all header files - Removed redundant code and outdated comments - Optimized function execution logic in LocalFile

* Revert "add storage_root_path in all tests setup() initialization and add related description in doc"

This reverts commit 159442d.

revert old high-level api test and doc modification

* Restore the high-level API to its original state and modify it to introduce the storage path through environment variables.

* add local_file_test and thread_pool_test

* feat(client_ssd_offload): implement async writes and fix locking bugs

- Refactor write operations to use thread pool for async file I/O
- Fix potential double-unlock bug by adding atomic is_locked_ flag
- Add corrupted file cleanup on write failure:
  - Auto-delete files with failed writes in destructor
  - Prevent subsequent reads of corrupted data

* add support for remove , remove_all , isexist interface etc.

* feat(kvcache): implement cluster isolation with session IDs

    * Remove precompilation parameters to simplify build configuration
    * Add session ID mechanism for cluster isolation:
      - Master node now generates unique session IDs on initialization
      - All persistent operations are scoped under session-specific subdirectories

* edit two parameters client get, add persistence path in client rather than store_py.cpp

* add support for batch api conflict , refactor replica.descriptor to support file and memory type

* add test branch

* add ci ssd

* change python test

* add log for fail

* change querykey return value type

* fix bug

* fix bug

* add sleep for removefile

* fix sleep

* edit ci.yml and fix delete before write problem

* add comment for storage_backend

* spell check

* fix name problem and decrease errorcode for file

* add pytest for ssd offload

* edit test

* edit test

* fix test

* fix bug

* fix test

* Modify the thread pool value capture to reference capture to fix the issue of significant performance degradation when writing files with put.

* add async getfrom file in batchget transfertask. delete file_storage_backend

* add support for HA in cluster_id subdirectory, change session_id to fsdir

* add persistence in batchput

* add disk allocate for get_into py interface

* fix bug

* temp

* fix bug in submit fileread task for std::move(slices)

* edit querykey to return optional<descriptor>, add interface batchquerykey for storagebackend

* fix conflict in batchget, add batchget/batchput test

* fix bug

* fix bug

* comment batch test

* fix conflict and add batch_get_into file test

* fix test bug

* fix test bug

* fix conflict
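
The commit log above mentions scoping all persistent operations under session-specific subdirectories for cluster isolation. A minimal sketch of that idea is shown below. How the master actually generates its IDs is not stated in this PR, so the use of `uuid4` and the names `new_session_id` and `session_path` are assumptions made for illustration.

```python
import os
import uuid


def new_session_id():
    """Hypothetical ID generator; the PR does not say how IDs are built."""
    return uuid.uuid4().hex


def session_path(storage_root, session_id, key):
    """Resolve a key to a file inside the session's own subdirectory."""
    session_dir = os.path.join(storage_root, session_id)
    os.makedirs(session_dir, exist_ok=True)
    return os.path.join(session_dir, key)
```

With this layout, two clusters that share the same storage root but hold different session IDs never see each other's files, even when they use identical keys.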
@stmatengss stmatengss mentioned this pull request Aug 12, 2025
29 tasks
5 participants