[Store] Enable Client SSD Offload And Storage Persistence #437
…ated description in doc
…gen-style comments to all header files - Removed redundant code and outdated comments - Optimized function execution logic in LocalFile
… add related description in doc" This reverts commit 159442d. revert old high-level api test and doc modification
…roduce the storage path through environment variables.
- Refactor write operations to use thread pool for async file I/O
- Fix potential double-unlock bug by adding atomic is_locked_ flag
- Add corrupted file cleanup on write failure:
  - Auto-delete files with failed writes in destructor
  - Prevent subsequent reads of corrupted data
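The commit above pairs asynchronous writes with cleanup of files whose write failed. The following is a minimal Python sketch of that idea, not the PR's C++ code: `write_or_cleanup` and `demo` are hypothetical names, and the standard-library `ThreadPoolExecutor` stands in for the project's own thread pool.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_or_cleanup(path: str, data: bytes) -> bool:
    """Write data to path; delete the file if the write fails partway."""
    try:
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        return True
    except (OSError, TypeError):
        # Remove the partial file so later reads cannot see corrupted data.
        if os.path.exists(path):
            os.remove(path)
        return False


def demo() -> tuple:
    """Submit one good and one failing write to a pool and report the outcome."""
    with tempfile.TemporaryDirectory() as root, ThreadPoolExecutor(max_workers=2) as pool:
        good = os.path.join(root, "good.bin")
        bad = os.path.join(root, "bad.bin")
        ok_future = pool.submit(write_or_cleanup, good, b"kv-bytes")
        # Writing a str to a binary file raises TypeError after the file exists,
        # which simulates a write that fails after the file was created.
        bad_future = pool.submit(write_or_cleanup, bad, "not-bytes")
        results = (ok_future.result(), bad_future.result())
        return results + (os.path.exists(good), os.path.exists(bad))
```

The caller gets a future per write and can continue without blocking on disk I/O; the failed write leaves no file behind.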
Thanks a lot for the contribution! This PR is a bit on the large side. Would it be possible to break it up into smaller pieces? BTW, we probably don't need USE_CLIENT_PERSISTENCE here, since this feature doesn't introduce any new dependencies.
* Remove precompilation parameters to simplify build configuration
* Add session ID mechanism for cluster isolation:
- Master node now generates unique session IDs on initialization
- All persistent operations are scoped under session-specific subdirectories
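As a rough illustration of session-scoped persistence, the sketch below places every persisted key under a subdirectory named after a session ID. This is an assumption-laden sketch, not the PR's implementation: `new_session_id`, `session_path`, and the use of a UUID as the ID format are all hypothetical.

```python
import os
import uuid


def new_session_id() -> str:
    """Generate a unique session ID (a random UUID is an assumption here)."""
    return uuid.uuid4().hex


def session_path(root_dir: str, session_id: str, key: str) -> str:
    """Map a key to a file path under the session-specific subdirectory."""
    # Replace path separators so a key cannot escape its session directory.
    safe_key = key.replace(os.sep, "_")
    return os.path.join(root_dir, session_id, safe_key)
```

Two clusters that share the same storage root but hold different session IDs resolve the same key to different paths, so their files do not collide.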
…r than store_py.cpp
…upport file and memory type
…ykey for storagebackend
cls.store = MooncakeDistributedStore()
get_client(cls.store)

@unittest.skipIf(os.getenv("MOONCAKE_STORAGE_ROOT_DIR"),
Is it for passing the CI test?
Yes. With fs enabled, Teardownall no longer removes the files, which does not match the behavior this test expects.
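The skip condition discussed here can be sketched as follows. Only the `MOONCAKE_STORAGE_ROOT_DIR` variable name comes from the diff; the test class, the skip reason, and the helper functions are hypothetical and exist only to show how `unittest.skipIf` reacts to the variable.

```python
import os
import unittest

STORAGE_ENV = "MOONCAKE_STORAGE_ROOT_DIR"  # name taken from the diff in this thread


def build_case():
    """Build the test class at call time so the decorator sees the current env."""

    class TeardownSensitiveTest(unittest.TestCase):  # hypothetical test class
        @unittest.skipIf(os.getenv(STORAGE_ENV),
                         "persistence enabled: files survive teardown")
        def test_store_is_empty_after_teardown(self):
            self.assertTrue(True)  # placeholder for the real assertions

    return TeardownSensitiveTest


def run_and_count_skips(case) -> int:
    """Run the test case and return how many tests were skipped."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return len(result.skipped)
```

`os.getenv` returns `None` when the variable is unset, so the test runs by default and is skipped only when a storage root is configured.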
Can sglang use this feature now? Looking forward to it!
…#437)
* enable client ssd offload and storage persistence
* add storage_root_path in all tests setup() initialization and add related description in doc
* clean up headers and improve code readability
  - Added consistent Doxygen-style comments to all header files
  - Removed redundant code and outdated comments
  - Optimized function execution logic in LocalFile
* Revert "add storage_root_path in all tests setup() initialization and add related description in doc"
  This reverts commit 159442d. revert old high-level api test and doc modification
* Restore the high-level API to its original state and modify it to introduce the storage path through environment variables.
* add local_file_test and thread_pool_test
* feat(client_ssd_offload): implement async writes and fix locking bugs
  - Refactor write operations to use thread pool for async file I/O
  - Fix potential double-unlock bug by adding atomic is_locked_ flag
  - Add corrupted file cleanup on write failure:
    - Auto-delete files with failed writes in destructor
    - Prevent subsequent reads of corrupted data
* add support for remove, remove_all, isexist interface etc.
* feat(kvcache): implement cluster isolation with session IDs
* Remove precompilation parameters to simplify build configuration
* Add session ID mechanism for cluster isolation:
  - Master node now generates unique session IDs on initialization
  - All persistent operations are scoped under session-specific subdirectories
* edit two parameters client get, add persisitence path in client rather than store_py.cpp
* add support for batch api conflict, refactor replica.descriptor to support file and memory type
* add test branch
* add ci ssd
* change python test
* add log for fail
* change querykey return value type
* fix bug
* fix bug
* add sleep for removefile
* fix sleep
* edit ci.yml and fix delete before write problem
* add comment for storage_backend
* spell check
* fix name problem and decrease errorcode for file
* add pytest for ssd offload
* edit test
* edit test
* fix test
* fix bug
* fix test
* Modify the thread pool value capture to reference capture to fix the issue of significant performance degradation when writing files with put.
* add async getfrom file in batchget transfertask. delete file_storage_backend
* add support for HA in cluster_id subdirectory, change session_id to fsdir
* add persistence in batchput
* add disk allocate for get_into py interface
* fix bug
* temp
* fix bug in submit fileread task for std:move(slices)
* edit querykey to return optional<descriptor>, add interface batchquerykey for storagebackend
* fix confict in batchget, add batchget/batchput test
* fix bug
* fix bug
* comment batch test
* fix conflict and add batch_get_into file test
* fix test bug
* fix test bug
* fix conflict
I am currently working on the KVcache SSD offload feature for the client side in the Mooncake project. The primary implementation approach involves
The persistence functionality is enabled through a compile-time parameter, USE_CLIENT_PERSISTENCE.
[TODO] File write operations are currently still synchronous. The asynchronous thread pool interfaces are already implemented, and writes will be switched to them in a subsequent commit once debugging is complete.
The current code represents the initial implementation of KVcache persistence on the client side. Future work will focus on refining the existing implementation, including adding comments, removing redundant code, improving readability and extensibility, adding test code, and updating documentation.