[Store]feat: Migrate Persistence Metadata from Client to Master Service #690
Conversation
xiaguan
left a comment
I don't think the master should handle filesystem allocation this way.
We should provide the master with a parameter indicating it can perform filesystem allocation, where the root path is xxx. Then it can simply append the object key to the root path?
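Something like the following minimal sketch, purely for illustration (the parameter and helper names are assumptions, not the actual API):

```cpp
#include <filesystem>
#include <string>

// Hypothetical sketch: the master is started with a root path (the directory
// every client mounts its DFS under), and the per-object file path is derived
// by appending the object key to that root.
std::string AllocateObjectPath(const std::string& root_fs_dir,
                               const std::string& object_key) {
    // Note: a real implementation would need to sanitize keys that contain
    // path separators or other special characters.
    return (std::filesystem::path(root_fs_dir) / object_key).string();
}
```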
mooncake-store/src/client.cpp (outdated)

```cpp
        value.append(static_cast<char*>(slice.ptr), slice.size);
    }
    write_thread_pool_.enqueue(
```
I believe we could move the write_thread_pool down to the storage backend, as it's not directly related to the Client class.
Since the write_thread_pool involves calling `PutEnd` and `PutRevoke` through the master-client, from an abstraction perspective the storage layer should only handle file read/write operations, while `PutEnd` and similar operations are part of the client->master interaction logic. Therefore, moving the write_thread_pool down to the storage layer might require further discussion.
Yes, I also believe that after migrating the metadata to be managed by the master, the persistence switch and the corresponding paths should be controlled by the master as well. The latest commit has already made the corresponding changes. Additionally, since the master now manages the metadata, file read/write concurrency is controlled by the master and can no longer trigger read/write conflicts, so the file read/write locks have also been removed.
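To make the layering in this discussion concrete, here is a rough sketch under the abstraction described above; all type and function names are illustrative stubs, not the actual Mooncake interfaces. The storage layer only performs file I/O, while the thread pool and the `PutEnd`/`PutRevoke` RPCs stay in the client.

```cpp
#include <functional>
#include <string>

// Illustrative stubs only.
struct StorageBackend {  // storage layer: file read/write, nothing else
    bool StoreObject(const std::string& key, const std::string& value);
};
struct MasterClient {    // client -> master interaction layer
    void PutEnd(const std::string& key);
    void PutRevoke(const std::string& key);
};
struct ThreadPool {
    void enqueue(std::function<void()> task);
};

// The client owns the thread pool and decides which RPC to issue based on the
// storage result; the backend never talks to the master. (The captured
// references must outlive the enqueued task.)
void AsyncPersist(ThreadPool& pool, StorageBackend& backend,
                  MasterClient& master, std::string key, std::string value) {
    pool.enqueue([&backend, &master, key = std::move(key),
                  value = std::move(value)]() {
        if (backend.StoreObject(key, value)) {
            master.PutEnd(key);
        } else {
            master.PutRevoke(key);
        }
    });
}
```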
Hello, I'd like to ask a question about high-availability mode: when a master switchover occurs, will the new master delete both the in-memory data and the disk files from the previous master?
xiaguan
left a comment
Just some grammar suggestions. No major changes needed, but we should add some coverage in master_service_test.cpp for the modifications. The rest LGTM.
In the current high-availability implementation, when a master switchover occurs, the client automatically reconnects to the new master and mounts the corresponding segment. However, all mem-kv data is cleared, and the new master's metadata is also empty, so it cannot query any previously saved kv pairs, requiring a re-Put operation. As for disk-kv, an automatic file deletion mechanism has not yet been introduced. Although disk-kv cannot be indexed by the new master, the kv files remain in their original paths. In the future, we plan to introduce eviction or related mechanisms to ensure the deletion of disk-kv files.
Thank you for your response. May I ask whether there will be similar persistence of KV metadata in the future, to ensure the cache remains available after a master switch?
Yes, according to my understanding, the high-availability mode's TODO list does include the recovery of master metadata after a failure. You can refer to the description in the previous PR #451 regarding this matter.
Thanks! I have added a
ok thx
xiaguan
left a comment
LGTM
One more thing: could you check whether it has conflicts with #710?
At the moment, there’s no direct conflict with the existing code; it’s just that the VRAM-side implementation will likely need to add file-handling logic inside the
Given that another PR #710 will require modifying the replica configuration to specify VRAM-based PUT operations, and the SSD feature also needs changes to the replica config to let users control SSD writes and to validate the client mount, we plan to submit a new PR later that introduces a unified resource type for replica configuration. This will consolidate replica metadata for DRAM, VRAM, and SSD into a single, consistent schema. |
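As an illustration only, the unified schema could look roughly like the sketch below; the field and enum names are assumptions, not a finalized design.

```cpp
#include <cstdint>

// Hypothetical unified replica resource type covering DRAM, VRAM, and SSD.
enum class ReplicaMedium : uint8_t { DRAM, VRAM, SSD };

// Hypothetical per-replica configuration carried in the replica config,
// so DRAM, VRAM, and SSD replicas share one consistent schema.
struct ReplicaResource {
    ReplicaMedium medium = ReplicaMedium::DRAM;  // where this replica lives
    uint32_t count = 1;                          // requested replica count
    bool required = false;  // whether Put should fail if this medium is unavailable
};
```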
…ce (kvcache-ai#690)
* initial commit
* fix client::query return fault
* fix isexist return fault
* fix test bug
* fix clearinvalidhandles problem
* add file description for 3fs
* change ssd function start from client to master
* fix naming error
* edit doc description
* edit doc
* clang format
* fix as the review comment
* fix formmat
* add master service test for ssd
* fix format
* add log and cli
* fix putend test


Background
Based on the iteration roadmap discussed in the previous PR #610 and issue #578, this release migrates metadata management of the SSD KV-cache in the hierarchical caching feature from the client to the master service, preserving Mooncake's control/data separation design philosophy.
The main benefits are:
- Improved `BatchGet` read performance on 3FS.

Design & Implementation
The key idea is to extend the notion of replica so it can be either memory or disk.
- `PutStart` now returns both a memory replica and a disk replica. The client writes data to each independently. Disk-replica file writes are asynchronous (handled by a dedicated file-writer thread pool) and therefore do not block the synchronous path.
- The `PutEnd` RPC is extended:
  – The synchronous path issues `PutEnd` for the memory replica.
  – The asynchronous path issues `PutEnd` for the disk replica.
- `Get` chooses the appropriate read path depending on replica type.
- `Evict` is modified: instead of deleting the entire metadata entry, only the memory replica portion is removed.
- The persistence switch is still controlled by the client:
  – If a storage path is specified → enable the disk replica (`Put`/`Get`).
  – If not specified → skip all disk-related operations.
Update: the persistence switch has since been moved from the client to the master side. Enabling persistence is now controlled by specifying `--root_fs_dir=/path/to/dir` at master startup, and all client hosts must mount their DFS directories under this specified path to avoid potential abnormal behavior in Mooncake Store.
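To illustrate the put flow described above, here is a minimal sketch; the type names, helpers, and signatures are assumptions for illustration and do not mirror the actual client code.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Illustrative stubs.
struct Replica { bool is_disk = false; std::string file_path; };
struct MasterClient {
    std::vector<Replica> PutStart(const std::string& key, std::size_t size);
    void PutEnd(const std::string& key, bool disk);
    void PutRevoke(const std::string& key, bool disk);
};
struct ThreadPool { void enqueue(std::function<void()> task); };
bool WriteToMemorySegment(const Replica& r, const std::string& value);
bool WriteToFile(const std::string& path, const std::string& value);

// PutStart hands back both replicas; the memory replica is written and
// finalized on the synchronous path, while the disk replica is written by
// the file-writer thread pool and finalized (or revoked) asynchronously.
void PutObject(MasterClient& master, ThreadPool& writers,
               const std::string& key, const std::string& value) {
    for (const Replica& replica : master.PutStart(key, value.size())) {
        if (!replica.is_disk) {
            if (WriteToMemorySegment(replica, value)) {
                master.PutEnd(key, /*disk=*/false);      // synchronous path
            } else {
                master.PutRevoke(key, /*disk=*/false);
            }
        } else {
            writers.enqueue([&master, replica, key, value]() {
                if (WriteToFile(replica.file_path, value)) {
                    master.PutEnd(key, /*disk=*/true);   // asynchronous path
                } else {
                    master.PutRevoke(key, /*disk=*/true);
                }
            });
        }
    }
}
```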
State Transitions
Moving metadata from the client to the master increases the complexity of state management:
- Previously an object's replicas were memory only; now the possible combinations are `mem`, `mem+disk`, and `disk`.
- Further, each replica has its own status (e.g., `PROCESSING`, `COMPLETED`), so the combined state space explodes.

Explicit tests and eventually state-machine diagrams will be added to prevent bugs.
Basic transition rules
```mermaid
stateDiagram
    [*] --> empty
    empty --> mem : Put
    empty --> disk : Put
    empty --> mem+disk : Put
    mem --> empty : Remove
    mem --> empty : Evict
    disk --> empty : Remove
    mem+disk --> disk : Evict
    mem+disk --> empty : Remove
```

Corner Cases
Mixed persistence settings
– 3 clients: 2 with a valid mount directory, 1 with an invalid mount directory.
– Potential abnormal behavior may happen in Mooncake Store.

Memory write fails, disk write succeeds (mem+disk path)
– The memory failure triggers `PutRevoke`, removing the memory replica.
– The disk success triggers `PutEnd`, leaving a `COMPLETED` disk replica usable by later reads.

New `Put` on an existing pure-disk replica
– Currently rejected with “object exists”.
– Future plan: allow re-adding a memory replica, transitioning `disk` → `mem+disk`.

Get while disk replica is still processing
– If the memory replica exists and is `COMPLETED`, the master returns all completed replicas, ignoring the still-processing disk replica.
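
For example, the selection could be a simple status filter on the master side (a sketch with made-up types, not the actual master code):

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

enum class ReplicaStatus { PROCESSING, COMPLETED };
struct ReplicaInfo { ReplicaStatus status = ReplicaStatus::PROCESSING; };

// Sketch: Get hands back only replicas that have reached COMPLETED, so a
// disk replica whose file write is still in flight is simply not returned.
std::vector<ReplicaInfo> SelectReadableReplicas(
        const std::vector<ReplicaInfo>& replicas) {
    std::vector<ReplicaInfo> readable;
    std::copy_if(replicas.begin(), replicas.end(),
                 std::back_inserter(readable),
                 [](const ReplicaInfo& r) {
                     return r.status == ReplicaStatus::COMPLETED;
                 });
    return readable;
}
```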