
Conversation

@SgtPepperr
Contributor

@SgtPepperr SgtPepperr commented Jul 29, 2025

Background

Based on the iteration roadmap discussed in the previous PR #610 and issue #578, this PR migrates metadata management of the SSD KV-cache in the hierarchical caching feature from the client to the master service, preserving Mooncake's control/data separation design philosophy.
The main benefits are:

  • Improved BatchGet read performance on 3FS.
  • Stronger data-consistency guarantees.

Design & Implementation

The key idea is to extend the notion of replica so it can be either memory or disk.

  • PutStart now returns both a memory replica and a disk replica.
    The client writes data to each independently.
    Disk-replica file writes are asynchronous (handled by a dedicated file-writer thread pool) and therefore do not block the synchronous path.
  • The PutEnd RPC is extended:
    – The synchronous path issues PutEnd for the memory replica.
    – The asynchronous path issues PutEnd for the disk replica.
  • Get chooses the appropriate read path depending on replica type.
  • Evict is modified: instead of deleting the entire metadata entry, only the memory-replica portion is removed.
    In the original design, the persistence switch was controlled by the client:
    – If a storage path was specified → disk replicas were enabled (Put/Get).
    – If not specified → all disk-related operations were skipped.

The persistence switch has since been moved from the client to the master side: persistence is now enabled by specifying --root_fs_dir=/path/to/dir at master startup. All client hosts must mount their DFS directories under this path; otherwise Mooncake Store may behave abnormally.
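To make the dual-replica Put flow described above concrete, here is a minimal C++ sketch. It is not the actual Mooncake client API: `Replica`, `WriteReplica`, and `PutEnd` are hypothetical stand-ins, `std::async` stands in for the dedicated file-writer thread pool, and the example locations are made up.

```cpp
// Hypothetical sketch of the dual-replica Put path (not the real Mooncake API).
// The memory replica is written and finalized on the synchronous path; the
// disk replica is written and finalized asynchronously.
#include <future>
#include <iostream>
#include <string>
#include <vector>

enum class ReplicaType { kMemory, kDisk };

struct Replica {
    ReplicaType type;
    std::string location;  // memory segment address or DFS file path
};

// Stubs standing in for the real transfer write and master RPC.
bool WriteReplica(const Replica& r, const std::string& value) {
    std::cout << "write " << value.size() << " bytes to " << r.location << "\n";
    return true;
}
void PutEnd(const std::string& key, ReplicaType type) {
    std::cout << "PutEnd(" << key << ", "
              << (type == ReplicaType::kMemory ? "memory" : "disk") << ")\n";
}

// Returns once the memory replica is COMPLETED; the returned future tracks the
// disk replica, which reaches COMPLETED whenever its asynchronous write finishes.
std::future<void> Put(const std::string& key, const std::string& value,
                      const std::vector<Replica>& replicas /* from PutStart */) {
    std::future<void> disk_done;
    for (const auto& r : replicas) {
        if (r.type == ReplicaType::kMemory) {
            WriteReplica(r, value);            // synchronous path
            PutEnd(key, ReplicaType::kMemory);
        } else {
            disk_done = std::async(std::launch::async, [=] {
                WriteReplica(r, value);        // asynchronous path
                PutEnd(key, ReplicaType::kDisk);
            });
        }
    }
    return disk_done;
}

int main() {
    // Placeholder locations for illustration only.
    auto disk_done = Put("k1", "hello",
                         {{ReplicaType::kMemory, "segment://node0"},
                          {ReplicaType::kDisk, "/mnt/3fs/kvcache/k1"}});
    if (disk_done.valid()) disk_done.wait();  // only so the demo exits cleanly
}
```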

State Transitions

Moving metadata from the client to the master increases the complexity of state management:

| Old model   | New model           |
|-------------|---------------------|
| memory only | mem, mem+disk, disk |

Further, each replica has its own status (e.g., PROCESSING, COMPLETED), so the combined state space explodes.
Explicit tests and eventually state-machine diagrams will be added to prevent bugs.
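As a rough illustration (hypothetical types, not the actual master-service data structures), the enlarged state space can be pictured as one metadata entry holding up to two replicas, each with its own status:

```cpp
#include <optional>

enum class ReplicaStatus { PROCESSING, COMPLETED };

// One metadata entry, holding at most one memory replica and one disk replica,
// each with its own status.
struct ObjectState {
    std::optional<ReplicaStatus> memory;  // nullopt => no memory replica
    std::optional<ReplicaStatus> disk;    // nullopt => no disk replica
};
// "empty"    : both nullopt
// "mem"      : memory set, disk nullopt
// "disk"     : disk set,   memory nullopt
// "mem+disk" : both set
```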

Basic transition rules

```mermaid
stateDiagram-v2
    state "mem+disk" as mem_disk

    [*] --> empty

    empty --> mem      : Put
    empty --> disk     : Put
    empty --> mem_disk : Put

    mem  --> empty : Remove
    mem  --> empty : Evict

    disk --> empty : Remove

    mem_disk --> disk  : Evict
    mem_disk --> empty : Remove
```

Corner Cases

  1. Mixed persistence settings
    Three clients: two with a valid mount directory, one with an invalid one.
    – Mooncake Store may behave abnormally in this configuration.

  2. Memory write fails, disk write succeeds (mem+disk path)
    – Memory failure triggers PutRevoke, removing the memory replica.
    – Disk success triggers PutEnd, leaving a COMPLETED disk replica usable by later reads.

  3. New Put on an existing pure-disk replica
    – Currently rejected with “object exists”.
    – Future plan: allow re-adding a memory replica, transitioning disk → mem+disk.

  4. Get while the disk replica is still processing
    – If a memory replica exists and is COMPLETED, the master returns all completed replicas and ignores the still-processing disk replica (see the sketch after this list).
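A minimal sketch of that selection rule, using hypothetical `ReplicaInfo`/`ReplicaStatus` types rather than the actual master-service ones:

```cpp
#include <vector>

enum class MediaType { MEMORY, DISK };
enum class ReplicaStatus { PROCESSING, COMPLETED };

struct ReplicaInfo {
    MediaType media;
    ReplicaStatus status;
};

// Get-side filtering: only COMPLETED replicas are returned to the client, so a
// disk replica that is still PROCESSING is simply skipped.
std::vector<ReplicaInfo> SelectReadableReplicas(
        const std::vector<ReplicaInfo>& all) {
    std::vector<ReplicaInfo> ready;
    for (const auto& r : all) {
        if (r.status == ReplicaStatus::COMPLETED) {
            ready.push_back(r);
        }
    }
    return ready;  // empty => the object is not yet readable
}
```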

@SgtPepperr SgtPepperr marked this pull request as draft July 29, 2025 12:12
@SgtPepperr SgtPepperr marked this pull request as ready for review July 30, 2025 07:39
@xiaguan xiaguan self-requested a review July 31, 2025 03:26
Collaborator

@xiaguan xiaguan left a comment


I don't think the master should handle filesystem allocation this way.

We should provide the master with a parameter indicating it can perform filesystem allocation, where the root path is xxx. Then it can simply append the object key to the root path?
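A minimal illustration of the suggested scheme (`PathForKey` is a hypothetical helper, not existing Mooncake code):

```cpp
#include <filesystem>
#include <string>

// The master is configured with a single root path and derives the file
// location of every object by appending its key.
std::filesystem::path PathForKey(const std::filesystem::path& root_fs_dir,
                                 const std::string& object_key) {
    // Real code would need to escape or hash keys that are not valid file
    // names; that is omitted here.
    return root_fs_dir / object_key;
}
```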

```cpp
value.append(static_cast<char*>(slice.ptr), slice.size);
}

write_thread_pool_.enqueue(
```
Collaborator


I believe we could move the write_thread_pool down to the storage backend, as it's not directly related to the Client class.

Contributor Author


Since the write_thread_pool has to call PutEnd and PutRevoke on the master, and, from an abstraction perspective, the storage layer should only handle file read/write operations while PutEnd and similar calls belong to the client->master interaction logic, I think moving the write_thread_pool down to the storage layer needs further discussion.

@SgtPepperr
Contributor Author

> I don't think the master should handle filesystem allocation this way.
>
> We should provide the master with a parameter indicating it can perform filesystem allocation, where the root path is xxx. Then it can simply append the object key to the root path?

Yes, I also believe that after migrating the metadata to be managed by the master, the persistence switch and corresponding paths should also be controlled by the master. The latest commit has already made the corresponding changes.

Additionally, since the master now manages the metadata, file read/write concurrency is controlled by the master and read/write conflicts can no longer occur, so the file read/write locks have also been removed.

@SgtPepperr SgtPepperr requested a review from xiaguan August 1, 2025 06:34
@SpecterCipher

Hello, I'd like to inquire about a question regarding high availability mode: When a master switchover occurs, will the new master delete both the in-memory data and disk files from the previous master?

Collaborator

@xiaguan xiaguan left a comment


Just some grammar suggestions. No major changes needed, but we should add some coverage in master_service_test.cpp for the modifications. Others LGTM.

@SgtPepperr
Contributor Author

> Hello, I'd like to inquire about a question regarding high availability mode: When a master switchover occurs, will the new master delete both the in-memory data and disk files from the previous master?

In the current high-availability implementation, when a master switchover occurs, the client automatically reconnects to the new master and mounts the corresponding segment. However, all mem-kv data is cleared, and the new master's metadata is also empty, so it cannot query any previously saved kv pairs, requiring a re-Put operation.

As for disk-kv, an automatic file deletion mechanism has not yet been introduced. Although disk-kv cannot be indexed by the new master, the kv files remain in their original paths.

In the future, we plan to introduce eviction or related mechanisms to ensure the deletion of disk-kv files.

@SpecterCipher

> Hello, I'd like to inquire about a question regarding high availability mode: When a master switchover occurs, will the new master delete both the in-memory data and disk files from the previous master?
>
> In the current high-availability implementation, when a master switchover occurs, the client automatically reconnects to the new master and mounts the corresponding segment. However, all mem-kv data is cleared, and the new master's metadata is also empty, so it cannot query any previously saved kv pairs, requiring a re-Put operation.
>
> As for disk-kv, an automatic file deletion mechanism has not yet been introduced. Although disk-kv cannot be indexed by the new master, the kv files remain in their original paths.
>
> In the future, we plan to introduce eviction or related mechanisms to ensure the deletion of disk-kv files.

Thank you for your response. May I ask if there will be similar persistent operations for KV metadata in the future to ensure the cache remains available after a master switch?

@SgtPepperr
Contributor Author

> Hello, I'd like to inquire about a question regarding high availability mode: When a master switchover occurs, will the new master delete both the in-memory data and disk files from the previous master?
>
> In the current high-availability implementation, when a master switchover occurs, the client automatically reconnects to the new master and mounts the corresponding segment. However, all mem-kv data is cleared, and the new master's metadata is also empty, so it cannot query any previously saved kv pairs, requiring a re-Put operation.
> As for disk-kv, an automatic file deletion mechanism has not yet been introduced. Although disk-kv cannot be indexed by the new master, the kv files remain in their original paths.
> In the future, we plan to introduce eviction or related mechanisms to ensure the deletion of disk-kv files.
>
> Thank you for your response. May I ask if there will be similar persistent operations for KV metadata in the future to ensure the cache remains available after a master switch?

Yes, according to my understanding, the high-availability mode's TODO list does include the recovery of master metadata after a failure. You can refer to the description in the previous PR #451 regarding this matter.

@SgtPepperr
Contributor Author

> Just some grammar suggestions. No major changes needed, but we should add some coverage in master_service_test.cpp for the modifications. Others LGTM.

Thanks! I have added master_service_ssd_test.cpp to cover correctness testing of MasterService behavior when the SSD offload feature is enabled.

@SgtPepperr SgtPepperr requested a review from xiaguan August 4, 2025 08:11
@SpecterCipher

> Hello, I'd like to inquire about a question regarding high availability mode: When a master switchover occurs, will the new master delete both the in-memory data and disk files from the previous master?
>
> In the current high-availability implementation, when a master switchover occurs, the client automatically reconnects to the new master and mounts the corresponding segment. However, all mem-kv data is cleared, and the new master's metadata is also empty, so it cannot query any previously saved kv pairs, requiring a re-Put operation.
> As for disk-kv, an automatic file deletion mechanism has not yet been introduced. Although disk-kv cannot be indexed by the new master, the kv files remain in their original paths.
> In the future, we plan to introduce eviction or related mechanisms to ensure the deletion of disk-kv files.
>
> Thank you for your response. May I ask if there will be similar persistent operations for KV metadata in the future to ensure the cache remains available after a master switch?
>
> Yes, according to my understanding, the high-availability mode's TODO list does include the recovery of master metadata after a failure. You can refer to the description in the previous PR #451 regarding this matter.

ok thx

Collaborator

@xiaguan xiaguan left a comment


LGTM

@stmatengss
Collaborator

One more thing: Could you check if it has conflicts with #710?

@SgtPepperr
Contributor Author

> One more thing: Could you check if it has conflicts with #710?

At the moment, there’s no direct conflict with the existing code; it’s just that the VRAM-side implementation will likely need to add file-handling logic inside the putToVram function later on.

@SgtPepperr
Contributor Author

Given that another PR #710 will require modifying the replica configuration to specify VRAM-based PUT operations, and the SSD feature also needs changes to the replica config to let users control SSD writes and to validate the client mount, we plan to submit a new PR later that introduces a unified resource type for replica configuration. This will consolidate replica metadata for DRAM, VRAM, and SSD into a single, consistent schema.

@stmatengss stmatengss mentioned this pull request Aug 12, 2025
29 tasks
@stmatengss stmatengss merged commit 81f492c into kvcache-ai:main Aug 14, 2025
11 checks passed
@SgtPepperr SgtPepperr deleted the ssd_master branch August 14, 2025 06:44
XucSh pushed a commit to XucSh/Mooncake that referenced this pull request Aug 14, 2025
…ce (kvcache-ai#690)

* initial commit

* fix client::query return fault

* fix isexist return fault

* fix test bug

* fix clearinvalidhandles problem

* add file description for 3fs

* change ssd function start from client to master

* fix naming error

* edit doc description

* edit doc

* clang format

* fix as the review comment

* fix formmat

* add master service test for ssd

* fix format

* add log and cli

* fix putend test