@yejj710
Contributor

yejj710 commented Oct 28, 2025

Background

Mooncake already supports tiered caching, but mooncake_master currently tracks only memory usage.

New features

  • Recording of SSD persistent storage usage
  • Configuration of the maximum available SSD capacity

@gemini-code-assist
Contributor

Summary of Changes

Hello @yejj710, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request extends Mooncake's storage monitoring capabilities beyond memory to include SSD persistent storage. It introduces a new configuration option for defining the total SSD capacity and implements the necessary logic within the MasterService and MasterMetricManager to accurately track allocated SSD space and report its usage. This enhancement provides a more comprehensive view of resource utilization in tiered caching environments.

Highlights

  • SSD Storage Usage Monitoring: Introduced new metrics and mechanisms to track the usage of SSD persistent storage, complementing the existing memory usage tracking in Mooncake's tiered caching system.
  • Configurable SSD Capacity: Added a new configuration parameter, global_file_segment_size, allowing users to define the maximum available capacity for SSD storage. This parameter has a default value of 500 GiB.
  • Metric Manager Enhancements: The MasterMetricManager now includes dedicated metrics for allocated file size and total file capacity on SSD, along with methods to increment, decrement, and retrieve these values, as well as calculate the global file usage ratio.
  • Dynamic Allocation Tracking: Integrated the new SSD usage tracking into the MasterService's PutStart and PutRevoke operations, ensuring that allocated_file_size is updated correctly when disk replicas are created or removed.
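
A minimal sketch of the bookkeeping described in the highlights above, assuming plain atomic counters rather than the metric types Mooncake actually uses. Only `inc_allocated_file_size` appears in this pull request; the class name `FileUsageMetrics` and the remaining method names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two SSD gauges; not Mooncake's actual class.
class FileUsageMetrics {
public:
    explicit FileUsageMetrics(int64_t total_capacity_bytes)
        : total_file_capacity_(total_capacity_bytes) {}

    // Called when a disk replica is created.
    void inc_allocated_file_size(int64_t bytes) {
        assert(bytes >= 0);
        allocated_file_size_.fetch_add(bytes, std::memory_order_relaxed);
    }

    // Called when a disk replica is revoked or removed.
    void dec_allocated_file_size(int64_t bytes) {
        assert(bytes >= 0);
        allocated_file_size_.fetch_sub(bytes, std::memory_order_relaxed);
    }

    int64_t get_allocated_file_size() const {
        return allocated_file_size_.load(std::memory_order_relaxed);
    }

    int64_t get_total_file_capacity() const {
        return total_file_capacity_.load(std::memory_order_relaxed);
    }

    // Allocated bytes divided by capacity; returns 0 when the capacity is
    // not positive so the division is always well defined.
    double get_global_file_used_ratio() const {
        const int64_t capacity = get_total_file_capacity();
        if (capacity <= 0) return 0.0;
        return static_cast<double>(get_allocated_file_size()) /
               static_cast<double>(capacity);
    }

private:
    std::atomic<int64_t> allocated_file_size_{0};
    std::atomic<int64_t> total_file_capacity_;
};
```

The increment and decrement calls have to be paired on every path that creates or drops a disk replica, otherwise the reported usage drifts over time.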

Contributor

gemini-code-assist bot left a comment

Code Review

This pull request introduces monitoring for secondary storage (SSD), which is a great addition for tiered caching. The changes for configuration and metric collection are mostly well-implemented. However, I've found a critical issue in PutRevoke that could lead to a service crash, and a significant logic omission where allocated_file_size is not decremented upon object removal, which will cause the metric to be inaccurate over time. I've also included a few medium-severity suggestions to improve code readability and correctness. Please review the comments for details.

static constexpr int64_t DEFAULT_CLIENT_LIVE_TTL_SEC = 10; // in seconds
static const std::string DEFAULT_CLUSTER_ID = "mooncake_cluster";
static const std::string DEFAULT_ROOT_FS_DIR = "";
static const uint64_t DEFAULT_GLOBAL_FILE_SEGMENT_SIZE = 536870912000; // 500 GiB

Collaborator

This would confuse the user. How about setting it to u64::MAX and showing "infinite"? Since we don't limit DFS usage right now.

Contributor Author

Good idea! I will fix it later.
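
The reviewer's suggestion can be sketched as follows, assuming the maximum `uint64_t` value serves as a "no limit" sentinel. The names `kUnlimitedFileCapacity`, `format_file_capacity`, and `file_used_ratio` are hypothetical and not taken from the Mooncake code base.

```cpp
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical sentinel meaning "no DFS capacity limit configured".
constexpr uint64_t kUnlimitedFileCapacity =
    std::numeric_limits<uint64_t>::max();

// Render the configured capacity for display; the sentinel is shown as
// "infinite" instead of a very large byte count.
std::string format_file_capacity(uint64_t capacity_bytes) {
    if (capacity_bytes == kUnlimitedFileCapacity) return "infinite";
    return std::to_string(capacity_bytes) + " bytes";
}

// A usage ratio is only meaningful for a finite, non-zero capacity;
// this sketch reports 0 otherwise.
double file_used_ratio(uint64_t allocated_bytes, uint64_t capacity_bytes) {
    if (capacity_bytes == kUnlimitedFileCapacity || capacity_bytes == 0) {
        return 0.0;
    }
    return static_cast<double>(allocated_bytes) /
           static_cast<double>(capacity_bytes);
}
```

One thing to check with this approach: if the capacity is later stored in a signed 64-bit gauge, the sentinel would need to be clamped or handled separately before the conversion.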

};

// Serialize Gauges
serialize_metric(allocated_size_);

Collaborator

Rename as mem_allocated_size? On the other hand, should we have a total_allocated_size metric?

Contributor Author

For question 1: I have already renamed the memory-related metrics.
For question 2: I don't see a concrete scenario for total_allocated_size. It would be more appropriate to extend the functionality when specific scenarios emerge in the future.

@stmatengss
Collaborator

The current PR fails the Build CI Test. Please resolve the issues. @yejj710

/home/runner/work/Mooncake/Mooncake/mooncake-store/src/master_metric_manager.cpp:25:7: error: class ‘mooncake::MasterMetricManager’ does not have any field named ‘total_file_capacity_’
   25 |       total_file_capacity_("master_total_file_capacity_bytes",
      |       ^~~~~~~~~~~~~~~~~~~~
/home/runner/work/Mooncake/Mooncake/mooncake-store/src/master_metric_manager.cpp:27:7: error: class ‘mooncake::MasterMetricManager’ does not have any field named ‘allocated_file_size_’
   27 |       allocated_file_size_("master_allocated_file_size_bytes",
      |       ^~~~~~~~~~~~~~~~~~~~

// --- Get current values ---
int64_t allocated = allocated_size_.value();
int64_t capacity = total_capacity_.value();
int64_t allocated = mem_allocated_size_.value();

Collaborator

Rename it as mem_allocated to keep it readable

Contributor Author

Thanks. This actually helped me identify a bug in the percentage utilization alerts

std::string file_path = ResolvePath(key);
replicas.emplace_back(file_path, total_length,
ReplicaStatus::PROCESSING);
MasterMetricManager::instance().inc_allocated_file_size(total_length);

Collaborator

Please make sure this is the only path of file allocation.

Contributor Author

OK, I've double-checked the code. No problem.

yejj710 changed the title from "[store] feat: add secondary storage usage monitor" to "[WIP][store] feat: add secondary storage usage monitor" on Oct 30, 2025
yejj710 changed the title from "[WIP][store] feat: add secondary storage usage monitor" to "[store] feat: add secondary storage usage monitor" on Oct 30, 2025

Collaborator

xiaguan left a comment

If you have some time, you could also update the docs that show up on our website.

If you don’t have much time, just let me know. The PR is ready to merge either way.

Note: When enabling this feature, the user must ensure that the DFS-mounted directory (`root_fs_dir=/path/to/dir`) is valid and consistent across all client hosts. If some clients have invalid or incorrect mount paths, it may cause abnormal behavior in Mooncake Store.

#### Persistent Storage Space Configuration
Mooncake provides configurable DFS available space. Users can specify `--global_file_segment_size=1048576` when starting the master, indicating a maximum usable space of 1MB on DFS.

Collaborator

Just let users know that this is only a metric; we don't evict anything on DFS right now.

Contributor Author

OK, I will add a more detailed description.

return tl::make_unexpected(ErrorCode::INVALID_WRITE);
}
// When disk replica is enabled, update allocated_file_size
if (use_disk_replica_ && replica_type == ReplicaType::DISK) {

Collaborator

How about we do it in ObjectMetadata's constructor and destructor? That's more RAII and the code looks cleaner.

Contributor Author

Moving it to the destructor sounds better. I will solve it today.
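
The RAII idea discussed in this thread can be sketched roughly as below. This is not the merged implementation: `DiskReplicaAccounting` and `g_allocated_file_size` are hypothetical names, and a plain atomic counter stands in for the metric manager's gauge.

```cpp
#include <atomic>
#include <cstdint>
#include <utility>

// Hypothetical counter standing in for the allocated-file-size gauge.
static std::atomic<int64_t> g_allocated_file_size{0};

// Adds `bytes` to the counter on construction and subtracts the same
// amount on destruction, so every exit path releases the accounting.
class DiskReplicaAccounting {
public:
    explicit DiskReplicaAccounting(int64_t bytes) : bytes_(bytes) {
        g_allocated_file_size.fetch_add(bytes_);
    }

    ~DiskReplicaAccounting() { g_allocated_file_size.fetch_sub(bytes_); }

    // Copying would count the same bytes twice, so it is disabled.
    DiskReplicaAccounting(const DiskReplicaAccounting&) = delete;
    DiskReplicaAccounting& operator=(const DiskReplicaAccounting&) = delete;

    // Moving transfers ownership; the moved-from object then releases 0 bytes.
    DiskReplicaAccounting(DiskReplicaAccounting&& other) noexcept
        : bytes_(std::exchange(other.bytes_, 0)) {}
    DiskReplicaAccounting& operator=(DiskReplicaAccounting&&) = delete;

private:
    int64_t bytes_;
};
```

Tying the decrement to the destructor means object removal, revocation, and error paths all release their bytes without an explicit call at each site.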

@stmatengss
Collaborator

@yejj710 Please confirm these issues are resolved so this PR can be merged before today's release.

@yejj710
Contributor Author

yejj710 commented Oct 31, 2025

> @yejj710 Please confirm these issues are resolved so this PR can be merged before today's release.

I have resolved these issues.

@stmatengss
Collaborator

This can't pass CI; please fix it. @yejj710

Collaborator

stmatengss left a comment

Please make the build succeed.

@yejj710
Contributor Author

yejj710 commented Nov 2, 2025

> Please make the build succeed.

Done. Please review it again @stmatengss

@stmatengss
Collaborator

stmatengss commented Nov 3, 2025

@yejj710 Is it ready for merging?

@yejj710
Contributor Author

yejj710 commented Nov 3, 2025

> @yejj710 Is it ready for merging?

yes ~ @stmatengss

stmatengss merged commit 14aea87 into kvcache-ai:main on Nov 3, 2025
11 checks passed