
Conversation

Contributor

@SgtPepperr SgtPepperr commented Jul 10, 2025

Background

In the previous PR #437, the initial version of KV cache persistence and tiering functionality was implemented. However, since POSIX interfaces were used for read/write operations, the performance in testing was relatively modest.

Changes Introduced

To improve file read/write performance and make the tiered caching functionality fully viable, this PR introduces the 3fs native API (USRBIO) into the store project as a plugin, significantly reducing the latency of performance-sensitive operations such as get and batchget when reading files.
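As a rough illustration, a single USRBIO read is expected to follow the flow below. This is a hedged sketch based on the publicly documented 3FS USRBIO interface; the header name, exact signatures, and how the merged plugin wraps them are assumptions, not the actual implementation.

```cpp
#include <hf3fs_usrbio.h>  // 3FS USRBIO user-space API (assumed header name)
#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <string>

// Hedged sketch: read `length` bytes at `offset` from a file under the 3FS
// mount into `dst` via USRBIO shared memory. Error handling is elided.
ssize_t usrbio_read(const std::string& mount, const std::string& path,
                    char* dst, size_t offset, size_t length) {
    struct hf3fs_iov iov;  // shared-memory region visible to the 3FS daemon
    struct hf3fs_ior ior;  // I/O ring used to submit and complete requests
    hf3fs_iovcreate(&iov, mount.c_str(), length, /*block_size=*/0, /*numa=*/-1);
    hf3fs_iorcreate4(&ior, mount.c_str(), /*entries=*/16, /*for_read=*/true,
                     /*io_depth=*/0, /*timeout=*/0, /*numa=*/-1, /*flags=*/0);

    int fd = open(path.c_str(), O_RDONLY);
    hf3fs_reg_fd(fd, 0);  // the fd must be registered before USRBIO use

    // One request: read into the shared-memory iov, then copy out to dst.
    hf3fs_prep_io(&ior, &iov, /*read=*/true, iov.base, fd, offset, length,
                  /*userdata=*/nullptr);
    hf3fs_submit_ios(&ior);

    struct hf3fs_cqe cqe;
    hf3fs_wait_for_ios(&ior, &cqe, /*max_cqes=*/1, /*min_results=*/1,
                       /*abs_timeout=*/nullptr);
    if (cqe.result >= 0) std::memcpy(dst, iov.base, cqe.result);

    hf3fs_dereg_fd(fd);
    close(fd);
    hf3fs_iordestroy(&ior);
    hf3fs_iovdestroy(&iov);
    return cqe.result;  // bytes read, or a negative error code
}
```

Note the final memcpy from iov.base into the destination buffer; this is the extra iov → slice copy discussed under "Read Path Copy Overhead" below.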

Key Updates:

  • Added 3fs native API support
    Enables high-performance read operations in 3fs scenarios
  • Refactored file_interface
    Improved support for both POSIX and 3fs read/write modes
  • Enhanced KV cache persistence
    Added batchput mode support for KV cache persistence
  • Benchmark tool update
    Added persistence path specification option to stress_cluster_benchmark.py

Performance Testing

Test Setup

  • Tool: stress_cluster_benchmark.py (testing batch_put_from and batch_get_into)
  • Workflow:
    1. Prefill on one node
    2. Decode on another node
  • Configuration:
    • file_read_thread: 10 (consistent across tests)
    • Nodes used: 3
      • Node A: General server running:
        • 3fs meta service
        • Mooncake master + meta services
      • Node B: Storage server running:
        • 3fs storage service
        • Mooncake client (prefill/decode)
      • Node C: Storage server running:
        • 3fs storage service
        • Mooncake client (prefill/decode)
    • Data flow: Bidirectional data transfer between Nodes B and C

Test Results (3fs Native API BatchGet Throughput Comparison)

| Configuration | Batch Count | BatchGet Throughput (MB/s) |
|---------------|-------------|----------------------------|
| Value Size: 16MB, put: mem+disk, get: disk | 1 | 3113 |
| | 4 | 5247 |
| | 8 | 6206 |
| | 16 | 6778 |
| Value Size: 64MB, put: mem+disk, get: disk | 1 | 3913 |
| | 4 | 7114 |
| | 8 | 8169 |
| | 16 | 8476 |

Problem

1. Asynchronous Write Performance Issue

Summary: Identified performance degradation in 3FS asynchronous writes and a proposed mitigation.

Persistence is currently achieved through asynchronous writes, but the data copy performed before the asynchronous write in 3FS can cause significant performance degradation. Profiling reveals that the number of page faults triggered in this scenario is nearly double the normal count. A future change will introduce a reusable buffer list to address this degradation.
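A reusable buffer list could look roughly like the sketch below. This is only an assumption about the planned mitigation; the class and member names are illustrative, not the actual design.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical buffer pool: buffers are allocated (and page-faulted) once,
// then recycled across asynchronous writes instead of being freshly
// allocated and copied into for every request.
class BufferPool {
   public:
    BufferPool(size_t buffer_size, size_t count) {
        for (size_t i = 0; i < count; ++i) {
            auto buf = std::make_unique<std::vector<char>>(buffer_size);
            // Touch the memory up front so later writes do not page-fault.
            std::fill(buf->begin(), buf->end(), 0);
            free_list_.push_back(std::move(buf));
        }
    }

    std::unique_ptr<std::vector<char>> Acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_list_.empty()) return nullptr;  // caller falls back to a fresh allocation
        auto buf = std::move(free_list_.back());
        free_list_.pop_back();
        return buf;
    }

    void Release(std::unique_ptr<std::vector<char>> buf) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_list_.push_back(std::move(buf));
    }

   private:
    std::mutex mutex_;
    std::vector<std::unique_ptr<std::vector<char>>> free_list_;
};
```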

2. Read Path Copy Overhead

Summary: Existing read path copy mechanism and deferred optimization rationale.

Currently, the read path incurs an additional copy (iov → slice) due to 3FS's native API mechanism. This copy could be eliminated by ensuring the upper-layer slice uses shared memory (shm) and directly reuses it as the iov's address space, enabling zero-copy. However, this optimization has been temporarily postponed because it would require:

  • Modifications to the upper Python interface
  • Exposure of 3FS implementation details to the application layer
  • Intrusive changes to the codebase

@SgtPepperr SgtPepperr marked this pull request as draft July 10, 2025 07:45
@SgtPepperr SgtPepperr marked this pull request as ready for review July 16, 2025 07:27
@xiaguan xiaguan self-requested a review July 17, 2025 06:29
Collaborator

@xiaguan xiaguan left a comment

Overall, let’s treat 3fs as an optional plugin, so anyone who skips it sees a clean, untouched codebase.

}
}

ssize_t ThreeFSFile::write(const std::string& buffer, size_t length) {
Collaborator

Perchance, eschew std::string as the buffer argument? Consider adopting a span or a plain char* instead.

Contributor Author

Issue Description

I attempted to use std::span for data management when calling StoreObject: in the PutToLocalFile function, a buffer was constructed as auto buffer = std::make_shared<std::vector<char>>(total_size) and converted to std::span<char> when calling StoreObject. However, local testing revealed a Put performance degradation compared to the original implementation.
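For reference, the attempted change looked roughly like the following. The Slice shape, the StoreObject signature, and the function name are assumptions for illustration only.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <span>
#include <string>
#include <vector>

struct Slice { void* ptr; size_t size; };  // assumed shape of the store's slice

// Stand-in declaration; the real StoreObject lives on the storage backend and
// its exact signature is an assumption here.
void StoreObject(const std::string& key, std::span<char> data);

// Hedged sketch of the attempted change: gather the slices into one
// contiguous buffer, then pass a std::span<char> view of it to StoreObject.
// Local tests showed this regressed Put performance versus the original
// string-based path, so the call layer was left on the string interface.
void PutToLocalFileSpanSketch(const std::string& key,
                              const std::vector<Slice>& slices,
                              size_t total_size) {
    auto buffer = std::make_shared<std::vector<char>>(total_size);
    size_t offset = 0;
    for (const auto& slice : slices) {
        std::memcpy(buffer->data() + offset, slice.ptr, slice.size);
        offset += slice.size;
    }
    StoreObject(key, std::span<char>(buffer->data(), buffer->size()));
}
```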

Current Status

• Interface Layer: Added interfaces with std::span<char> parameters in StorageBackend and FileInterface.

• Call Layer: The asynchronous function PutToLocalFile on the Client side remains unchanged (not yet migrated to std::span).

Collaborator

xiaguan commented Jul 17, 2025

Kindly incorporate a concise performance summary within the pull-request description.

sgt added 4 commits July 18, 2025 11:20
@SgtPepperr SgtPepperr requested a review from xiaguan July 18, 2025 08:02
@LuyuZhang00
Contributor

This is a fantastic feature! Could you please add usage instructions to the documentation?

@SgtPepperr
Contributor Author

I will add a 3fs feature user guide to the documentation soon.

Collaborator

@xiaguan xiaguan left a comment

Nice work! A couple tiny nits.

* @return Number of bytes written on success, -1 on error
* @note Thread-safe operation with write locking
*/
virtual ssize_t write(std::span<const char> data, size_t length) = 0;
Collaborator

it's duplicated?

Contributor Author

The initial plan was to change the write method from string to span<char>, so corresponding overloaded methods were implemented in both storage_backend and file_interface. However, testing revealed that using the span<char> format in PutToLocalFile() performed poorly, so the upper layer continues to use the string interface. The overloaded interfaces are retained for potential future optimization after further analysis of the performance issue.

@SgtPepperr SgtPepperr requested a review from xiaguan July 21, 2025 13:08
Collaborator

@xiaguan xiaguan left a comment

Well done. No major changes needed.

Just make sure to handle errors for some edge cases.

// USRBIO related parameters
std::string mount_root = "/"; // Mount point root directory
size_t iov_size = 32 << 20; // Shared memory size (32MB)
size_t ior_entries = 16; // Maximum number of requests in IO ring
Collaborator

If the batch size is greater than 16, what will happen?

Contributor Author

Each thread has its own USRBIO resources (iov, ior, etc.), so the ior is now separated per thread in batchget. Besides, in the current implementation only one I/O request is submitted to the ior at a time, waiting for its completion before submitting the next, thus avoiding ior overflow (splitting 32MB into 4×8MB I/O requests showed no significant performance gain in local tests, so that approach was not adopted).
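Roughly, the per-thread ownership described above could be expressed as follows (a hedged sketch under assumed names; the real member layout may differ):

```cpp
#include <hf3fs_usrbio.h>  // assumed header name for the 3FS USRBIO API

// Hedged sketch: every worker thread keeps its own iov (shared memory) and
// ior (I/O ring), created once on first use via hf3fs_iovcreate /
// hf3fs_iorcreate4, so batchget threads never contend for a shared ring.
// Combined with submitting one request at a time and waiting for it before
// issuing the next, ior_entries can never overflow.
struct ThreadUsrbioResources {
    hf3fs_iov iov{};
    hf3fs_ior ior{};
    bool initialized = false;  // resources are created lazily per thread
};

thread_local ThreadUsrbioResources tls_usrbio;
```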


// USRBIO related parameters
std::string mount_root = "/"; // Mount point root directory
size_t iov_size = 32 << 20; // Shared memory size (32MB)
Collaborator

Same question: if the value size is bigger than 32MB, what will happen?

Contributor Author

The current implementation handles values exceeding iov_size by splitting the operation into multiple read-and-copy iterations within a loop (e.g., for 64MB data, it performs two passes to read into the iov and copy to slices).
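A minimal sketch of that loop, assuming a helper that issues one USRBIO read into the thread's shared-memory iov (the helper name and signature are illustrative, not the merged code):

```cpp
#include <sys/types.h>
#include <algorithm>
#include <cstddef>
#include <cstring>

// Assumed helper: issue a single USRBIO read of `len` bytes at `offset` into
// the thread's iov and return the bytes read (see the per-thread sketch above).
ssize_t UsrbioReadIntoIov(int fd, size_t offset, size_t len);

// Hedged sketch: values larger than iov_size are read in iov_size chunks,
// each pass reading into the shared-memory iov and copying out to the slice.
// For a 64MB value with a 32MB iov, this performs two read-and-copy passes.
ssize_t ReadLargeValue(int fd, char* iov_base, size_t iov_size,
                       char* slice, size_t value_size) {
    size_t done = 0;
    while (done < value_size) {
        size_t chunk = std::min(iov_size, value_size - done);
        ssize_t n = UsrbioReadIntoIov(fd, /*offset=*/done, /*len=*/chunk);
        if (n <= 0) return -1;                   // propagate read failure
        std::memcpy(slice + done, iov_base, n);  // iov -> slice copy
        done += static_cast<size_t>(n);
    }
    return static_cast<ssize_t>(done);
}
```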

@SgtPepperr SgtPepperr requested a review from xiaguan July 22, 2025 12:14
tl::expected<void, ErrorCode> LoadObject(std::string& path, std::string& str, size_t length) ;

/**
* @brief Checks if an object with the given key exists
Collaborator

Don't change the order? These changes to Existkey are unnecessary.

@@ -0,0 +1,42 @@
# Mooncake HF3FS Plugin
Collaborator

Consider moving it to the doc dir.

return make_error<size_t>(ErrorCode::FILE_WRITE_FAIL);
}

return total_bytes_written;
Collaborator

I'm not sure the return value is correct

@stmatengss stmatengss merged commit 553490c into kvcache-ai:main Jul 24, 2025
10 checks passed
@stmatengss stmatengss mentioned this pull request Aug 12, 2025
29 tasks