
Conversation

Contributor

@SgtPepperr SgtPepperr commented Jul 10, 2025

Background

In the previous PR #437, the initial version of KV cache persistence and tiering functionality was implemented. However, since POSIX interfaces were used for read/write operations, the performance in testing was relatively modest.

Changes Introduced

To improve file read/write performance and make the tiered caching functionality fully viable, this PR introduces the 3fs native API (USRBIO) into the store project as a plugin, significantly reducing the latency of performance-sensitive operations such as get and batchget when reading files.
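As a rough illustration, a single USRBIO read is expected to follow the flow below. This is a hedged sketch based on the publicly documented 3FS USRBIO interface; the header name, exact signatures, and how the merged plugin wraps them are assumptions, not the actual implementation.

```cpp
#include <hf3fs_usrbio.h>  // 3FS USRBIO user-space API (assumed header name)
#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <string>

// Hedged sketch: read `length` bytes at `offset` from a file under the 3FS
// mount into `dst` via USRBIO shared memory. Error handling is elided.
ssize_t usrbio_read(const std::string& mount, const std::string& path,
                    char* dst, size_t offset, size_t length) {
    struct hf3fs_iov iov;  // shared-memory region visible to the 3FS daemon
    struct hf3fs_ior ior;  // I/O ring used to submit and complete requests
    hf3fs_iovcreate(&iov, mount.c_str(), length, /*block_size=*/0, /*numa=*/-1);
    hf3fs_iorcreate4(&ior, mount.c_str(), /*entries=*/16, /*for_read=*/true,
                     /*io_depth=*/0, /*timeout=*/0, /*numa=*/-1, /*flags=*/0);

    int fd = open(path.c_str(), O_RDONLY);
    hf3fs_reg_fd(fd, 0);  // the fd must be registered before USRBIO use

    // One request: read into the shared-memory iov, then copy out to dst.
    hf3fs_prep_io(&ior, &iov, /*read=*/true, iov.base, fd, offset, length,
                  /*userdata=*/nullptr);
    hf3fs_submit_ios(&ior);

    struct hf3fs_cqe cqe;
    hf3fs_wait_for_ios(&ior, &cqe, /*max_cqes=*/1, /*min_results=*/1,
                       /*abs_timeout=*/nullptr);
    if (cqe.result >= 0) std::memcpy(dst, iov.base, cqe.result);

    hf3fs_dereg_fd(fd);
    close(fd);
    hf3fs_iordestroy(&ior);
    hf3fs_iovdestroy(&iov);
    return cqe.result;  // bytes read, or a negative error code
}
```

Note the final memcpy from iov.base into the destination buffer; this is the extra iov → slice copy discussed under "Read Path Copy Overhead" below.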

Key Updates:

  • Added 3fs native API support
    Enables high-performance read operations in 3fs scenarios
  • Refactored file_interface
    Improved support for both POSIX and 3fs read/write modes
  • Enhanced KV cache persistence
    Added batchput mode support for KV cache persistence
  • Benchmark tool update
    Added persistence path specification option to stress_cluster_benchmark.py

Performance Testing

Test Setup

  • Tool: stress_cluster_benchmark.py (testing batch_put_from and batch_get_into)
  • Workflow:
    1. Prefill on one node
    2. Decode on another node
  • Configuration:
    • file_read_thread: 10 (consistent across tests)
    • Nodes used: 3
      • Node A: General server running:
        • 3fs meta service
        • Mooncake master + meta services
      • Node B: Storage server running:
        • 3fs storage service
        • Mooncake client (prefill/decode)
      • Node C: Storage server running:
        • 3fs storage service
        • Mooncake client (prefill/decode)
    • Data flow: Bidirectional data transfer between Nodes B and C

Test Results (3fs Native API BatchGet Throughput Comparison)

| Configuration | Batch Count | BatchGet Throughput (MB/s) |
|---------------|-------------|----------------------------|
| Value Size: 16MB, put: mem+disk, get: disk | 1 | 3113 |
| | 4 | 5247 |
| | 8 | 6206 |
| | 16 | 6778 |
| Value Size: 64MB, put: mem+disk, get: disk | 1 | 3913 |
| | 4 | 7114 |
| | 8 | 8169 |
| | 16 | 8476 |

Problem

1. Asynchronous Write Performance Issue

Summary: Identified performance degradation in 3FS asynchronous writes and a proposed mitigation.

Persistence is currently achieved through asynchronous writes, but the data copy performed before the asynchronous write in 3FS can cause significant performance degradation. Profiling reveals that the number of page faults triggered in this scenario is nearly double the normal count. A future change will introduce a reusable buffer list to address this degradation.
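A reusable buffer list could look roughly like the sketch below. This is only an assumption about the planned mitigation; the class and member names are illustrative, not the actual design.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical buffer pool: buffers are allocated (and page-faulted) once,
// then recycled across asynchronous writes instead of being freshly
// allocated and copied into for every request.
class BufferPool {
   public:
    BufferPool(size_t buffer_size, size_t count) {
        for (size_t i = 0; i < count; ++i) {
            auto buf = std::make_unique<std::vector<char>>(buffer_size);
            // Touch the memory up front so later writes do not page-fault.
            std::fill(buf->begin(), buf->end(), 0);
            free_list_.push_back(std::move(buf));
        }
    }

    std::unique_ptr<std::vector<char>> Acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_list_.empty()) return nullptr;  // caller falls back to a fresh allocation
        auto buf = std::move(free_list_.back());
        free_list_.pop_back();
        return buf;
    }

    void Release(std::unique_ptr<std::vector<char>> buf) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_list_.push_back(std::move(buf));
    }

   private:
    std::mutex mutex_;
    std::vector<std::unique_ptr<std::vector<char>>> free_list_;
};
```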

2. Read Path Copy Overhead

Summary: Existing read path copy mechanism and deferred optimization rationale.

Currently, the read path incurs an additional copy (iov → slice) due to 3FS's native API mechanism. This copy could be eliminated by ensuring the upper-layer slice uses shared memory (shm) and directly reuses it as the iov's address space, enabling zero-copy. However, this optimization has been temporarily postponed because it would require:

  • Modifications to the upper Python interface
  • Exposure of 3FS implementation details to the application layer
  • Intrusive changes to the codebase

@SgtPepperr SgtPepperr marked this pull request as draft July 10, 2025 07:45
@SgtPepperr SgtPepperr marked this pull request as ready for review July 16, 2025 07:27
@xiaguan xiaguan self-requested a review July 17, 2025 06:29
Collaborator

@xiaguan xiaguan left a comment

Overall, let’s treat 3fs as an optional plugin, so anyone who skips it sees a clean, untouched codebase.

}
}

ssize_t ThreeFSFile::write(const std::string& buffer, size_t length) {
Collaborator

Perchance, eschew std::string as the buffer argument? Consider adopting a span or a plain char* instead.

Contributor Author

Issue Description

I attempted to use std::span for data management when calling StoreObject: in the PutToLocalFile function, a buffer was constructed as auto buffer = std::make_shared<std::vector<char>>(total_size) and converted to std::span<char> when calling StoreObject. However, local testing revealed a Put performance degradation compared to the original implementation.
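For reference, the attempted change looked roughly like the following. The Slice shape, the StoreObject signature, and the function name are assumptions for illustration only.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <span>
#include <string>
#include <vector>

struct Slice { void* ptr; size_t size; };  // assumed shape of the store's slice

// Stand-in declaration; the real StoreObject lives on the storage backend and
// its exact signature is an assumption here.
void StoreObject(const std::string& key, std::span<char> data);

// Hedged sketch of the attempted change: gather the slices into one
// contiguous buffer, then pass a std::span<char> view of it to StoreObject.
// Local tests showed this regressed Put performance versus the original
// string-based path, so the call layer was left on the string interface.
void PutToLocalFileSpanSketch(const std::string& key,
                              const std::vector<Slice>& slices,
                              size_t total_size) {
    auto buffer = std::make_shared<std::vector<char>>(total_size);
    size_t offset = 0;
    for (const auto& slice : slices) {
        std::memcpy(buffer->data() + offset, slice.ptr, slice.size);
        offset += slice.size;
    }
    StoreObject(key, std::span<char>(buffer->data(), buffer->size()));
}
```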

Current Status

• Interface Layer: Added interfaces with std::span<char> parameters in StorageBackend and FileInterface.

• Call Layer: The asynchronous function PutToLocalFile on the Client side remains unchanged (not yet migrated to std::span).

Collaborator

xiaguan commented Jul 17, 2025

Kindly incorporate a concise performance summary within the pull-request description.

sgt added 4 commits July 18, 2025 11:20
@SgtPepperr SgtPepperr requested a review from xiaguan July 18, 2025 08:02
@LuyuZhang00
Contributor

This is a fantastic feature! Could you please add usage instructions to the documentation?

@SgtPepperr
Contributor Author

I will add a 3fs feature user guide to the documentation soon.

Collaborator

@xiaguan xiaguan left a comment

Nice work! A couple tiny nits.

* @return Number of bytes written on success, -1 on error
* @note Thread-safe operation with write locking
*/
virtual ssize_t write(std::span<const char> data, size_t length) = 0;
Collaborator

it's duplicated?

Contributor Author

The initial plan was to change the write method from string to span<char>, so corresponding overloaded methods were implemented in both storage_backend and file_interface. However, testing revealed that using the span<char> format in PutToLocalFile() performed poorly, so the upper layer continues to use the string interface. The overloaded interfaces are retained for potential future optimization after further analysis of the performance issue.

@SgtPepperr SgtPepperr requested a review from xiaguan July 21, 2025 13:08
Collaborator

@xiaguan xiaguan left a comment

Well done. No major changes needed.

Just make sure to handle errors for some edge cases.

// USRBIO related parameters
std::string mount_root = "/"; // Mount point root directory
size_t iov_size = 32 << 20; // Shared memory size (32MB)
size_t ior_entries = 16; // Maximum number of requests in IO ring
Collaborator

If the batch size is greater than 16, what will happen?

Contributor Author

Each thread has its own USRBIO resources (iov, ior, etc.), so the ior is now separated per thread in batchget. Besides, in the current implementation only one I/O request is submitted to the ior at a time, waiting for its completion before submitting the next, thus avoiding ior overflow (splitting 32MB into 4×8MB I/O requests showed no significant performance gain in local tests, so that approach was not adopted).
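Roughly, the per-thread ownership described above could be expressed as follows (a hedged sketch under assumed names; the real member layout may differ):

```cpp
#include <hf3fs_usrbio.h>  // assumed header name for the 3FS USRBIO API

// Hedged sketch: every worker thread keeps its own iov (shared memory) and
// ior (I/O ring), created once on first use via hf3fs_iovcreate /
// hf3fs_iorcreate4, so batchget threads never contend for a shared ring.
// Combined with submitting one request at a time and waiting for it before
// issuing the next, ior_entries can never overflow.
struct ThreadUsrbioResources {
    hf3fs_iov iov{};
    hf3fs_ior ior{};
    bool initialized = false;  // resources are created lazily per thread
};

thread_local ThreadUsrbioResources tls_usrbio;
```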


// USRBIO related parameters
std::string mount_root = "/"; // Mount point root directory
size_t iov_size = 32 << 20; // Shared memory size (32MB)
Collaborator

Same question: if the value size is bigger than 32MB, what will happen?

Contributor Author

The current implementation handles values exceeding iov_size by splitting the operation into multiple read-and-copy iterations within a loop (e.g., for 64MB data, it performs two passes to read into the iov and copy to slices).
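A minimal sketch of that loop, assuming a helper that issues one USRBIO read into the thread's shared-memory iov (the helper name and signature are illustrative, not the merged code):

```cpp
#include <sys/types.h>
#include <algorithm>
#include <cstddef>
#include <cstring>

// Assumed helper: issue a single USRBIO read of `len` bytes at `offset` into
// the thread's iov and return the bytes read (see the per-thread sketch above).
ssize_t UsrbioReadIntoIov(int fd, size_t offset, size_t len);

// Hedged sketch: values larger than iov_size are read in iov_size chunks,
// each pass reading into the shared-memory iov and copying out to the slice.
// For a 64MB value with a 32MB iov, this performs two read-and-copy passes.
ssize_t ReadLargeValue(int fd, char* iov_base, size_t iov_size,
                       char* slice, size_t value_size) {
    size_t done = 0;
    while (done < value_size) {
        size_t chunk = std::min(iov_size, value_size - done);
        ssize_t n = UsrbioReadIntoIov(fd, /*offset=*/done, /*len=*/chunk);
        if (n <= 0) return -1;                   // propagate read failure
        std::memcpy(slice + done, iov_base, n);  // iov -> slice copy
        done += static_cast<size_t>(n);
    }
    return static_cast<ssize_t>(done);
}
```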

@SgtPepperr SgtPepperr requested a review from xiaguan July 22, 2025 12:14
tl::expected<void, ErrorCode> LoadObject(std::string& path, std::string& str, size_t length) ;

/**
* @brief Checks if an object with the given key exists
Collaborator

Don't change the order? These changes to Existkey are unnecessary.

@@ -0,0 +1,42 @@
# Mooncake HF3FS Plugin
Collaborator

Consider moving it to the doc dir.

return make_error<size_t>(ErrorCode::FILE_WRITE_FAIL);
}

return total_bytes_written;
Collaborator

I'm not sure the return value is correct

@stmatengss stmatengss merged commit 553490c into kvcache-ai:main Jul 24, 2025
10 checks passed
@stmatengss stmatengss mentioned this pull request Aug 12, 2025
29 tasks