[Store]feat: Add 3fs native api plugin for KVCache storage persistence #610
Conversation
Overall, let’s treat 3fs as an optional plugin, so anyone who skips it sees a clean, untouched codebase.
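One way to realize that, as a minimal sketch of compile-time gating (the USE_3FS macro, header paths, and PosixFile name are assumptions, not taken from the PR):

```cpp
// Hypothetical sketch: gate the plugin at compile time so builds without it
// contain no 3fs symbols at all. Macro and names below are assumptions.
#ifdef USE_3FS
#include "three_fs_file.h"   // assumed plugin header
using PersistentFile = ThreeFSFile;
#else
#include "posix_file.h"      // assumed existing POSIX-backed implementation
using PersistentFile = PosixFile;
#endif
```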
```cpp
ssize_t ThreeFSFile::write(const std::string& buffer, size_t length) {
```
Could we avoid std::string as the buffer argument? Consider a span or a plain char* instead.
Issue Description
I attempted to use std::span for data management when calling StoreObject: in the PutToLocalFile function, a buffer is constructed with auto buffer = std::make_shared<std::vector<char>>(total_size) and converted to std::span<char> when calling StoreObject (sketched below). However, local testing revealed a performance degradation in Put compared to the original implementation.
Current Status
• Interface Layer: Added interfaces with std::span<char> parameters in StorageBackend and FileInterface.
• Call Layer: The asynchronous function PutToLocalFile on the Client side remains unchanged (not yet migrated to std::span).
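For reference, a minimal sketch of the attempted call path; StoreObject and PutToLocalFile follow the names in this thread, while the slice type and signatures around them are assumptions:

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <span>
#include <sys/types.h>  // ssize_t
#include <vector>

// Hypothetical stand-in for the PR's span-based StoreObject overload.
ssize_t StoreObject(std::span<char> data);

// Sketch of the attempted change inside PutToLocalFile: gather the slices
// into one contiguous buffer, then hand it to StoreObject as a span.
void PutToLocalFile(const std::vector<std::span<const char>>& slices) {
    size_t total_size = 0;
    for (const auto& s : slices) total_size += s.size();

    // The extra allocation and copy here is the suspected source of the
    // observed Put slowdown versus the original string-based path.
    auto buffer = std::make_shared<std::vector<char>>(total_size);
    size_t offset = 0;
    for (const auto& s : slices) {
        std::copy(s.begin(), s.end(), buffer->begin() + offset);
        offset += s.size();
    }
    StoreObject(std::span<char>(buffer->data(), buffer->size()));
}
```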
Please add a concise performance summary to the pull-request description.
…ols for clarity, and ensure StorageFile is fully visible before ThreeFSFile inherits it.
This is a fantastic feature! Could you please add usage instructions to the documentation?

I will add a 3fs feature user guide to the documentation soon.
Nice work! A couple tiny nits.
```cpp
 * @return Number of bytes written on success, -1 on error
 * @note Thread-safe operation with write locking
 */
virtual ssize_t write(std::span<const char> data, size_t length) = 0;
```
Is this duplicated?
We initially planned to change the write method from std::string to std::span<char>, so corresponding overloads were implemented in both storage_backend and file_interface. However, testing showed that the span<char> path in PutToLocalFile() performed poorly, so the upper layer continues to use the string interface. The overloaded interfaces are retained for potential future optimization, pending further analysis of the performance issue.
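A sketch of the retained overload pair (the method names follow the PR; the exact class shape is an assumption):

```cpp
#include <span>
#include <string>
#include <sys/types.h>  // ssize_t

class FileInterface {
public:
    virtual ~FileInterface() = default;
    // Hot path: still used by PutToLocalFile after the span variant proved slower.
    virtual ssize_t write(const std::string& buffer, size_t length) = 0;
    // Retained for future optimization once the span performance issue is understood.
    virtual ssize_t write(std::span<const char> data, size_t length) = 0;
};
```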
Well done. No major changes needed.
Just make sure to handle errors for some edge cases.
```cpp
// USRBIO related parameters
std::string mount_root = "/";  // Mount point root directory
size_t iov_size = 32 << 20;    // Shared memory size (32MB)
size_t ior_entries = 16;       // Maximum number of requests in IO ring
```
If the batch size is greater than 16, what will happen?
Each thread has its own USRBIO resources (iov, ior, etc.), so the ior is now separated per thread in batchget. Moreover, in the current implementation only one I/O request is submitted to the ior at a time, waiting for completion before submitting the next, which avoids ior overflow (splitting 32MB into 4×8MB I/O requests showed no significant performance gain in local tests, so that approach was not adopted).
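A minimal sketch of that submit-one-wait-one pattern; the hf3fs_* entry points and struct fields follow my reading of the 3FS USRBIO headers, so treat the exact signatures as assumptions:

```cpp
#include <hf3fs_usrbio.h>  // assumed header name
#include <sys/types.h>     // ssize_t

// Issue exactly one read through the ring and wait for it, so the 16-entry
// ior can never overflow regardless of the batch size.
ssize_t read_chunk(hf3fs_ior* ior, hf3fs_iov* iov, int fd,
                   size_t offset, size_t len) {
    if (hf3fs_prep_io(ior, iov, /*read=*/true, iov->base, fd,
                      offset, len, /*userdata=*/nullptr) < 0)
        return -1;
    if (hf3fs_submit_ios(ior) < 0) return -1;

    hf3fs_cqe cqe;
    // Block until this request completes before the caller submits the next.
    if (hf3fs_wait_for_ios(ior, &cqe, /*count=*/1, /*min_results=*/1,
                           /*abs_timeout=*/nullptr) < 0)
        return -1;
    return cqe.result;  // bytes transferred, or a negative error code
}
```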
```cpp
// USRBIO related parameters
std::string mount_root = "/";  // Mount point root directory
size_t iov_size = 32 << 20;    // Shared memory size (32MB)
```
Same question: if the value size is bigger than 32MB, what will happen?
The current implementation handles values exceeding iov_size by splitting the operation into multiple read-and-copy iterations within a loop (e.g., for 64MB of data, it performs two passes to read into the iov and copy to slices).
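A sketch of that chunking loop, reusing the hypothetical read_chunk helper from the earlier sketch (slice_out stands in for the upper layer's destination slice):

```cpp
#include <algorithm>
#include <cstring>

ssize_t read_large_value(hf3fs_ior* ior, hf3fs_iov* iov, int fd,
                         char* slice_out, size_t total_len, size_t iov_size) {
    size_t done = 0;
    while (done < total_len) {
        size_t chunk = std::min(iov_size, total_len - done);  // e.g. 32MB per pass
        ssize_t n = read_chunk(ior, iov, fd, /*offset=*/done, chunk);
        if (n <= 0) return -1;                         // propagate read failure
        std::memcpy(slice_out + done, iov->base, n);   // extra copy: iov -> slice
        done += static_cast<size_t>(n);
    }
    return static_cast<ssize_t>(done);
}
```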
```cpp
tl::expected<void, ErrorCode> LoadObject(std::string& path, std::string& str, size_t length);

/**
 * @brief Checks if an object with the given key exists
```
Could you keep the original order? These changes to Existkey are unnecessary.
```
@@ -0,0 +1,42 @@
# Mooncake HF3FS Plugin
```
Consider moving it to the doc dir.
```cpp
        return make_error<size_t>(ErrorCode::FILE_WRITE_FAIL);
    }

    return total_bytes_written;
```
I'm not sure the return value is correct here.
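A hedged sketch of one way to make the return value unambiguous: count only the bytes the underlying write actually reported, and fail on short or failed writes. make_error/ErrorCode and FileInterface follow the PR; the loop structure and write_all name are assumptions:

```cpp
#include <tl/expected.hpp>

tl::expected<size_t, ErrorCode> write_all(FileInterface& file,
                                          const std::string& data) {
    size_t total_bytes_written = 0;
    while (total_bytes_written < data.size()) {
        // substr copy is for brevity only; a real implementation would pass
        // a pointer or span at the current offset instead.
        ssize_t n = file.write(data.substr(total_bytes_written),
                               data.size() - total_bytes_written);
        if (n <= 0) {
            return make_error<size_t>(ErrorCode::FILE_WRITE_FAIL);
        }
        total_bytes_written += static_cast<size_t>(n);
    }
    return total_bytes_written;  // guaranteed to equal data.size() on success
}
```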
Background
In the previous PR #437, the initial version of KV cache persistence and tiering functionality was implemented. However, since POSIX interfaces were used for read/write operations, the performance in testing was relatively modest.
Changes Introduced
To improve file read/write performance and make the tiered caching functionality fully viable, this PR introduces the 3fs native API (USRBIO) into the store project as a plugin, significantly reducing the latency of performance-sensitive operations such as get and batchget when reading files.

Key Updates:
• Enables high-performance read operations in 3fs scenarios
• file_interface: improved support for both POSIX and 3fs read/write modes
• Added batchput mode support for KV cache persistence
• Added a persistence path specification option to stress_cluster_benchmark.py

Performance Testing
Test Setup
• stress_cluster_benchmark.py (testing batch_put_from and batch_get_into)
• file_read_thread: 10 (consistent across tests)

Test Results (3fs Native API BatchGet Throughput Comparison)
Problem
1. Asynchronous Write Performance Issue
Summary: Identified a performance degradation in 3FS asynchronous writes and proposed a solution.
Currently, persistence is achieved through asynchronous writes, but the data copy performed before the asynchronous write in 3FS can cause significant performance degradation. Profiling reveals that the number of page faults triggered in this scenario is nearly double the normal count. Future plans include introducing a reuse buffer list to address this performance degradation.
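A hypothetical sketch of such a reuse buffer list: recycle large write buffers instead of reallocating them, so their pages stay faulted-in across Puts (the BufferPool name and shape are assumptions, not from the PR):

```cpp
#include <mutex>
#include <vector>

class BufferPool {
public:
    std::vector<char> acquire(size_t size) {
        std::lock_guard<std::mutex> lk(mu_);
        if (!free_.empty()) {
            std::vector<char> buf = std::move(free_.back());
            free_.pop_back();
            buf.resize(size);  // reuses warmed pages; reallocates only if too small
            return buf;
        }
        return std::vector<char>(size);  // cold path: first-touch page faults
    }
    void release(std::vector<char> buf) {
        std::lock_guard<std::mutex> lk(mu_);
        free_.push_back(std::move(buf));  // keep pages resident for the next Put
    }
private:
    std::mutex mu_;
    std::vector<std::vector<char>> free_;
};
```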
2. Read Path Copy Overhead
Summary: Existing read path copy mechanism and deferred optimization rationale.
Currently, the read path incurs an additional copy (iov → slice) due to 3FS's native API mechanism. This copy could be eliminated by ensuring the upper-layer slice uses shared memory (shm) and directly reuses it as the iov's address space, enabling zero-copy. However, this optimization has been temporarily postponed because it would require: