Pooling async memecpy v1 #986
Conversation
Summary of Changes

Hello @AscendTransport, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly advances the integration and optimization for Ascend NPU devices within the system. It introduces a more efficient memory copy strategy by leveraging Ascend-specific APIs and refactors the core transfer engine for better resource sharing and pooling. The changes also include new batch data transfer functionalities tailored for Ascend, alongside a comprehensive overhaul of the underlying transport layer to support aggregated operations, aiming to boost overall performance and scalability on NPU hardware.
Code Review
This pull request introduces significant changes to support Ascend NPUs and implements a pooling mechanism for the TransferEngine. The changes are extensive, touching the build system, core store and transfer engine logic, Python bindings, and adding new transport layers for Ascend. While the overall direction seems correct, there are several issues that need to be addressed. I've found a critical bug in the RPC service logic, some design concerns regarding encapsulation and use of global variables, and several inconsistencies and leftover debug code. Please review my comments for details.
```diff
 for (size_t i = 0; i < keys.size(); ++i) {
     slice_len.reserve(keys.size());
     all_slice_len = 0;
     for (size_t j = 0; j < slice_lengths[i].size(); ++j) {
         all_slice_len += slice_lengths[i][j];
     }
     slice_len.emplace_back(all_slice_len);
     // LOG(ERROR) << "master_server put start, len:" << slice_lengths[i].size();
     results.emplace_back(
-        master_service_.PutStart(keys[i], slice_lengths[i], config));
+        master_service_.PutStart(keys[i], slice_len, config));
 }
```
There's a bug in the BatchPutStart implementation. The slice_len vector is not cleared within the loop, causing it to accumulate total sizes from previous keys. For the i-th key, master_service_.PutStart is called with a slice_len vector containing total sizes for keys 0 to i, instead of just for key i. This will likely lead to incorrect behavior or errors in the master service.
Additionally, slice_len.reserve(keys.size()); is called inside the loop, which is inefficient. It should be moved outside or removed if only one element is ever needed.
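To make the failure mode concrete, here is a standalone repro with hypothetical sizes (not taken from the PR): for two keys with slice lengths {10, 20} and {5}, the first PutStart call would receive {30}, but the second would receive {30, 5} instead of {5}.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    std::vector<std::vector<uint64_t>> slice_lengths = {{10, 20}, {5}};
    std::vector<uint64_t> slice_len;  // shared across iterations, never cleared
    for (std::size_t i = 0; i < slice_lengths.size(); ++i) {
        uint64_t all_slice_len = 0;
        for (uint64_t len : slice_lengths[i]) all_slice_len += len;
        slice_len.emplace_back(all_slice_len);
        // Prints "key 0 gets 1 length(s)", then "key 1 gets 2 length(s)":
        // the second key is started with {30, 5} instead of {5}.
        std::cout << "key " << i << " gets " << slice_len.size()
                  << " length(s)\n";
    }
}
```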
Suggested change:

```cpp
for (size_t i = 0; i < keys.size(); ++i) {
    uint64_t all_slice_len = 0;
    for (size_t j = 0; j < slice_lengths[i].size(); ++j) {
        all_slice_len += slice_lengths[i][j];
    }
    // LOG(ERROR) << "master_server put start, len:" << slice_lengths[i].size();
    results.emplace_back(
        master_service_.PutStart(keys[i], {all_slice_len}, config));
}
```

```cpp
if (replica.is_memory_replica() == false) {
    key_slices.emplace_back(Slice{buffers[j], sizes[j]});
} else {
    key_slices.emplace_back(Slice{buffers[j], sizes[j]});
}
```
Both branches of this if/else append the same Slice, so the is_memory_replica() check currently has no effect. Either drop the redundant branch or implement the intended distinction between memory and non-memory replicas.
```cpp
const size_t num_keys = 1;
std::vector<tl::expected<int64_t, ErrorCode>> results;
results.reserve(num_keys);

if (num_keys == 0) {
    return results;
```
The code const size_t num_keys = 1; followed by if (num_keys == 0) is dead code since num_keys is a compile-time constant and the condition will always be false. This should be removed to improve code clarity.
Suggested change:

```cpp
const size_t num_keys = 1;
std::vector<tl::expected<int64_t, ErrorCode>> results;
results.reserve(num_keys);
```
```cpp
    this->local_hostname = local_hostname;
}

LOG(ERROR) << "setup_internal local_hostname:" << this->local_hostname;
```
This log message uses LOG(ERROR) for what appears to be a debug/informational message. Using ERROR level for non-error conditions can clutter logs and make it harder to find real errors. Please consider changing this to LOG(INFO) or VLOG(1).
| LOG(ERROR) << "setup_internal local_hostname:" << this->local_hostname; | |
| LOG(INFO) << "setup_internal local_hostname:" << this->local_hostname; |
| LOG(ERROR) << "batch put keys size:" << keys.size() << ", ordered_batched_slices size:" << ordered_batched_slices.size() | ||
| << ", slice size len:" << slices.size(); |
This LOG(ERROR) message appears to be for debugging purposes. Please consider removing it or changing it to a lower severity level like LOG(INFO) or VLOG to avoid cluttering error logs.
LOG(INFO) << "batch put keys size:" << keys.size() << ", ordered_batched_slices size:" << ordered_batched_slices.size()
<< ", slice size len:" << slices.size();| // auto start = std::chrono::high_resolution_clock::now(); | ||
|
|
||
| auto internal_results = batch_get_into_internal_ascend(key, buffers, sizes); | ||
| std::vector<int> results; | ||
| results.reserve(internal_results.size()); | ||
|
|
||
| for (const auto &result : internal_results) { | ||
| results.push_back(to_py_ret(result)); | ||
| } | ||
| // auto stop = std::chrono::high_resolution_clock::now(); | ||
| // auto duration_call = | ||
| // std::chrono::duration_cast<std::chrono::microseconds>(stop - start); | ||
| // LOG(INFO) << "key: " << key << ", batch_get_into_ascend: " << duration_call.count() << "us"; |
This function contains commented-out code for performance measurement. This should be removed before merging to keep the codebase clean.
Suggested change:

```cpp
auto internal_results = batch_get_into_internal_ascend(key, buffers, sizes);
std::vector<int> results;
results.reserve(internal_results.size());
for (const auto &result : internal_results) {
    results.push_back(to_py_ret(result));
}
return results;
```
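If the timing is occasionally still useful, one alternative to commented-out stopwatch code is a small RAII helper that logs on scope exit and stays silent unless verbose logging is enabled. A minimal sketch (ScopedTimer is a hypothetical helper, not part of this PR), assuming glog is available:

```cpp
#include <chrono>
#include <string>

#include <glog/logging.h>

// Logs the elapsed wall time for the enclosing scope when verbose logging
// is enabled (e.g. --v=1); otherwise it only costs two clock reads.
class ScopedTimer {
   public:
    explicit ScopedTimer(std::string label)
        : label_(std::move(label)),
          start_(std::chrono::high_resolution_clock::now()) {}
    ~ScopedTimer() {
        auto stop = std::chrono::high_resolution_clock::now();
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
            stop - start_);
        VLOG(1) << label_ << ": " << us.count() << "us";
    }

   private:
    std::string label_;
    std::chrono::high_resolution_clock::time_point start_;
};
```

A call site then reduces to a single line at the top of the function, e.g. `ScopedTimer timer("batch_get_into_ascend");`.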
```cpp
std::vector<Transport *> listTransports();

std::map<std::string, std::shared_ptr<Transport>> transport_map_;
```
listTransports() hands out raw Transport* while ownership is held by the shared_ptrs in transport_map_, so callers can be left with dangling pointers if a transport is uninstalled. Consider returning shared_ptr (or weak_ptr) handles instead.
```cpp
    return local_topology_;
}

std::string local_server_name_;
```
```cpp
extern __attribute__ ((visibility ("default"))) std::shared_ptr<TransferEngine> g_transfer_engine;
extern __attribute__ ((visibility ("default"))) bool g_separate_pool;
```
The introduction of global variables g_transfer_engine and g_separate_pool for pooling is a design concern. Global state can make the code harder to reason about, test, and maintain. It also introduces tight coupling between different parts of the system. Have you considered alternative approaches, such as dependency injection or a singleton pattern with controlled access, to manage the shared TransferEngine instance?
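For illustration, a minimal sketch of the "controlled access" alternative (assuming C++17; TransferEnginePool and its members are hypothetical names, not part of this PR, and the real TransferEngine class is stubbed here):

```cpp
#include <memory>
#include <mutex>

class TransferEngine { /* stand-in for the real engine class */ };

// Hypothetical accessor replacing the extern globals: one place owns the
// shared engine, and callers can inject their own instance.
class TransferEnginePool {
   public:
    // Returns the process-wide engine, lazily constructing it on first use.
    static std::shared_ptr<TransferEngine> Get() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!engine_) engine_ = std::make_shared<TransferEngine>();
        return engine_;
    }

    // Dependency injection point: callers that need a separate pool (the
    // g_separate_pool case) or unit tests can swap in their own instance.
    static void Set(std::shared_ptr<TransferEngine> engine) {
        std::lock_guard<std::mutex> lock(mutex_);
        engine_ = std::move(engine);
    }

   private:
    static inline std::mutex mutex_;
    static inline std::shared_ptr<TransferEngine> engine_;
};
```

Call sites would ask TransferEnginePool::Get() instead of touching g_transfer_engine directly, which keeps the sharing behavior but makes the coupling explicit and testable.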
| LOG(WARNING) << "Transport " << proto << " already installed"; | ||
| return transport; | ||
| } | ||
| LOG(WARNING) << "Transport not used"; |
Conflict resolution in progress

Force-pushed from 3b6788e to 8fce1ff
No description provided.