Pooling async memecpy v1 #986
Conversation
Summary of Changes

Hello @AscendTransport, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly advances the integration and optimization for Ascend NPU devices within the system. It introduces a more efficient memory copy strategy by leveraging Ascend-specific APIs and refactors the core transfer engine for better resource sharing and pooling. The changes also include new batch data transfer functionalities tailored for Ascend, alongside a comprehensive overhaul of the underlying transport layer to support aggregated operations, aiming to boost overall performance and scalability on NPU hardware.
Code Review
This pull request introduces significant changes to support Ascend NPUs and implements a pooling mechanism for the TransferEngine. The changes are extensive, touching the build system, core store and transfer engine logic, Python bindings, and adding new transport layers for Ascend. While the overall direction seems correct, there are several issues that need to be addressed. I've found a critical bug in the RPC service logic, some design concerns regarding encapsulation and use of global variables, and several inconsistencies and leftover debug code. Please review my comments for details.
```diff
 for (size_t i = 0; i < keys.size(); ++i) {
     slice_len.reserve(keys.size());
     all_slice_len = 0;
     for (size_t j = 0; j < slice_lengths[i].size(); ++j) {
         all_slice_len += slice_lengths[i][j];
     }
     slice_len.emplace_back(all_slice_len);
     // LOG(ERROR) << "master_server put start, len:" << slice_lengths[i].size();
     results.emplace_back(
-        master_service_.PutStart(keys[i], slice_lengths[i], config));
+        master_service_.PutStart(keys[i], slice_len, config));
 }
```
There's a bug in the BatchPutStart implementation. The slice_len vector is not cleared within the loop, causing it to accumulate total sizes from previous keys. For the i-th key, master_service_.PutStart is called with a slice_len vector containing total sizes for keys 0 to i, instead of just for key i. This will likely lead to incorrect behavior or errors in the master service.
Additionally, slice_len.reserve(keys.size()); is called inside the loop, which is inefficient. It should be moved outside or removed if only one element is ever needed.
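To make the failure mode concrete, here is a standalone repro with hypothetical sizes (not taken from the PR): for two keys with slice lengths {10, 20} and {5}, the first PutStart call would receive {30}, but the second would receive {30, 5} instead of {5}.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    std::vector<std::vector<uint64_t>> slice_lengths = {{10, 20}, {5}};
    std::vector<uint64_t> slice_len;  // shared across iterations, never cleared
    for (std::size_t i = 0; i < slice_lengths.size(); ++i) {
        uint64_t all_slice_len = 0;
        for (uint64_t len : slice_lengths[i]) all_slice_len += len;
        slice_len.emplace_back(all_slice_len);
        // Prints "key 0 gets 1 length(s)", then "key 1 gets 2 length(s)":
        // the second key is started with {30, 5} instead of {5}.
        std::cout << "key " << i << " gets " << slice_len.size()
                  << " length(s)\n";
    }
}
```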
Suggested change:

```cpp
for (size_t i = 0; i < keys.size(); ++i) {
    uint64_t all_slice_len = 0;
    for (size_t j = 0; j < slice_lengths[i].size(); ++j) {
        all_slice_len += slice_lengths[i][j];
    }
    // LOG(ERROR) << "master_server put start, len:" << slice_lengths[i].size();
    results.emplace_back(
        master_service_.PutStart(keys[i], {all_slice_len}, config));
}
```

```cpp
if (replica.is_memory_replica() == false) {
    key_slices.emplace_back(Slice{buffers[j], sizes[j]});
} else {
    key_slices.emplace_back(Slice{buffers[j], sizes[j]});
}
```
Both branches of this if/else append the same Slice, so the is_memory_replica() check currently has no effect. Either drop the redundant branch or implement the intended distinction between memory and non-memory replicas.
```cpp
const size_t num_keys = 1;
std::vector<tl::expected<int64_t, ErrorCode>> results;
results.reserve(num_keys);

if (num_keys == 0) {
    return results;
```
The code const size_t num_keys = 1; followed by if (num_keys == 0) is dead code since num_keys is a compile-time constant and the condition will always be false. This should be removed to improve code clarity.
Suggested change:

```cpp
const size_t num_keys = 1;
std::vector<tl::expected<int64_t, ErrorCode>> results;
results.reserve(num_keys);
```
```cpp
    this->local_hostname = local_hostname;
}

LOG(ERROR) << "setup_internal local_hostname:" << this->local_hostname;
```
This log message uses LOG(ERROR) for what appears to be a debug/informational message. Using ERROR level for non-error conditions can clutter logs and make it harder to find real errors. Please consider changing this to LOG(INFO) or VLOG(1).
| LOG(ERROR) << "setup_internal local_hostname:" << this->local_hostname; | |
| LOG(INFO) << "setup_internal local_hostname:" << this->local_hostname; |
| LOG(ERROR) << "batch put keys size:" << keys.size() << ", ordered_batched_slices size:" << ordered_batched_slices.size() | ||
| << ", slice size len:" << slices.size(); |
This LOG(ERROR) message appears to be for debugging purposes. Please consider removing it or changing it to a lower severity level like LOG(INFO) or VLOG to avoid cluttering error logs.
LOG(INFO) << "batch put keys size:" << keys.size() << ", ordered_batched_slices size:" << ordered_batched_slices.size()
<< ", slice size len:" << slices.size();| // auto start = std::chrono::high_resolution_clock::now(); | ||
|
|
||
| auto internal_results = batch_get_into_internal_ascend(key, buffers, sizes); | ||
| std::vector<int> results; | ||
| results.reserve(internal_results.size()); | ||
|
|
||
| for (const auto &result : internal_results) { | ||
| results.push_back(to_py_ret(result)); | ||
| } | ||
| // auto stop = std::chrono::high_resolution_clock::now(); | ||
| // auto duration_call = | ||
| // std::chrono::duration_cast<std::chrono::microseconds>(stop - start); | ||
| // LOG(INFO) << "key: " << key << ", batch_get_into_ascend: " << duration_call.count() << "us"; |
This function contains commented-out code for performance measurement. This should be removed before merging to keep the codebase clean.
Suggested change:

```cpp
auto internal_results = batch_get_into_internal_ascend(key, buffers, sizes);
std::vector<int> results;
results.reserve(internal_results.size());
for (const auto &result : internal_results) {
    results.push_back(to_py_ret(result));
}
return results;
```
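If the timing is occasionally still useful, one alternative to commented-out stopwatch code is a small RAII helper that logs on scope exit and stays silent unless verbose logging is enabled. A minimal sketch (ScopedTimer is a hypothetical helper, not part of this PR), assuming glog is available:

```cpp
#include <chrono>
#include <string>

#include <glog/logging.h>

// Logs the elapsed wall time for the enclosing scope when verbose logging
// is enabled (e.g. --v=1); otherwise it only costs two clock reads.
class ScopedTimer {
   public:
    explicit ScopedTimer(std::string label)
        : label_(std::move(label)),
          start_(std::chrono::high_resolution_clock::now()) {}
    ~ScopedTimer() {
        auto stop = std::chrono::high_resolution_clock::now();
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
            stop - start_);
        VLOG(1) << label_ << ": " << us.count() << "us";
    }

   private:
    std::string label_;
    std::chrono::high_resolution_clock::time_point start_;
};
```

A call site then reduces to a single line at the top of the function, e.g. `ScopedTimer timer("batch_get_into_ascend");`.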
```cpp
std::vector<Transport *> listTransports();

std::map<std::string, std::shared_ptr<Transport>> transport_map_;
```
listTransports() hands out raw Transport* while ownership is held by the shared_ptrs in transport_map_, so callers can be left with dangling pointers if a transport is uninstalled. Consider returning shared_ptr (or weak_ptr) handles instead.
```cpp
    return local_topology_;
}

std::string local_server_name_;
```
```cpp
extern __attribute__ ((visibility ("default"))) std::shared_ptr<TransferEngine> g_transfer_engine;
extern __attribute__ ((visibility ("default"))) bool g_separate_pool;
```
The introduction of global variables g_transfer_engine and g_separate_pool for pooling is a design concern. Global state can make the code harder to reason about, test, and maintain. It also introduces tight coupling between different parts of the system. Have you considered alternative approaches, such as dependency injection or a singleton pattern with controlled access, to manage the shared TransferEngine instance?
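For illustration, a minimal sketch of the "controlled access" alternative (assuming C++17; TransferEnginePool and its members are hypothetical names, not part of this PR, and the real TransferEngine class is stubbed here):

```cpp
#include <memory>
#include <mutex>

class TransferEngine { /* stand-in for the real engine class */ };

// Hypothetical accessor replacing the extern globals: one place owns the
// shared engine, and callers can inject their own instance.
class TransferEnginePool {
   public:
    // Returns the process-wide engine, lazily constructing it on first use.
    static std::shared_ptr<TransferEngine> Get() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!engine_) engine_ = std::make_shared<TransferEngine>();
        return engine_;
    }

    // Dependency injection point: callers that need a separate pool (the
    // g_separate_pool case) or unit tests can swap in their own instance.
    static void Set(std::shared_ptr<TransferEngine> engine) {
        std::lock_guard<std::mutex> lock(mutex_);
        engine_ = std::move(engine);
    }

   private:
    static inline std::mutex mutex_;
    static inline std::shared_ptr<TransferEngine> engine_;
};
```

Call sites would ask TransferEnginePool::Get() instead of touching g_transfer_engine directly, which keeps the sharing behavior but makes the coupling explicit and testable.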
| LOG(WARNING) << "Transport " << proto << " already installed"; | ||
| return transport; | ||
| } | ||
| LOG(WARNING) << "Transport not used"; |
Conflict resolution in progress

Force-pushed from 3b6788e to 8fce1ff
No description provided.