[CANN] Improve the Inferencing Performance for Ascend NPU Device #10454
Conversation
Co-authored-by: Frank Mai <thxCode@thxcode0824@gmail.com>
ggml/src/ggml-cann/ggml-cann.cpp
Outdated
@@ -334,14 +341,19 @@ struct ggml_cann_pool_vmm : public ggml_cann_pool {
    std::vector<void*> map_offsets;

    /**
     * @brief Constructor to initialize the buffer pool with virtual memory for
duplicate line.
ggml/src/ggml-cann/aclnn_ops.cpp
Outdated
void ggml_cann_mul_mat(ggml_backend_cann_context& ctx, ggml_tensor* dst) {
    const enum ggml_type type = dst->src[0]->type;
    switch (type) {
        case GGML_TYPE_F32:
        case GGML_TYPE_F16:
            ggml_cann_mat_mul_fp(ctx, dst);
            ggml_cann_mat_mul_fp2(ctx, dst);
If the old function is not used anymore, it's better to delete it.
    int64_t split_size = (src0->ne[1] / max_elem_size) + 1;
    ggml_cann_pool_alloc workspace_allocator(ctx.pool());
    aclOpExecutor* executor = nullptr;
    uint64_t workspaceSize = 0;
It seems this part removes the limitation on the maximum line length (65536). Please check whether Qwen2-1.5B-Instruct Q8_0 is supported now.
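For context, a minimal sketch of the chunking idea behind the split_size line above; the 65536 element limit is taken from this review comment, and the loop body is an assumption, not the actual aclnn call:

```cpp
#include <algorithm>
#include <cstdint>

// Process `total` elements in chunks of at most `max_elem_size`, mirroring
// the split_size computation in the hunk above (split_size = total / max + 1).
void process_in_chunks(int64_t total, int64_t max_elem_size /* e.g. 65536 */) {
    int64_t split_size = (total / max_elem_size) + 1;
    for (int64_t i = 0; i < split_size; ++i) {
        int64_t offset = i * max_elem_size;
        int64_t count  = std::min(max_elem_size, total - offset);
        if (count <= 0) {
            break;  // the last chunk may be empty when total divides evenly
        }
        // ... run the operator on elements [offset, offset + count) ...
    }
}
```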
ggml/src/ggml-cann/ggml-cann.cpp
Outdated
@@ -457,8 +472,10 @@ struct ggml_cann_pool_vmm : public ggml_cann_pool {
 */
std::unique_ptr<ggml_cann_pool> ggml_backend_cann_context::new_pool_for_device(
    int device) {
    // return std::unique_ptr<ggml_cann_pool>(new ggml_cann_pool_leg(device));
    return std::unique_ptr<ggml_cann_pool>(new ggml_cann_pool_vmm(device));
    if (device == 0) {
Why should different devices use different memory pools?
ggml/src/ggml-cann/ggml-cann.cpp
Outdated
@@ -470,23 +487,22 @@ std::unique_ptr<ggml_cann_pool> ggml_backend_cann_context::new_pool_for_device(
 */
struct ggml_backend_cann_buffer_context {
    int32_t device; ///< The device ID associated with this buffer context.
    void* dev_ptr =
        nullptr; ///< Pointer to the device memory allocated for the buffer.
    ggml_cann_pool_alloc* alloc; ///< Pointer to the device memory allocated for the buffer.
The vmm pool requires stack-ordered memory allocation and release: the memory allocated last must be released first, otherwise an error will occur. Is it possible to guarantee such an allocation and release order here?
As far as I know, the tensor buffer and the KV cache will use this buffer.
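For illustration, here is a minimal sketch of the stack-ordered (LIFO) allocate/release pattern described above; the pool type and its methods are hypothetical, not the actual CANN or ggml API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bump-style pool: the most recently allocated region must be
// the first one released, mirroring the constraint described for the vmm pool.
struct lifo_pool {
    std::vector<size_t> offsets;  // start offset of each live allocation
    size_t head = 0;              // current top of the pool

    size_t alloc(size_t size) {
        offsets.push_back(head);
        size_t off = head;
        head += size;
        return off;
    }

    void release(size_t off) {
        // Only the last allocation may be released (stack order).
        assert(!offsets.empty() && offsets.back() == off);
        head = offsets.back();
        offsets.pop_back();
    }
};
```

If buffers such as the tensor buffer and the KV cache were freed in a different order than they were allocated, a pool with this constraint would fail at the release step.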
ggml/src/ggml-cann/ggml-cann.cpp
Outdated
@@ -1863,17 +1872,17 @@ struct ggml_backend_cann_device_context {
};

static const char * ggml_backend_cann_device_get_name(ggml_backend_dev_t dev) {
    ggml_backend_cann_device_context * ctx = (ggml_backend_cann_device_context *)dev->context;
    ggml_backend_cann_context * ctx = (ggml_backend_cann_context *)dev->context;
ggml_backend_cuda_device_context is the latest code in llama.cpp. Do not change this part.
ggml/src/ggml-cann/ggml-cann.cpp
Outdated
    dev_ctx->device = i;
    dev_ctx->name = GGML_CANN_NAME + std::to_string(i);
    ggml_cann_set_device(i);
    ggml_backend_cann_context* dev_ctx = new ggml_backend_cann_context(i);
Do not change this part.
ggml/src/ggml-cann/ggml-cann.cpp
Outdated
        return nullptr;
    }
    ggml_cann_set_device(ctx->device);
    ggml_backend_dev_t dev = ggml_backend_reg_dev_get(ggml_backend_cann_reg(), device);
Do not change this part.
ggml/src/ggml-cann/ggml-cann.cpp
Outdated
    ggml_cann_set_device(buft_ctx->device);

    size = std::max(size, (size_t)1);
    ggml_backend_cann_context* cann_ctx =
Wrong data type. Causes a core dump.
ggml/src/ggml-cann/ggml-cann.cpp
Outdated
@@ -470,14 +483,12 @@ std::unique_ptr<ggml_cann_pool> ggml_backend_cann_context::new_pool_for_device(
 */
struct ggml_backend_cann_buffer_context {
    int32_t device; ///< The device ID associated with this buffer context.
    void* dev_ptr =
        nullptr; ///< Pointer to the device memory allocated for the buffer.
    void* dev_ptr = nullptr;
If this line is not being modified, keep it unchanged.
ggml/src/ggml-cann/ggml-cann.cpp
Outdated
@@ -486,7 +497,7 @@ struct ggml_backend_cann_buffer_context {
    /**
     * @brief Destructor to free the device memory allocated for the buffer.
     */
    ~ggml_backend_cann_buffer_context() { ACL_CHECK(aclrtFree(dev_ptr)); }
    ~ggml_backend_cann_buffer_context() { ACL_CHECK(aclrtFree(dev_ptr));}
If this line is not being modified, keep it unchanged.
ggml/src/ggml-cann/ggml-cann.cpp
Outdated
    /**
     * @brief Constructor to initialize the CANN buffer context.
     *
     * @param device The device ID associated with this buffer context.
     * @param dev_ptr Pointer to the device memory allocated for the buffer.
If this line is not being modified, keep it unchanged.
…ganov#10454)

* improve inferencing performance for ascend npu.

Co-authored-by: Frank Mai <thxCode@thxcode0824@gmail.com>

* some modification after review

* some modifications after review

* restore some modifications

* restore some modifications

---------

Co-authored-by: shanshan shen <shanshanshen333@gmail.com>
Co-authored-by: Frank Mai <thxCode@thxcode0824@gmail.com>
What does this PR do?
Overview
This PR aims to improve the inference performance of llama.cpp on Ascend NPU devices by dispatching the different kinds of matrix computation to their most suitable operations.
Note
This improvement was first implemented by Frank Mai in his repo, llama-box. I applied the cann.patch to llama.cpp, after which I made some minor modifications.
Environment
Examples
Before this optimization, we only used aclnn_mat_mul for fp16 tensor computation. After this optimization, we use aclnn_mat_mul_2d for 2D tensor computation, aclnn_mat_mul_3d for 3D tensor computation, and aclnn_mat_mul as the default, which achieves higher inference performance on the NPU device, as sketched below.
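To make the dispatch concrete, here is a minimal sketch of the dimension-based selection described above, assuming helper functions named after the description (aclnn_mat_mul_2d, aclnn_mat_mul_3d, aclnn_mat_mul); the actual code in aclnn_ops.cpp may differ in names and details:

```cpp
// Hypothetical dispatch on the number of non-trivial batch dimensions of src0.
// The aclnn_mat_mul_* helpers are assumptions based on the description above.
static void cann_mat_mul_dispatch(ggml_backend_cann_context& ctx, ggml_tensor* dst) {
    const ggml_tensor* src0 = dst->src[0];

    const bool is_2d = src0->ne[2] == 1 && src0->ne[3] == 1;  // plain matrix
    const bool is_3d = !is_2d && src0->ne[3] == 1;            // one batch dimension

    if (is_2d) {
        aclnn_mat_mul_2d(ctx, dst);   // 2D GEMM path
    } else if (is_3d) {
        aclnn_mat_mul_3d(ctx, dst);   // batched 3D matmul path
    } else {
        aclnn_mat_mul(ctx, dst);      // generic fallback (e.g. 4D / broadcast cases)
    }
}
```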
Benchmark
We use the model qwen2.5-7b-instruct-fp16.gguf for our benchmark.
Before optimization
Before this optimization, inference performance was low, at 12.70 tokens/s on the Ascend NPU device. The test logs are shown below.
After optimization
After this optimization, inference performance is significantly improved on the Ascend NPU device, reaching 35.11 tokens/s. The test logs are shown below.