[CANN] Improve the Inferencing Performance for Ascend NPU Device #10454
Conversation
Co-authored-by: Frank Mai <thxCode@thxcode0824@gmail.com>
ggml/src/ggml-cann/ggml-cann.cpp
Outdated
@@ -334,14 +341,19 @@ struct ggml_cann_pool_vmm : public ggml_cann_pool {
    std::vector<void*> map_offsets;

    /**
     * @brief Constructor to initialize the buffer pool with virtual memory for
duplicate line.
ggml/src/ggml-cann/aclnn_ops.cpp
Outdated
void ggml_cann_mul_mat(ggml_backend_cann_context& ctx, ggml_tensor* dst) {
    const enum ggml_type type = dst->src[0]->type;
    switch (type) {
        case GGML_TYPE_F32:
        case GGML_TYPE_F16:
            ggml_cann_mat_mul_fp(ctx, dst);
            ggml_cann_mat_mul_fp2(ctx, dst);
If the old function is not used anymore, it's better to delete it.
    int64_t split_size = (src0->ne[1] / max_elem_size) + 1;
    ggml_cann_pool_alloc workspace_allocator(ctx.pool());
    aclOpExecutor* executor = nullptr;
    uint64_t workspaceSize = 0;
It seems this part removes the limitation on the maximum line length (65536). Please check whether Qwen2-1.5B-Instruct Q8_0 is supported now.
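For context, a minimal sketch of the chunking idea behind the split_size line above; the 65536 element limit is taken from this review comment, and the loop body is an assumption, not the actual aclnn call:

```cpp
#include <algorithm>
#include <cstdint>

// Process `total` elements in chunks of at most `max_elem_size`, mirroring
// the split_size computation in the hunk above (split_size = total / max + 1).
void process_in_chunks(int64_t total, int64_t max_elem_size /* e.g. 65536 */) {
    int64_t split_size = (total / max_elem_size) + 1;
    for (int64_t i = 0; i < split_size; ++i) {
        int64_t offset = i * max_elem_size;
        int64_t count  = std::min(max_elem_size, total - offset);
        if (count <= 0) {
            break;  // the last chunk may be empty when total divides evenly
        }
        // ... run the operator on elements [offset, offset + count) ...
    }
}
```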
ggml/src/ggml-cann/ggml-cann.cpp
Outdated
@@ -457,8 +472,10 @@ struct ggml_cann_pool_vmm : public ggml_cann_pool {
 */
std::unique_ptr<ggml_cann_pool> ggml_backend_cann_context::new_pool_for_device(
    int device) {
    // return std::unique_ptr<ggml_cann_pool>(new ggml_cann_pool_leg(device));
    return std::unique_ptr<ggml_cann_pool>(new ggml_cann_pool_vmm(device));
    if (device == 0) {
Why should different devices use different memory pools?
ggml/src/ggml-cann/ggml-cann.cpp
Outdated
@@ -470,23 +487,22 @@ std::unique_ptr<ggml_cann_pool> ggml_backend_cann_context::new_pool_for_device(
 */
struct ggml_backend_cann_buffer_context {
    int32_t device; ///< The device ID associated with this buffer context.
    void* dev_ptr =
        nullptr; ///< Pointer to the device memory allocated for the buffer.
    ggml_cann_pool_alloc* alloc; ///< Pointer to the device memory allocated for the buffer.
The vmm pool requires stack-ordered memory allocation and release: the memory allocated last must be released first, otherwise an error will occur. Is it possible to guarantee such an allocation and release order here?
As far as I know, the tensor buffer and the KV cache will use this buffer.
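For illustration, here is a minimal sketch of the stack-ordered (LIFO) allocate/release pattern described above; the pool type and its methods are hypothetical, not the actual CANN or ggml API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bump-style pool: the most recently allocated region must be
// the first one released, mirroring the constraint described for the vmm pool.
struct lifo_pool {
    std::vector<size_t> offsets;  // start offset of each live allocation
    size_t head = 0;              // current top of the pool

    size_t alloc(size_t size) {
        offsets.push_back(head);
        size_t off = head;
        head += size;
        return off;
    }

    void release(size_t off) {
        // Only the last allocation may be released (stack order).
        assert(!offsets.empty() && offsets.back() == off);
        head = offsets.back();
        offsets.pop_back();
    }
};
```

If buffers such as the tensor buffer and the KV cache were freed in a different order than they were allocated, a pool with this constraint would fail at the release step.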
ggml/src/ggml-cann/ggml-cann.cpp
Outdated
@@ -1863,17 +1872,17 @@ struct ggml_backend_cann_device_context {
};

static const char * ggml_backend_cann_device_get_name(ggml_backend_dev_t dev) {
    ggml_backend_cann_device_context * ctx = (ggml_backend_cann_device_context *)dev->context;
    ggml_backend_cann_context * ctx = (ggml_backend_cann_context *)dev->context;
ggml_backend_cuda_device_context is the latest code in llama.cpp. Do not change this part.
ggml/src/ggml-cann/ggml-cann.cpp
Outdated
    dev_ctx->device = i;
    dev_ctx->name = GGML_CANN_NAME + std::to_string(i);
    ggml_cann_set_device(i);
    ggml_backend_cann_context* dev_ctx = new ggml_backend_cann_context(i);
Do not change this part.
ggml/src/ggml-cann/ggml-cann.cpp
Outdated
        return nullptr;
    }
    ggml_cann_set_device(ctx->device);
    ggml_backend_dev_t dev = ggml_backend_reg_dev_get(ggml_backend_cann_reg(), device);
Do not change this part.
ggml/src/ggml-cann/ggml-cann.cpp
Outdated
    ggml_cann_set_device(buft_ctx->device);

    size = std::max(size, (size_t)1);
    ggml_backend_cann_context* cann_ctx =
Wrong data type. Causes a core dump.
ggml/src/ggml-cann/ggml-cann.cpp
Outdated
@@ -470,14 +483,12 @@ std::unique_ptr<ggml_cann_pool> ggml_backend_cann_context::new_pool_for_device(
 */
struct ggml_backend_cann_buffer_context {
    int32_t device; ///< The device ID associated with this buffer context.
    void* dev_ptr =
        nullptr; ///< Pointer to the device memory allocated for the buffer.
    void* dev_ptr = nullptr;
If this line is not being modified, keep it unchanged.
ggml/src/ggml-cann/ggml-cann.cpp
Outdated
@@ -486,7 +497,7 @@ struct ggml_backend_cann_buffer_context {
    /**
     * @brief Destructor to free the device memory allocated for the buffer.
     */
    ~ggml_backend_cann_buffer_context() { ACL_CHECK(aclrtFree(dev_ptr)); }
    ~ggml_backend_cann_buffer_context() { ACL_CHECK(aclrtFree(dev_ptr));}
If this line is not being modified, keep it unchanged.
ggml/src/ggml-cann/ggml-cann.cpp
Outdated
    /**
     * @brief Constructor to initialize the CANN buffer context.
     *
     * @param device The device ID associated with this buffer context.
     * @param dev_ptr Pointer to the device memory allocated for the buffer.
If this line is not being modified, keep it unchanged.
…ganov#10454)

* improve inferencing performance for ascend npu.

Co-authored-by: Frank Mai <thxCode@thxcode0824@gmail.com>

* some modification after review

* some modifications after review

* restore some modifications

* restore some modifications

---------

Co-authored-by: shanshan shen <shanshanshen333@gmail.com>
Co-authored-by: Frank Mai <thxCode@thxcode0824@gmail.com>
What does this PR do?
Overview
This PR aims to improve the inference performance of llama.cpp on Ascend NPU devices by dispatching the different kinds of matrix computation to their most suitable operations.
Note
This improvement was first implemented by Frank Mai in his repo, llama-box. I applied the cann.patch to llama.cpp, after which I made some minor modifications.
Environment
Examples
Before this optimization, we only used aclnn_mat_mul for fp16 tensor computation. After this optimization, we use aclnn_mat_mul_2d for 2D tensor computation, aclnn_mat_mul_3d for 3D tensor computation, and aclnn_mat_mul as the default, which achieves higher inference performance on the NPU device, as sketched below.
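To make the dispatch concrete, here is a minimal sketch of the dimension-based selection described above, assuming helper functions named after the description (aclnn_mat_mul_2d, aclnn_mat_mul_3d, aclnn_mat_mul); the actual code in aclnn_ops.cpp may differ in names and details:

```cpp
// Hypothetical dispatch on the number of non-trivial batch dimensions of src0.
// The aclnn_mat_mul_* helpers are assumptions based on the description above.
static void cann_mat_mul_dispatch(ggml_backend_cann_context& ctx, ggml_tensor* dst) {
    const ggml_tensor* src0 = dst->src[0];

    const bool is_2d = src0->ne[2] == 1 && src0->ne[3] == 1;  // plain matrix
    const bool is_3d = !is_2d && src0->ne[3] == 1;            // one batch dimension

    if (is_2d) {
        aclnn_mat_mul_2d(ctx, dst);   // 2D GEMM path
    } else if (is_3d) {
        aclnn_mat_mul_3d(ctx, dst);   // batched 3D matmul path
    } else {
        aclnn_mat_mul(ctx, dst);      // generic fallback (e.g. 4D / broadcast cases)
    }
}
```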
Benchmark
We use the model qwen2.5-7b-instruct-fp16.gguf for our benchmark.
Before optimization
Before this optimization, inference performance was low, at 12.70 tokens/s on the Ascend NPU device. The test logs are shown below.
After optimization
After this optimization, inference performance is significantly improved on the Ascend NPU device, reaching 35.11 tokens/s. The test logs are shown below.