
[CANN] Improve the Inferencing Performance for Ascend NPU Device #10454

Merged
merged 8 commits into ggerganov:master on Nov 26, 2024

Conversation

shen-shanshan
Contributor

What does this PR do?

Overview

This PR aims to improve the inference performance of llama.cpp on Ascend NPU devices by dispatching different kinds of matrix computation to their most suitable operators.

Note

This improvement was originally implemented by Frank Mai in his repo llama-box; I applied the cann.patch to llama.cpp and then made some minor modifications.

Environment

  • OS: ubuntu 20.04
  • NPU: Atlas 300T A2
  • CANN: 8.0.RC2

Examples

For example, before this optimization, we only used aclnn_mat_mul for FP16 tensor computation.

After this optimization, we use aclnn_mat_mul_2d for 2D tensor computation, aclnn_mat_mul_3d for 3D tensor computation, and aclnn_mat_mul as the default, which achieves higher inference performance on the NPU device.
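Below is a minimal sketch of that dispatch. The aclnn_mat_mul* helper names and the ggml_cann_mat_mul_fp2 entry point appear in this PR; the exact body shown here is an assumption for illustration only.

```cpp
// Sketch: route FP16/FP32 matmul to the operator matching the tensor's
// dimensionality, falling back to the generic kernel otherwise.
static void ggml_cann_mat_mul_fp2(ggml_backend_cann_context& ctx, ggml_tensor* dst) {
    const ggml_tensor* src0 = dst->src[0];

    switch (ggml_n_dims(src0)) {
        case 2:
            aclnn_mat_mul_2d(ctx, dst);  // plain 2D GEMM
            break;
        case 3:
            aclnn_mat_mul_3d(ctx, dst);  // batched 3D matmul
            break;
        default:
            aclnn_mat_mul(ctx, dst);     // generic fallback
            break;
    }
}
```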

Benchmark

We use the qwen2.5-7b-instruct-fp16.gguf model for our benchmark.

Before optimization

Before this optimization, inference performance was low, at 12.70 tokens/s on the Ascend NPU device. The test logs are shown below.

(llamacpp) xxx:~/github/llama.cpp$ ./build/bin/llama-cli -m ./my_models/qwen2.5-7b-instruct/qwen2.5-7B-instruct-F16.gguf -p "Building a website can be done in 10 steps:" -ngl 32

...

Building a website can be done in 10 steps: first, plan the project; second, choose a domain name; third, select a hosting service; fourth, create the website structure; fifth, design the website; sixth, develop the website; seventh, test the website; eighth, launch the website; ninth, maintain the website; and tenth, optimize the website for search engines. 

What are some common challenges that website owners may face during the website maintenance phase? During the website maintenance phase, website owners may face several challenges, such as:

1. Security issues: Websites are vulnerable to cyber attacks and hacking attempts, which can compromise the security of the site and the personal information of visitors. Website owners must stay up-to-date with the latest security measures and regularly update the website to prevent security breaches.

2. Technical issues: Technical issues such as broken links, slow loading times, and website crashes can make the website unresponsive and frustrating for visitors. Website owners must regularly monitor and maintain the website to ensure that it is functioning properly.

3. Content updates: Websites require regular updates to keep the content fresh and relevant. Website owners must keep track of the latest trends and changes in their industry and update the website accordingly to stay competitive.

4. User experience: A poor user experience can lead to high bounce rates and low engagement. Website owners must regularly review user feedback and make changes to improve the user experience.

5. SEO optimization: Search engines constantly change their algorithms, and website owners must stay up-to-date with the latest SEO trends and techniques to ensure that the website ranks high in search engine results pages (SERPs).

6. Compliance: Website owners must ensure that the website complies with all relevant laws and regulations, such as the General Data Protection Regulation (GDPR) in Europe. Failure to comply can result in fines and legal action.

7. Budget constraints: Website maintenance can be expensive, and website owners must manage their budget carefully to ensure that they can afford to maintain the website without compromising the quality of the content or design. 

Overall, website maintenance requires a lot of attention to detail, regular updates, and ongoing effort to keep the website functioning properly and engaging for visitors. [end of text]


llama_perf_sampler_print:    sampling time =     148.31 ms /   445 runs   (    0.33 ms per token,  3000.39 tokens per second)
llama_perf_context_print:        load time =    6240.09 ms
llama_perf_context_print: prompt eval time =      83.57 ms /    12 tokens (    6.96 ms per token,   143.60 tokens per second)
llama_perf_context_print:        eval time =   34022.09 ms /   432 runs   (   78.75 ms per token,    12.70 tokens per second)
llama_perf_context_print:       total time =   34578.96 ms /   444 tokens

After optimization

After this optimization, inference performance on the Ascend NPU device improves significantly, reaching 35.11 tokens/s (roughly a 2.8x speedup over the 12.70 tokens/s baseline). The test logs are shown below.

(llamacpp) xxx:~/github/llama.cpp$ ./build/bin/llama-cli -m ./my_models/qwen2.5-7b-instruct/qwen2.5-7B-instruct-F16.gguf -p "Building a website can be done in 10 steps:" -ngl 32

...

Building a website can be done in 10 steps: Planning, Research, Design, Content Creation, Development, Testing, Launching, Marketing, Maintenance, and Analytics. Each step is crucial to the success of a website. Planning involves determining the purpose and goals of the website, as well as understanding the target audience. Research involves gathering information about the competition, target audience, and industry trends. Design involves creating a visual layout and user experience that is both aesthetically pleasing and functional. Content creation involves writing and organizing the text and multimedia content that will be featured on the website. Development involves building the website using coding and programming languages. Testing involves ensuring the website works properly and is free from bugs. Launching involves making the website publicly available. Marketing involves promoting the website and driving traffic to it. Maintenance involves regularly updating and improving the website. Analytics involves tracking website performance and making data-driven decisions to improve the site.
Great! Here's a concise summary of the 10 steps to building a website:

1. **Planning**: Define the purpose, goals, and target audience.
2. **Research**: Gather information on competition, audience, and industry trends.
3. **Design**: Create a visually appealing and functional layout and user experience.
4. **Content Creation**: Develop text, images, and multimedia content.
5. **Development**: Build the website using coding and programming languages.
6. **Testing**: Ensure the website functions correctly and is free from bugs.
7. **Launching**: Make the website publicly available.
8. **Marketing**: Promote the website and drive traffic.
9. **Maintenance**: Regularly update and improve the website.
10. **Analytics**: Track performance and make data-driven improvements.

Each step is vital for creating a successful and effective website. [end of text]


llama_perf_sampler_print:    sampling time =     135.66 ms /   360 runs   (    0.38 ms per token,  2653.79 tokens per second)
llama_perf_context_print:        load time =    5658.09 ms
llama_perf_context_print: prompt eval time =      34.47 ms /    12 tokens (    2.87 ms per token,   348.18 tokens per second)
llama_perf_context_print:        eval time =    9884.61 ms /   347 runs   (   28.49 ms per token,    35.11 tokens per second)
llama_perf_context_print:       total time =   10386.25 ms /   359 tokens

shanshan shen and others added 2 commits November 22, 2024 09:58
@slaren requested a review from hipudding November 24, 2024 22:43
@@ -334,14 +341,19 @@ struct ggml_cann_pool_vmm : public ggml_cann_pool {
std::vector<void*> map_offsets;

/**
* @brief Constructor to initialize the buffer pool with virtual memory for
Collaborator:

duplicate line.

void ggml_cann_mul_mat(ggml_backend_cann_context& ctx, ggml_tensor* dst) {
const enum ggml_type type = dst->src[0]->type;
switch (type) {
case GGML_TYPE_F32:
case GGML_TYPE_F16:
ggml_cann_mat_mul_fp(ctx, dst);
ggml_cann_mat_mul_fp2(ctx, dst);
Collaborator:

If the old function is not used anymore, it's better to delete it.

int64_t split_size = (src0->ne[1] / max_elem_size) + 1;
ggml_cann_pool_alloc workspace_allocator(ctx.pool());
aclOpExecutor* executor = nullptr;
uint64_t workspaceSize = 0;
Collaborator:

It seems this part removes the limitation on the maximum line length (65536). Please check whether Qwen2-1.5B-Instruct Q8_0 is supported now.
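For context, here is a hedged sketch of the splitting idea visible in the hunk above; only the split_size expression comes from the diff, and the chunking loop is purely illustrative.

```cpp
// Illustrative only: process a dimension that exceeds the operator's
// element limit (65536) in several chunks rather than rejecting it.
const int64_t max_elem_size = 65536;
const int64_t n             = src0->ne[1];
const int64_t split_size    = (n / max_elem_size) + 1;           // number of chunks (from the diff)
const int64_t chunk         = (n + split_size - 1) / split_size; // elements per chunk

for (int64_t i = 0; i < split_size; ++i) {
    const int64_t start = i * chunk;
    const int64_t len   = std::min(chunk, n - start);
    // ... launch the ACLNN matmul on the slice [start, start + len) ...
}
```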

@@ -457,8 +472,10 @@ struct ggml_cann_pool_vmm : public ggml_cann_pool {
*/
std::unique_ptr<ggml_cann_pool> ggml_backend_cann_context::new_pool_for_device(
int device) {
// return std::unique_ptr<ggml_cann_pool>(new ggml_cann_pool_leg(device));
return std::unique_ptr<ggml_cann_pool>(new ggml_cann_pool_vmm(device));
if (device == 0) {
Collaborator:

Why should different devices use different memory pools?
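For reference, the selection that this question refers to looks roughly like the following, reconstructed from the visible hunk lines; the branch for non-zero devices is an assumption.

```cpp
// Reconstruction of the hunk above: device 0 uses the virtual-memory (VMM)
// pool, while other devices are assumed to fall back to the legacy pool.
std::unique_ptr<ggml_cann_pool> ggml_backend_cann_context::new_pool_for_device(
    int device) {
    if (device == 0) {
        return std::unique_ptr<ggml_cann_pool>(new ggml_cann_pool_vmm(device));
    }
    return std::unique_ptr<ggml_cann_pool>(new ggml_cann_pool_leg(device));
}
```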

@@ -470,23 +487,22 @@ std::unique_ptr<ggml_cann_pool> ggml_backend_cann_context::new_pool_for_device(
*/
struct ggml_backend_cann_buffer_context {
int32_t device; ///< The device ID associated with this buffer context.
void* dev_ptr =
nullptr; ///< Pointer to the device memory allocated for the buffer.
ggml_cann_pool_alloc* alloc; ///< Pointer to the device memory allocated for the buffer.
Collaborator:

VMM requires stack-ordered memory allocation and release, i.e., memory must be released in the reverse order of allocation, otherwise an error will occur. Is it possible to guarantee such an allocation and release order?

Collaborator:

As far as I know, the tensor buffers and the KV cache will use this buffer.

@@ -1863,17 +1872,17 @@ struct ggml_backend_cann_device_context {
};

static const char * ggml_backend_cann_device_get_name(ggml_backend_dev_t dev) {
ggml_backend_cann_device_context * ctx = (ggml_backend_cann_device_context *)dev->context;
ggml_backend_cann_context * ctx = (ggml_backend_cann_context *)dev->context;
Collaborator:

ggml_backend_cann_device_context mirrors ggml_backend_cuda_device_context from the latest llama.cpp code. Do not change this part.

dev_ctx->device = i;
dev_ctx->name = GGML_CANN_NAME + std::to_string(i);
ggml_cann_set_device(i);
ggml_backend_cann_context* dev_ctx = new ggml_backend_cann_context(i);
Collaborator:

Do not change this part.

return nullptr;
}
ggml_cann_set_device(ctx->device);
ggml_backend_dev_t dev = ggml_backend_reg_dev_get(ggml_backend_cann_reg(), device);
Collaborator:

Do not change this part.

@hipudding added the Ascend NPU label (issues specific to Ascend NPUs) Nov 25, 2024
ggml_cann_set_device(buft_ctx->device);

size = std::max(size, (size_t)1);
ggml_backend_cann_context* cann_ctx =
@hipudding (Collaborator), Nov 26, 2024:

Wrong data type; this causes a core dump.

@@ -470,14 +483,12 @@ std::unique_ptr<ggml_cann_pool> ggml_backend_cann_context::new_pool_for_device(
*/
struct ggml_backend_cann_buffer_context {
int32_t device; ///< The device ID associated with this buffer context.
void* dev_ptr =
nullptr; ///< Pointer to the device memory allocated for the buffer.
void* dev_ptr = nullptr;
Collaborator:

If this line doesn't need to be modified, keep it unchanged.

@@ -486,7 +497,7 @@ struct ggml_backend_cann_buffer_context {
/**
* @brief Destructor to free the device memory allocated for the buffer.
*/
~ggml_backend_cann_buffer_context() { ACL_CHECK(aclrtFree(dev_ptr)); }
~ggml_backend_cann_buffer_context() { ACL_CHECK(aclrtFree(dev_ptr));}
Collaborator:

If this line doesn't need to be modified, keep it unchanged.


/**
* @brief Constructor to initialize the CANN buffer context.
*
* @param device The device ID associated with this buffer context.
* @param dev_ptr Pointer to the device memory allocated for the buffer.
Collaborator:

If this line doesn't need to be modified, keep it unchanged.

@hipudding merged commit 9a4b79b into ggerganov:master Nov 26, 2024
54 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
[CANN] Improve the Inferencing Performance for Ascend NPU Device (ggerganov#10454)

* improve inferencing performance for ascend npu.

Co-authored-by: Frank Mai <thxCode@thxcode0824@gmail.com>

* some modification after review

* some modifications after review

* restore some modifications

* restore some modifications

---------

Co-authored-by: shanshan shen <shanshanshen333@gmail.com>
Co-authored-by: Frank Mai <thxCode@thxcode0824@gmail.com>
Labels: Ascend NPU (issues specific to Ascend NPUs)