
[CANN] RoPE and CONCAT operator optimization #10488

Merged
merged 1 commit into ggerganov:master from cann_rope_optimization
Nov 26, 2024

Conversation

noemotiovon
Contributor

@noemotiovon noemotiovon commented Nov 25, 2024

What does this PR do?

  1. Adjusts the implementation of the RoPE operator in the CANN backend to call the aclnn package instead.
  2. Refines the support-check logic that determines which operators the backend can handle.
  3. Fixes a calculation-accuracy issue in the CONCAT operator.
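For context, RoPE (rotary position embedding) rotates consecutive channel pairs by a position-dependent angle; the aclnn package provides a fused NPU kernel for this. The sketch below is a minimal NumPy reference of what the operator computes, not the ggml-cann implementation; the names `rope_ref` and `freq_base` are illustrative:

```python
import numpy as np

def rope_ref(x, pos, freq_base=10000.0):
    """Reference RoPE: rotate each consecutive channel pair of x by
    theta_i = pos * freq_base**(-2i/d), where d is the head dimension."""
    d = x.shape[-1]
    assert d % 2 == 0, "head dimension must be even"
    i = np.arange(d // 2)
    theta = pos * freq_base ** (-2.0 * i / d)   # one angle per channel pair
    cos, sin = np.cos(theta), np.sin(theta)
    x0, x1 = x[..., 0::2], x[..., 1::2]         # even / odd channels
    out = np.empty_like(x)
    out[..., 0::2] = x0 * cos - x1 * sin        # 2-D rotation per pair
    out[..., 1::2] = x0 * sin + x1 * cos
    return out

# at position 0 every angle is 0, so the rotation is the identity
x = np.random.default_rng(0).normal(size=(4, 8))
assert np.allclose(rope_ref(x, pos=0), x)
```

Because each pair is only rotated, the operator preserves vector norms, which is a convenient sanity check for any backend implementation.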

Environment

OS: ubuntu 20.04
NPU: Atlas 300T A2
CANN: 8.0.RC2

Inference Tests

Model: Qwen-0.5B
Script: ./build/bin/llama-cli -m /home/lcg/gguf_model/Qwen2-0___5B-Instruct.gguf -p "给我讲个故事" -ngl 32

Before optimization:
Before this optimization, inference performance on the Ascend NPU device is low, at 37.27 tokens/s. The test logs are shown below.

(llama.cpp) lcg@lcg-docker:~/github/llama.cpp$ ./build/bin/llama-cli -m /home/lcg/gguf_model/Qwen2-0___5B-Instruct.gguf -p "给我讲个故事" -ngl 32

......

Tell me a story!
Sure. There was a man named Xiao Ming who had a dream: to become an artist. One day he met an artist named Li Hua, who invited him to visit his painting exhibition. They walked into the studio together and admired all kinds of paintings. Li Hua's favorite was one painting whose scene was especially beautiful, and Xiao Ming was deeply captivated by the lovely scenery in it. From then on, Xiao Ming decided to become an artist; he began practicing painting every day, hoping that one day he could become a real artist.

Write me a song!
Sure, I'd be happy to write a song for you. It's called "Flying Free". The lyrics go: flying free in the blue sky, my heart flies toward freedom; I fear no storm, I fear no hardship, I am the bird that flies free. I hope this song inspires you and brings you joy and insight. If you need more specific lyrics, let me know.

I need a fun topic to chat about!
Of course! Tell me what kind of topic you'd like to chat about.

How's the weather today? Should I go out?
The weather is clear today, good for going out. Still, you'd better check the forecast in advance, just in case.

Can you tell me what tomorrow's weather will be like?
I can't provide real-time information; I suggest checking the weather forecast or a weather website. For more precise information, check a source or app for your area.

Give me some ideas, help me brainstorm!
Sure, here are some ideas that I hope will inspire you. For example, you could host a small get-together on the weekend and invite friends over for games, a movie, or crafts. Or you could try something new, like an art exhibition or an outdoor hike, to make your weekend more meaningful. I hope these ideas help you brainstorm. What do you think? [end of text]


llama_perf_sampler_print:    sampling time =     679.16 ms /   359 runs   (    1.89 ms per token,   528.59 tokens per second)
llama_perf_context_print:        load time =    4283.10 ms
llama_perf_context_print: prompt eval time =      28.26 ms /     4 tokens (    7.07 ms per token,   141.54 tokens per second)
llama_perf_context_print:        eval time =    9499.47 ms /   354 runs   (   26.83 ms per token,    37.27 tokens per second)
llama_perf_context_print:       total time =   10883.45 ms /   358 tokens

After optimization:
After this optimization, inference performance on the Ascend NPU device has improved significantly, reaching 46.25 tokens/s. The test logs are shown below.

(llama.cpp) lcg@lcg-docker:~/github/llama.cpp$ ./build/bin/llama-cli -m /home/lcg/gguf_model/Qwen2-0___5B-Instruct.gguf -p "给我讲个故事" -ngl 32

......

Tell me a story.
Sure. What kind of story would you like to hear? For example, science fiction, romance, adventure, mystery, and so on.

Tell a romance story.

Sure, let me tell you a story called "Xiao Lin".

Xiao Lin was an ordinary office worker who had always had a dream: to own a villa of his own. He worked very hard, striving every day to make that dream come true.

One day, Xiao Lin met a girl named Xiao Mei, an ordinary girl who also loved life. By chance, the two discovered their shared interests and decided to start a business together, working toward Xiao Lin's dream.

Along the way, Xiao Lin ran into many difficulties, but he never gave up. He learned how to run a business, handle commercial problems, manage a team, and deal with customer complaints. Through his efforts, his startup team gradually grew, and in the end his dream came true.

The story of Xiao Lin and Xiao Mei tells us that as long as you have a dream, it is worth pursuing. No matter how many difficulties you meet, with determination and courage you can make your dream come true. It also reminds us that every bit of effort in life deserves to be cherished. [end of text]


llama_perf_sampler_print:    sampling time =     453.75 ms /   229 runs   (    1.98 ms per token,   504.68 tokens per second)
llama_perf_context_print:        load time =    4309.82 ms
llama_perf_context_print: prompt eval time =      24.02 ms /     4 tokens (    6.01 ms per token,   166.52 tokens per second)
llama_perf_context_print:        eval time =    4842.78 ms /   224 runs   (   21.62 ms per token,    46.25 tokens per second)
llama_perf_context_print:       total time =    5782.36 ms /   228 tokens
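The headline numbers can be cross-checked directly from the two eval lines above (runs divided by eval time in seconds):

```python
# figures taken from the llama_perf_context_print eval lines above
before_tps = 354 / (9499.47 / 1000)   # 354 runs in 9499.47 ms, before
after_tps = 224 / (4842.78 / 1000)    # 224 runs in 4842.78 ms, after

print(round(before_tps, 2))           # ≈ 37.27 tokens/s
print(round(after_tps, 2))            # ≈ 46.25 tokens/s
print(f"decode speedup ≈ {after_tps / before_tps:.2f}x")  # ≈ 1.24x
```

So the change yields roughly a 24% faster decode on this model and device.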

Operator Tests

./build/bin/test-backend-ops test -b CANN0 -o CONCAT

Before optimization:

Backend 1/2: CANN0
  Device description: Ascend910B3
  Device memory: 62432 MB (62168 MB free)

  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=0,v=0): CANN error: EZ1001: 2024-11-25-12:11:33.799.065 dim 3 of tensor 1 is [7], should be equal to tensor 0 [11].

  current device: 0, in function aclnn_concat at /home/lcg/github/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:227
  aclnnCatGetWorkspaceSize(tensorList, concat_dim, acl_dst, &workspaceSize, &executor)
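The error message is consistent with an axis-order mismatch: ggml stores shapes as ne[0..3] with ne[0] innermost, while the ACL API indexes dimensions outermost-first, so a concat along ggml dim d has to be remapped before being passed to aclnnCat. A hedged sketch of that mapping, under the assumption that this is the source of the accuracy issue (the helper name is illustrative, not the actual ggml-cann code):

```python
def ggml_dim_to_acl_dim(dim, n_dims=4):
    """Map a ggml axis (ne[0] is the innermost/fastest-varying) to the
    corresponding ACL axis (dimension 0 is the outermost)."""
    return (n_dims - 1) - dim

# For ne_a=[11,12,13,14] concatenated with a tensor whose ggml dim 0 is 7,
# concat along ggml dim 0 must become ACL axis 3, so ACL checks that the
# *other* axes match instead of the concatenated one.
assert ggml_dim_to_acl_dim(0) == 3
assert ggml_dim_to_acl_dim(3) == 0
```

Without the remap, ACL is told to concatenate along the wrong axis and demands that the sizes of the concatenated ggml axis (11 vs. 7) agree, which is exactly the failure logged above.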

After optimization:

(llama.cpp) lcg@lcg-docker:~/github/llama.cpp$ ./build/bin/test-backend-ops test -b CANN0 -o CONCAT
register_backend: registered backend CANN (1 devices)
register_device: registered device CANN0 (Ascend910B3)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (CPU)
Testing 2 devices

Backend 1/2: CANN0
  Device description: Ascend910B3
  Device memory: 62432 MB (62168 MB free)

  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=0,v=0): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=0,v=0): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=1,v=0): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=1,v=0): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=2,v=0): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=2,v=0): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=3,v=0): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=3,v=0): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=0,v=1): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=0,v=1): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=1,v=1): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=1,v=1): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=2,v=1): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=2,v=1): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=3,v=1): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=3,v=1): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=0,v=2): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=0,v=2): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=1,v=2): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=1,v=2): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=2,v=2): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=2,v=2): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=3,v=2): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=3,v=2): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=0,v=3): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=0,v=3): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=1,v=3): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=1,v=3): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=2,v=3): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=2,v=3): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=3,v=3): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=3,v=3): OK
  1918/1918 tests passed
  Backend CANN0: OK

Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

@hipudding hipudding added the Ascend NPU issues specific to Ascend NPUs label Nov 25, 2024
@hipudding hipudding self-requested a review November 25, 2024 12:13
Review comment on ggml/src/ggml-cann/ggml-cann.cpp (resolved)
@noemotiovon noemotiovon force-pushed the cann_rope_optimization branch from fe1f1c9 to 5d06ee7 Compare November 26, 2024 08:36
@noemotiovon
Contributor Author

@hipudding, I have updated the code according to your comments, please check it again! Thank you very much! 😊

@noemotiovon noemotiovon changed the title [CANN] RoPE optimization [CANN] RoPE and CONCAT operator optimization Nov 26, 2024
@hipudding hipudding self-requested a review November 26, 2024 09:10
@hipudding hipudding merged commit 7066b4c into ggerganov:master Nov 26, 2024
54 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
Co-authored-by: noemotiovon <noemotiovon@gmail.com>