
[CANN] RoPE and CONCAT operator optimization #10488

Merged
merged 1 commit into ggerganov:master from cann_rope_optimization
Nov 26, 2024

Conversation

noemotiovon
Contributor

@noemotiovon noemotiovon commented Nov 25, 2024

What does this PR do?

  1. Adjusts the implementation of the RoPE operator in the CANN backend to call the aclnn package instead.
  2. Refines the support-check logic that determines which operators the backend can handle.
  3. Fixes a calculation-accuracy issue in the CONCAT operator.
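For context, RoPE (rotary position embedding) rotates consecutive channel pairs by a position-dependent angle; the aclnn package provides a fused NPU kernel for this. The sketch below is a minimal NumPy reference of what the operator computes, not the ggml-cann implementation; the names `rope_ref` and `freq_base` are illustrative:

```python
import numpy as np

def rope_ref(x, pos, freq_base=10000.0):
    """Reference RoPE: rotate each consecutive channel pair of x by
    theta_i = pos * freq_base**(-2i/d), where d is the head dimension."""
    d = x.shape[-1]
    assert d % 2 == 0, "head dimension must be even"
    i = np.arange(d // 2)
    theta = pos * freq_base ** (-2.0 * i / d)   # one angle per channel pair
    cos, sin = np.cos(theta), np.sin(theta)
    x0, x1 = x[..., 0::2], x[..., 1::2]         # even / odd channels
    out = np.empty_like(x)
    out[..., 0::2] = x0 * cos - x1 * sin        # 2-D rotation per pair
    out[..., 1::2] = x0 * sin + x1 * cos
    return out

# at position 0 every angle is 0, so the rotation is the identity
x = np.random.default_rng(0).normal(size=(4, 8))
assert np.allclose(rope_ref(x, pos=0), x)
```

Because each pair is only rotated, the operator preserves vector norms, which is a convenient sanity check for any backend implementation.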

Environment

OS: ubuntu 20.04
NPU: Atlas 300T A2
CANN: 8.0.RC2

Inference Tests

Model: Qwen-0.5B
Script: ./build/bin/llama-cli -m /home/lcg/gguf_model/Qwen2-0___5B-Instruct.gguf -p "给我讲个故事" -ngl 32

Before optimization:
Before this optimization, inference performance on the Ascend NPU device is low, at 37.27 tokens/s. The test logs are shown below.

(llama.cpp) lcg@lcg-docker:~/github/llama.cpp$ ./build/bin/llama-cli -m /home/lcg/gguf_model/Qwen2-0___5B-Instruct.gguf -p "给我讲个故事" -ngl 32

......

Tell me a story!
Sure. There was a man named Xiao Ming who had a dream: to become an artist. One day he met an artist named Li Hua, who invited him to visit his painting exhibition. They walked into the studio together and admired all kinds of paintings. Li Hua's favorite was one painting whose scene was especially beautiful, and Xiao Ming was deeply captivated by the lovely scenery in it. From then on, Xiao Ming decided to become an artist; he began practicing painting every day, hoping that one day he could become a real artist.

Write me a song!
Sure, I'd be happy to write a song for you. It's called "Flying Free". The lyrics go: flying free in the blue sky, my heart flies toward freedom; I fear no storm, I fear no hardship, I am the bird that flies free. I hope this song inspires you and brings you joy and insight. If you need more specific lyrics, let me know.

I need a fun topic to chat about!
Of course! Tell me what kind of topic you'd like to chat about.

How's the weather today? Should I go out?
The weather is clear today, good for going out. Still, you'd better check the forecast in advance, just in case.

Can you tell me what tomorrow's weather will be like?
I can't provide real-time information; I suggest checking the weather forecast or a weather website. For more precise information, check a source or app for your area.

Give me some ideas, help me brainstorm!
Sure, here are some ideas that I hope will inspire you. For example, you could host a small get-together on the weekend and invite friends over for games, a movie, or crafts. Or you could try something new, like an art exhibition or an outdoor hike, to make your weekend more meaningful. I hope these ideas help you brainstorm. What do you think? [end of text]


llama_perf_sampler_print:    sampling time =     679.16 ms /   359 runs   (    1.89 ms per token,   528.59 tokens per second)
llama_perf_context_print:        load time =    4283.10 ms
llama_perf_context_print: prompt eval time =      28.26 ms /     4 tokens (    7.07 ms per token,   141.54 tokens per second)
llama_perf_context_print:        eval time =    9499.47 ms /   354 runs   (   26.83 ms per token,    37.27 tokens per second)
llama_perf_context_print:       total time =   10883.45 ms /   358 tokens

After optimization:
After this optimization, inference performance on the Ascend NPU device has improved significantly, reaching 46.25 tokens/s. The test logs are shown below.

(llama.cpp) lcg@lcg-docker:~/github/llama.cpp$ ./build/bin/llama-cli -m /home/lcg/gguf_model/Qwen2-0___5B-Instruct.gguf -p "给我讲个故事" -ngl 32

......

Tell me a story.
Sure. What kind of story would you like to hear? For example, science fiction, romance, adventure, mystery, and so on.

Tell a romance story.

Sure, let me tell you a story called "Xiao Lin".

Xiao Lin was an ordinary office worker who had always had a dream: to own a villa of his own. He worked very hard, striving every day to make that dream come true.

One day, Xiao Lin met a girl named Xiao Mei, an ordinary girl who also loved life. By chance, the two discovered their shared interests and decided to start a business together, working toward Xiao Lin's dream.

Along the way, Xiao Lin ran into many difficulties, but he never gave up. He learned how to run a business, handle commercial problems, manage a team, and deal with customer complaints. Through his efforts, his startup team gradually grew, and in the end his dream came true.

The story of Xiao Lin and Xiao Mei tells us that as long as you have a dream, it is worth pursuing. No matter how many difficulties you meet, with determination and courage you can make your dream come true. It also reminds us that every bit of effort in life deserves to be cherished. [end of text]


llama_perf_sampler_print:    sampling time =     453.75 ms /   229 runs   (    1.98 ms per token,   504.68 tokens per second)
llama_perf_context_print:        load time =    4309.82 ms
llama_perf_context_print: prompt eval time =      24.02 ms /     4 tokens (    6.01 ms per token,   166.52 tokens per second)
llama_perf_context_print:        eval time =    4842.78 ms /   224 runs   (   21.62 ms per token,    46.25 tokens per second)
llama_perf_context_print:       total time =    5782.36 ms /   228 tokens
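The headline numbers can be cross-checked directly from the two eval lines above (runs divided by eval time in seconds):

```python
# figures taken from the llama_perf_context_print eval lines above
before_tps = 354 / (9499.47 / 1000)   # 354 runs in 9499.47 ms, before
after_tps = 224 / (4842.78 / 1000)    # 224 runs in 4842.78 ms, after

print(round(before_tps, 2))           # ≈ 37.27 tokens/s
print(round(after_tps, 2))            # ≈ 46.25 tokens/s
print(f"decode speedup ≈ {after_tps / before_tps:.2f}x")  # ≈ 1.24x
```

So the change yields roughly a 24% faster decode on this model and device.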

Operator Tests

./build/bin/test-backend-ops test -b CANN0 -o CONCAT

Before optimization:

Backend 1/2: CANN0
  Device description: Ascend910B3
  Device memory: 62432 MB (62168 MB free)

  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=0,v=0): CANN error: EZ1001: 2024-11-25-12:11:33.799.065 dim 3 of tensor 1 is [7], should be equal to tensor 0 [11].

  current device: 0, in function aclnn_concat at /home/lcg/github/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:227
  aclnnCatGetWorkspaceSize(tensorList, concat_dim, acl_dst, &workspaceSize, &executor)
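The error message is consistent with an axis-order mismatch: ggml stores shapes as ne[0..3] with ne[0] innermost, while the ACL API indexes dimensions outermost-first, so a concat along ggml dim d has to be remapped before being passed to aclnnCat. A hedged sketch of that mapping, under the assumption that this is the source of the accuracy issue (the helper name is illustrative, not the actual ggml-cann code):

```python
def ggml_dim_to_acl_dim(dim, n_dims=4):
    """Map a ggml axis (ne[0] is the innermost/fastest-varying) to the
    corresponding ACL axis (dimension 0 is the outermost)."""
    return (n_dims - 1) - dim

# For ne_a=[11,12,13,14] concatenated with a tensor whose ggml dim 0 is 7,
# concat along ggml dim 0 must become ACL axis 3, so ACL checks that the
# *other* axes match instead of the concatenated one.
assert ggml_dim_to_acl_dim(0) == 3
assert ggml_dim_to_acl_dim(3) == 0
```

Without the remap, ACL is told to concatenate along the wrong axis and demands that the sizes of the concatenated ggml axis (11 vs. 7) agree, which is exactly the failure logged above.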

After optimization:

(llama.cpp) lcg@lcg-docker:~/github/llama.cpp$ ./build/bin/test-backend-ops test -b CANN0 -o CONCAT
register_backend: registered backend CANN (1 devices)
register_device: registered device CANN0 (Ascend910B3)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (CPU)
Testing 2 devices

Backend 1/2: CANN0
  Device description: Ascend910B3
  Device memory: 62432 MB (62168 MB free)

  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=0,v=0): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=0,v=0): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=1,v=0): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=1,v=0): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=2,v=0): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=2,v=0): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=3,v=0): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=3,v=0): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=0,v=1): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=0,v=1): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=1,v=1): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=1,v=1): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=2,v=1): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=2,v=1): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=3,v=1): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=3,v=1): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=0,v=2): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=0,v=2): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=1,v=2): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=1,v=2): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=2,v=2): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=2,v=2): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=3,v=2): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=3,v=2): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=0,v=3): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=0,v=3): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=1,v=3): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=1,v=3): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=2,v=3): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=2,v=3): OK
  CONCAT(type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=3,v=3): OK
  CONCAT(type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=3,v=3): OK
  1918/1918 tests passed
  Backend CANN0: OK

Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

@hipudding hipudding added the Ascend NPU issues specific to Ascend NPUs label Nov 25, 2024
@hipudding hipudding self-requested a review November 25, 2024 12:13
Review comment on ggml/src/ggml-cann/ggml-cann.cpp (resolved)
@noemotiovon noemotiovon force-pushed the cann_rope_optimization branch from fe1f1c9 to 5d06ee7 Compare November 26, 2024 08:36
@noemotiovon
Contributor Author

@hipudding, I have updated the code according to your comments, please check it again! Thank you very much! 😊

@noemotiovon noemotiovon changed the title [CANN] RoPE optimization [CANN] RoPE and CONCAT operator optimization Nov 26, 2024
@hipudding hipudding self-requested a review November 26, 2024 09:10
@hipudding hipudding merged commit 7066b4c into ggerganov:master Nov 26, 2024
54 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
Co-authored-by: noemotiovon <noemotiovon@gmail.com>