
Add npu support to big model inference #2222

Merged
7 commits merged into huggingface:main from statelesshz:big-model-inference on Dec 8, 2023

Conversation

statelesshz
Contributor

@statelesshz statelesshz commented Dec 6, 2023

What does this PR do?

Fixes #2191

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@statelesshz statelesshz marked this pull request as draft December 6, 2023 11:58
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@junior-zsy

junior-zsy commented Dec 7, 2023

@statelesshz @muellerzr There is an error in this branch's code:

File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 678, in get_max_memory
max_memory = {i: torch.npu.mem_get_ifo(i)[0] for i in range(torch.npu.device_count())}
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 678, in
max_memory = {i: torch.npu.mem_get_ifo(i)[0] for i in range(torch.npu.device_count())}
AttributeError: module 'torch_npu.npu' has no attribute 'mem_get_ifo'
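
The attribute name itself is the bug: the branch calls torch.npu.mem_get_ifo instead of torch.npu.mem_get_info. A minimal sketch of what the corrected NPU branch of get_max_memory computes, assuming torch.npu.mem_get_info mirrors torch.cuda.mem_get_info and returns a (free, total) tuple in bytes:

import torch
import torch_npu  # registers the torch.npu namespace

def npu_max_memory():
    # Take the free-memory entry (index 0) of mem_get_info for each visible NPU.
    return {i: torch.npu.mem_get_info(i)[0] for i in range(torch.npu.device_count())}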

@junior-zsy

junior-zsy commented Dec 7, 2023

@statelesshz @muellerzr
I changed torch.npu.mem_get_ifo to torch.npu.max_memory_allocated and it works, but there are other errors:

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████| 7/7 [00:07<00:00, 1.06s/it]
Traceback (most recent call last):
File "server_fb.py", line 21, in
response, history = model.chat(tokenizer, "你是谁开发的", history=[])
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/jovyan/.cache/huggingface/modules/transformers_modules/chatglm3-6b-32k/modeling_chatglm.py", line 1038, in chat
outputs = self.generate(**inputs, **gen_kwargs, eos_token_id=eos_token_id)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/transformers/generation/utils.py", line 1572, in generate
return self.sample(
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/transformers/generation/utils.py", line 2619, in sample
outputs = self(
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jovyan/.cache/huggingface/modules/transformers_modules/chatglm3-6b-32k/modeling_chatglm.py", line 940, in forward
transformer_outputs = self.transformer(
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jovyan/.cache/huggingface/modules/transformers_modules/chatglm3-6b-32k/modeling_chatglm.py", line 833, in forward
hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jovyan/.cache/huggingface/modules/transformers_modules/chatglm3-6b-32k/modeling_chatglm.py", line 643, in forward
layer_ret = layer(
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jovyan/.cache/huggingface/modules/transformers_modules/chatglm3-6b-32k/modeling_chatglm.py", line 547, in forward
attention_output, kv_cache = self.self_attention(
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jovyan/.cache/huggingface/modules/transformers_modules/chatglm3-6b-32k/modeling_chatglm.py", line 379, in forward
mixed_x_layer = self.query_key_value(hidden_states)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in call_impl
return forward_call(*args, **kwargs)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: "addmm_impl_cpu" not implemented for 'Half'

@statelesshz
Contributor Author

statelesshz commented Dec 7, 2023

Hi @junior-zsy. Thanks for your feedback. Sorry, this PR needs more testing before it's ready for review.
FYI, the torch_npu v2.1.0 branch recently added torch.npu.mem_get_info, so you may need to use torch_npu compiled from source.

@junior-zsy

> Hi @junior-zsy. Thanks for your feedback. Sorry, this PR needs more testing before it's ready for review. FYI, the torch_npu v2.1.0 branch recently added torch.npu.mem_get_info, so you may need to use torch_npu compiled from source.

The memory lookup now succeeds, but checkpoint loading fails with a new error:

{0: 64145637376, 1: 64153034752}
Loading checkpoint shards: 0%| | 0/7 [00:01<?, ?it/s]
Traceback (most recent call last):
File "server_fb.py", line 19, in
model = AutoModelForCausalLM.from_pretrained("/home/jovyan/fast-data/chatglm3-6b-32k", device_map="auto",trust_remote_code=True)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 479, in from_pretrained
return model_class.from_pretrained(
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2881, in from_pretrained
) = cls._load_pretrained_model(
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3228, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/transformers/modeling_utils.py", line 720, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 315, in set_module_tensor_to_device
new_value = value.to(device)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/cuda/init.py", line 289, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
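
The assertion is raised because value.to(device) receives a bare integer index, which PyTorch interprets as a CUDA ordinal and routes through torch.cuda's lazy init. A hedged sketch of the kind of rewrite the NPU path needs (normalize_device is a hypothetical helper, not the PR's actual diff):

import torch
import torch_npu

def normalize_device(device):
    # tensor.to(0) means "cuda:0" to PyTorch, so on an NPU-only build a bare
    # integer index trips "Torch not compiled with CUDA enabled"; rewrite it
    # as an explicit "npu:<index>" string instead.
    if isinstance(device, int) and torch.npu.is_available():
        return f"npu:{device}"
    return device

t = torch.ones(2, 2)
t = t.to(normalize_device(0))  # lands on "npu:0" rather than "cuda:0"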

@junior-zsy

@statelesshz

replace .to() with .to("npu:") when using torch_npu,new error ,The model can be loaded now, but it cannot be forward

File "server_fb.py", line 23, in
response, history = model.chat(tokenizer, "你是谁开发的", history=[])
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/jovyan/.cache/huggingface/modules/transformers_modules/chatglm3-6b-32k/modeling_chatglm.py", line 1038, in chat
outputs = self.generate(**inputs, **gen_kwargs, eos_token_id=eos_token_id)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/transformers/generation/utils.py", line 1572, in generate
return self.sample(
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/transformers/generation/utils.py", line 2619, in sample
outputs = self(
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/accelerate/hooks.py", line 160, in new_forward
args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/accelerate/hooks.py", line 297, in pre_forward
return send_to_device(args, self.execution_device), send_to_device(
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/accelerate/utils/operations.py", line 161, in send_to_device
{
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/accelerate/utils/operations.py", line 162, in
k: t if k in skip_keys else send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/accelerate/utils/operations.py", line 168, in send_to_device
return tensor.to(device, non_blocking=non_blocking)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/cuda/init.py", line 289, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

@junior-zsy

junior-zsy commented Dec 7, 2023

@statelesshz It's the same problem: the device needs to change from an int to "npu:int". I have modified some of the code and it now runs.
Multiple cards can work, but I have discovered another issue: multiple cards cannot be used together with multiple threads.

code:

import time
import threading

import torch
import torch_npu
from transformers import AutoModelForCausalLM, AutoTokenizer

def chat_in_thread(tokenizer, model, i):
    start_time = time.time()
    # "你是谁开发的" = "Who developed you?"
    response, history = model.chat(tokenizer, "你是谁开发的", history=[])
    end_time = time.time()
    print(f"Thread {i}: Response - {response}")
    print(f"Thread {i}: Execution time - {end_time - start_time} seconds")

# Start time before loading the model
start_time = time.time()

tokenizer = AutoTokenizer.from_pretrained("/home/jovyan/fast-data/chatglm3-6b-32k", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("/home/jovyan/fast-data/chatglm3-6b-32k", device_map="auto", trust_remote_code=True)

print(model)
response, history = model.chat(tokenizer, "你是谁开发的", history=[])
print(response)

# Compute the model loading time
model_load_time = time.time() - start_time

# Create several threads that each run the chat function
threads = []
num_threads = 4

for i in range(num_threads):
    thread = threading.Thread(target=chat_in_thread, args=(tokenizer, model, i))
    threads.append(thread)

# Start the threads
for thread in threads:
    thread.start()

# Wait for all threads to finish
for thread in threads:
    thread.join()

print("Model loading time:", model_load_time, "seconds")

error:
I was developed based on GLM3-6B, a language model jointly trained in 2022 by Tsinghua University's KEG Lab and Zhipu AI. My task is to provide appropriate answers and support for users' questions and requests.
Exception in thread Thread-2:
Traceback (most recent call last):
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "server_fb.py", line 10, in chat_in_thread
response, history = model.chat(tokenizer, "你是谁开发的", history=[])
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/jovyan/.cache/huggingface/modules/transformers_modules/chatglm3-6b-32k/modeling_chatglm.py", line 1035, in chat
inputs = inputs.to(self.device)
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 772, in to
self.data = {k: v.to(device=device) for k, v in self.data.items()}
File "/home/jovyan/fast-data/mini/envs/huawei/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 772, in
self.data = {k: v.to(device=device) for k, v in self.data.items()}
RuntimeError: getDevice:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:41 NPU error, error code is 107002
[Error]: The context is empty.
Check whether acl.rt.set_context or acl.rt.set_device is called.
EE1001: The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
TraceBack (most recent call last):
ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:4290]
The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]

(Threads 3, 4, and 5 fail the same way: their tracebacks interleave in the output, but each ends in the same NPU error 107002, "The context is empty.")

Model loading time: 18.54421830177307 seconds
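
The ACL message already hints at the fix: NPU contexts are per-thread, so each worker thread needs set_device called before its first NPU op. A hedged workaround sketch, assuming torch.npu.set_device behaves like its torch.cuda counterpart:

import torch
import torch_npu

def chat_in_thread(tokenizer, model, i):
    # Bind this thread to an NPU before any device op; without it ACL
    # reports error 107002 ("The context is empty").
    torch.npu.set_device("npu:0")
    response, history = model.chat(tokenizer, "你是谁开发的", history=[])
    print(f"Thread {i}: Response - {response}")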

@statelesshz
Contributor Author

Hi @junior-zsy, let's focus on the work in progress in this PR :-) If you find unexpected behavior when using torch_npu, feel free to open an issue.

@junior-zsy

@statelesshz Okay, though that leaves the multithreading issue unaddressed.

@statelesshz statelesshz force-pushed the big-model-inference branch 3 times, most recently from 61a105f to f6d5704 Compare December 7, 2023 10:43
@statelesshz
Contributor Author

Verified on Baichuan2-13B-Base with the following results:

(inference) [root@node-35 inference]# cat test.py
import torch
import torch_npu
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig
tokenizer = AutoTokenizer.from_pretrained("/home/gpt_neox/weights_second/Baichuan2-13B-Base", use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("/home/gpt_neox/weights_second/Baichuan2-13B-Base", device_map="auto", trust_remote_code=True)
inputs = tokenizer('登鹳雀楼->王之涣\n夜雨寄北->', return_tensors='pt')
inputs = inputs.to('npu:0')
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
(inference) [root@node-35 inference]# vim test.py
(inference) [root@node-35 inference]# python test.py
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:58<00:00, 19.38s/it]
登鹳雀楼->王之涣
夜雨寄北->李商隐
望天门山->李白
饮湖上初晴后雨->苏轼
惠崇春江晚景->苏轼
题西林壁->苏轼
夏日绝句->李清照
示儿->陆游
秋夜将晓出篱门迎凉有感->陆游

@statelesshz statelesshz marked this pull request as ready for review December 8, 2023 08:12
@statelesshz
Contributor Author

Hi @SunMarc, this PR is ready for review :-)

Member

@SunMarc SunMarc left a comment

Thanks for this clean integration @statelesshz! Can you have a second look, @muellerzr?

Collaborator

@muellerzr muellerzr left a comment

Thanks for all your work on this! Great job!

@SunMarc SunMarc merged commit 9964f90 into huggingface:main Dec 8, 2023
23 checks passed
@statelesshz statelesshz deleted the big-model-inference branch December 9, 2023 03:10
Linked issue #2191: device_map not work on huawei NPU device