
Summary of NPU model training, plus questions #4388

Open · 1 task done
sweetning0809 opened this issue Jun 20, 2024 · 19 comments
Labels: npu (This problem is related to NPU devices), pending (This problem is yet to be addressed)

sweetning0809 commented Jun 20, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

QWEN2-1.5B (0.5B)

Works.

QWEN2-7B (MoE)

Needs bf16, see #4278; otherwise works.

QWEN2-72B

Works, with one caveat: it only launches on 8 cards (ZeRO stage 3); on 16 cards it OOMs. The cause still needs investigation.

glm4

Comment out the torch.jit line and use bf16; see #4339 and #3788. A sketch of the edit follows below.
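
A minimal sketch of that edit, assuming the checkpoint's modeling_chatglm.py applies @torch.jit.script to apply_rotary_pos_emb (verify against your local file; the decorator may sit elsewhere):

```python
# modeling_chatglm.py inside the GLM checkpoint (layout assumed; see #4339 / #3788)
import torch

# @torch.jit.script  # <-- comment out the TorchScript decorator for NPU training
def apply_rotary_pos_emb(x: torch.Tensor, rope_cache: torch.Tensor) -> torch.Tensor:
    ...  # original function body unchanged
```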

chatglm3

Same approach as above. Additionally, after merging the model you must copy every file from the original checkpoint folder except the *.bin weight shards and pytorch_model.bin.index.json into the merged folder; see #1307. A sketch follows below.
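
A minimal sketch of that copy step (src and dst are hypothetical paths; adjust to your checkpoint layout):

```python
import shutil
from pathlib import Path

src = Path("/models/chatglm3-6b")          # original checkpoint (hypothetical path)
dst = Path("/models/chatglm3-6b-merged")   # merged output folder (hypothetical path)

for f in src.iterdir():
    # Skip the original weight shards and their index; keep tokenizer,
    # config, and custom modeling code files.
    if f.name.endswith(".bin") or f.name == "pytorch_model.bin.index.json":
        continue
    if f.is_file():
        shutil.copy(f, dst / f.name)
```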

DeepSeek (MoE)

Fails; the model needs operator conversion, see: https://www.hiascend.com/document/detail/zh/Pytorch/60RC1/ptmoddevg/trainingmigrguide/performance_tuning_0027.html#ZH-CN_TOPIC_0000001889766765__section132951137183219

gemma

Works.

LLaMA-3

Works.

Baichuan-2

Works.

PHI3

Errors out:

  File "/home/hadoop-friday-llm/.cache/huggingface/modules/transformers_modules/Phi-3-small-8k-instruct/tokenization_phi3_small.py", line 190, in from_pretrained
    return cls(**cls_kwargs)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 147, in load_tiktoken_bpe
    contents = read_file_cached(tiktoken_bpe_file, expected_hash)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 64, in read_file_cached
    contents = read_file(blobpath)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 25, in read_file
    resp = requests.get(blobpath)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1095, in _validate_conn
    conn.connect()
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connection.py", line 615, in connect
    self.sock = sock = self._new_conn()
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connection.py", line 203, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7f4053c11070>: Failed to resolve 'openaipublic.blob.core.windows.net' ([Errno -2] Name or service not known)

Mistral-7B-v0.1

Works.

Mixtral-8x7B-v0.1

Needs ZeRO stage 3 on 8 cards × 64 GB.

CodeLlama-7b-hf (13B)

Works.

Yi1.5

Works.

Reproduction

llamafactory

Expected behavior

I picked a set of representative models and re-ran them on the NPU, hoping all of them would pass. The Phi-3 failure is the one I would like help with: the model is confirmed to be local and was loaded via an absolute path.

Others

No response

hiyouga (Owner) commented Jun 20, 2024

cc @statelesshz

sweetning0809 (Author) commented:

Additional error:

Traceback (most recent call last):
  File "/home/hadoop-friday-llm/.cache/huggingface/modules/transformers_modules/Phi-3-small-8k-instruct/tokenization_phi3_small.py", line 190, in from_pretrained
    return cls(**cls_kwargs)
  File "/home/hadoop-friday-llm/.cache/huggingface/modules/transformers_modules/Phi-3-small-8k-instruct/tokenization_phi3_small.py", line 105, in __init__
    base = tiktoken.get_encoding("cl100k_base")
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/registry.py", line 73, in get_encoding
    enc = Encoding(**constructor())
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken_ext/openai_public.py", line 72, in cl100k_base
    mergeable_ranks = load_tiktoken_bpe(
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 147, in load_tiktoken_bpe
    contents = read_file_cached(tiktoken_bpe_file, expected_hash)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 64, in read_file_cached
    contents = read_file(blobpath)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 25, in read_file
    resp = requests.get(blobpath)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connection.py", line 615, in connect
    self.sock = sock = self._new_conn()
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connection.py", line 203, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7fa0f0927340>: Failed to resolve 'openaipublic.blob.core.windows.net' ([Errno -2] Name or service not known)

I suspect base = tiktoken.get_encoding("cl100k_base") has to access the network.

sweetning0809 (Author) commented:

I suspect base = tiktoken.get_encoding("cl100k_base") has to access the network; again the failure is in reaching openaipublic.blob.core.windows.net.

sweetning0809 (Author) commented:

> I suspect base = tiktoken.get_encoding("cl100k_base") has to access the network; again the failure is in reaching openaipublic.blob.core.windows.net.

Looking at the model files, a cl100k_base.tiktoken sits in the same directory as the weights; maybe it is not being picked up?

sweetning0809 commented Jun 20, 2024

> I suspect base = tiktoken.get_encoding("cl100k_base") has to access the network; again the failure is in reaching openaipublic.blob.core.windows.net.
>
> Looking at the model files, a cl100k_base.tiktoken sits in the same directory as the weights; maybe it is not being picked up?

Solved: tiktoken.get_encoding("cl100k_base") does require external network access. Reading the tiktoken get_encoding source shows it first computes a hash and only then fetches from the network; the hash is taken over the file's URL. So you can:

  1. export TIKTOKEN_CACHE_DIR=<a local directory>
  2. Put cl100k_base.tiktoken under TIKTOKEN_CACHE_DIR, renamed to the hash value 9b5ad71b2ce5302211f9c61530b329a4922fc6a4 (see the sketch below).
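
A minimal sketch of that workaround (cache_dir is a placeholder; tiktoken's read_file_cached keys its cache on the sha1 of the blob URL, which is where the hash above comes from):

```python
import hashlib
import os
import shutil

# URL that tiktoken would download cl100k_base from when it is not cached.
blob_url = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
cache_key = hashlib.sha1(blob_url.encode()).hexdigest()
print(cache_key)  # 9b5ad71b2ce5302211f9c61530b329a4922fc6a4

cache_dir = "/path/to/tiktoken_cache"  # placeholder; any local directory works
os.makedirs(cache_dir, exist_ok=True)
# Pre-seed the cache with the local copy of cl100k_base.tiktoken.
shutil.copy("cl100k_base.tiktoken", os.path.join(cache_dir, cache_key))

# Must be set before tiktoken.get_encoding() runs.
os.environ["TIKTOKEN_CACHE_DIR"] = cache_dir
```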

But this runs into a new problem: assert is_flash_attention_available, "Flash Attention is not available, but is needed for dense attention". The NPU cannot use flash_attention; like DeepSeek, this probably needs operator conversion.

exceedzhang commented Jun 22, 2024

@sweetning0809 Which Python version are you on? 3.10? On 3.10 I hit the problem below: training Qwen2 and LLaMA3 works, but the system reports an error, and I suspect it hurts model performance.

[image]

sweetning0809 (Author) commented:

> @sweetning0809 Which Python version are you on? 3.10? On 3.10 I hit the problem below: training Qwen2 and LLaMA3 works, but the system reports an error, and I suspect it hurts model performance. [image]

I went back through my logs and did not see anything like this; I am on Python 3.9. It looks similar to the error in Ascend/DeepSpeed@c134c39.

sweetning0809 (Author) commented:

> @sweetning0809 Which Python version are you on? 3.10? On 3.10 I hit the problem below: training Qwen2 and LLaMA3 works, but the system reports an error, and I suspect it hurts model performance. [image]

There is no check during the gradient conversion, but I don't think it affects the results. I'm not certain; you could finish training and evaluate first.

@exceedzhang
Copy link

感谢!这个错误我查了一下应该只有python3.10才会有,python3.9版本应该不会有这个问题!
image

Yangr116 commented:

Is there a training-speed comparison against the A100?

sweetning0809 (Author) commented:

> Is there a training-speed comparison against the A100?

In my tests and per the official numbers, compute utilization is generally around 50-60%, and the 910B and the A100-80G have roughly comparable peak floating-point throughput, so in theory I would not expect a big difference.

Yangr116 commented:

> In my tests and per the official numbers, compute utilization is generally around 50-60%, and the 910B and the A100-80G have roughly comparable peak floating-point throughput, so in theory I would not expect a big difference.

When I train a llama-style transformer here, training is 4-6x slower. Did you benchmark with a LoRA model?

sweetning0809 (Author) commented:

> When I train a llama-style transformer here, training is 4-6x slower. Did you benchmark with a LoRA model?

Are you using DeepSpeed? It can add a lot of communication time.

Yangr116 commented Jul 15, 2024 via email

sweetning0809 (Author) commented:

> No, I compared single-card training speed. Could you share your 910B configuration and environment?

Try raising the per-device batch size. What MFU can you reach?

glowwormX commented:

@sweetning0809 Hi, what launch parameters did you use for QWEN2-72B on 8 cards? Could you paste the llamafactory config and the deepspeed config?

sweetning0809 commented Jul 17, 2024

> @sweetning0809 Hi, what launch parameters did you use for QWEN2-72B on 8 cards? Could you paste the llamafactory config and the deepspeed config?

Just the example configs shipped in examples/, with DeepSpeed ZeRO stage 3.
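
For reference, a ZeRO-3 config sketch along the lines of LLaMA-Factory's examples/deepspeed/ds_z3_config.json (abridged from memory; check the file in the repo for the exact contents):

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```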

Yangr116 commented:

> Try raising the per-device batch size. What MFU can you reach?

Utilization here is very low, which is strange; training the same code on a V100 gives 90%+ utilization.

[image]

sweetning0809 (Author) commented Jul 20, 2024 via email
