
Summary of NPU model training, plus questions #4388

Open · 1 task done
sweetning0809 opened this issue Jun 20, 2024 · 19 comments
Labels: npu (This problem is related to NPU devices), pending (This problem is yet to be addressed)

sweetning0809 commented Jun 20, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

QWEN2-1.5B (0.5B)

Works.

QWEN2-7B (MoE)

Needs bf16, see #4278; otherwise works.

QWEN2-72B

Works, with one caveat: it only launches on 8 cards (ZeRO stage 3); on 16 cards it OOMs. The cause still needs investigation.

glm4

Comment out the torch.jit line and use bf16; see #4339 and #3788. A sketch of the edit follows below.
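
A minimal sketch of that edit, assuming the checkpoint's modeling_chatglm.py applies @torch.jit.script to apply_rotary_pos_emb (verify against your local file; the decorator may sit elsewhere):

```python
# modeling_chatglm.py inside the GLM checkpoint (layout assumed; see #4339 / #3788)
import torch

# @torch.jit.script  # <-- comment out the TorchScript decorator for NPU training
def apply_rotary_pos_emb(x: torch.Tensor, rope_cache: torch.Tensor) -> torch.Tensor:
    ...  # original function body unchanged
```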

chatglm3

Same approach as above. Additionally, after merging the model you must copy every file from the original checkpoint folder except the *.bin weight shards and pytorch_model.bin.index.json into the merged folder; see #1307. A sketch follows below.
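
A minimal sketch of that copy step (src and dst are hypothetical paths; adjust to your checkpoint layout):

```python
import shutil
from pathlib import Path

src = Path("/models/chatglm3-6b")          # original checkpoint (hypothetical path)
dst = Path("/models/chatglm3-6b-merged")   # merged output folder (hypothetical path)

for f in src.iterdir():
    # Skip the original weight shards and their index; keep tokenizer,
    # config, and custom modeling code files.
    if f.name.endswith(".bin") or f.name == "pytorch_model.bin.index.json":
        continue
    if f.is_file():
        shutil.copy(f, dst / f.name)
```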

DeepSeek (MoE)

Fails; the model needs operator conversion, see: https://www.hiascend.com/document/detail/zh/Pytorch/60RC1/ptmoddevg/trainingmigrguide/performance_tuning_0027.html#ZH-CN_TOPIC_0000001889766765__section132951137183219

gemma

Works.

LLaMA-3

Works.

Baichuan-2

Works.

PHI3

Errors out:

  File "/home/hadoop-friday-llm/.cache/huggingface/modules/transformers_modules/Phi-3-small-8k-instruct/tokenization_phi3_small.py", line 190, in from_pretrained
    return cls(**cls_kwargs)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 147, in load_tiktoken_bpe
    contents = read_file_cached(tiktoken_bpe_file, expected_hash)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 64, in read_file_cached
    contents = read_file(blobpath)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 25, in read_file
    resp = requests.get(blobpath)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1095, in _validate_conn
    conn.connect()
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connection.py", line 615, in connect
    self.sock = sock = self._new_conn()
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connection.py", line 203, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7f4053c11070>: Failed to resolve 'openaipublic.blob.core.windows.net' ([Errno -2] Name or service not known)

Mistral-7B-v0.1

Works.

Mixtral-8x7B-v0.1

Needs ZeRO stage 3 on 8 cards × 64 GB.

CodeLlama-7b-hf (13B)

Works.

Yi1.5

Works.

Reproduction

llamafactory

Expected behavior

I picked a set of representative models and re-ran them on the NPU, hoping all of them would pass. The Phi-3 failure is the one I would like help with: the model is confirmed to be local and was loaded via an absolute path.

Others

No response

hiyouga (Owner) commented Jun 20, 2024

cc @statelesshz

sweetning0809 (Author) commented:

Additional error:

Traceback (most recent call last):
  File "/home/hadoop-friday-llm/.cache/huggingface/modules/transformers_modules/Phi-3-small-8k-instruct/tokenization_phi3_small.py", line 190, in from_pretrained
    return cls(**cls_kwargs)
  File "/home/hadoop-friday-llm/.cache/huggingface/modules/transformers_modules/Phi-3-small-8k-instruct/tokenization_phi3_small.py", line 105, in __init__
    base = tiktoken.get_encoding("cl100k_base")
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/registry.py", line 73, in get_encoding
    enc = Encoding(**constructor())
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken_ext/openai_public.py", line 72, in cl100k_base
    mergeable_ranks = load_tiktoken_bpe(
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 147, in load_tiktoken_bpe
    contents = read_file_cached(tiktoken_bpe_file, expected_hash)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 64, in read_file_cached
    contents = read_file(blobpath)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 25, in read_file
    resp = requests.get(blobpath)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connection.py", line 615, in connect
    self.sock = sock = self._new_conn()
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connection.py", line 203, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7fa0f0927340>: Failed to resolve 'openaipublic.blob.core.windows.net' ([Errno -2] Name or service not known)

I suspect base = tiktoken.get_encoding("cl100k_base") has to access the network.

sweetning0809 (Author) commented:

I suspect base = tiktoken.get_encoding("cl100k_base") has to access the network; again the failure is in reaching openaipublic.blob.core.windows.net.

sweetning0809 (Author) commented:

> I suspect base = tiktoken.get_encoding("cl100k_base") has to access the network; again the failure is in reaching openaipublic.blob.core.windows.net.

Looking at the model files, a cl100k_base.tiktoken sits in the same directory as the weights; maybe it is not being picked up?

sweetning0809 commented Jun 20, 2024

> I suspect base = tiktoken.get_encoding("cl100k_base") has to access the network; again the failure is in reaching openaipublic.blob.core.windows.net.
>
> Looking at the model files, a cl100k_base.tiktoken sits in the same directory as the weights; maybe it is not being picked up?

Solved: tiktoken.get_encoding("cl100k_base") does require external network access. Reading the tiktoken get_encoding source shows it first computes a hash and only then fetches from the network; the hash is taken over the file's URL. So you can:

  1. export TIKTOKEN_CACHE_DIR=<a local directory>
  2. Put cl100k_base.tiktoken under TIKTOKEN_CACHE_DIR, renamed to the hash value 9b5ad71b2ce5302211f9c61530b329a4922fc6a4 (see the sketch below).
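
A minimal sketch of that workaround (cache_dir is a placeholder; tiktoken's read_file_cached keys its cache on the sha1 of the blob URL, which is where the hash above comes from):

```python
import hashlib
import os
import shutil

# URL that tiktoken would download cl100k_base from when it is not cached.
blob_url = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
cache_key = hashlib.sha1(blob_url.encode()).hexdigest()
print(cache_key)  # 9b5ad71b2ce5302211f9c61530b329a4922fc6a4

cache_dir = "/path/to/tiktoken_cache"  # placeholder; any local directory works
os.makedirs(cache_dir, exist_ok=True)
# Pre-seed the cache with the local copy of cl100k_base.tiktoken.
shutil.copy("cl100k_base.tiktoken", os.path.join(cache_dir, cache_key))

# Must be set before tiktoken.get_encoding() runs.
os.environ["TIKTOKEN_CACHE_DIR"] = cache_dir
```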

But this runs into a new problem: assert is_flash_attention_available, "Flash Attention is not available, but is needed for dense attention". The NPU cannot use flash_attention; like DeepSeek, this probably needs operator conversion.

exceedzhang commented Jun 22, 2024

@sweetning0809 Which Python version are you on? 3.10? On 3.10 I hit the problem below: training Qwen2 and LLaMA3 works, but the system reports an error, and I suspect it hurts model performance.

[image]

sweetning0809 (Author) commented:

> @sweetning0809 Which Python version are you on? 3.10? On 3.10 I hit the problem below: training Qwen2 and LLaMA3 works, but the system reports an error, and I suspect it hurts model performance. [image]

I went back through my logs and did not see anything like this; I am on Python 3.9. It looks similar to the error in Ascend/DeepSpeed@c134c39.

sweetning0809 (Author) commented:

> @sweetning0809 Which Python version are you on? 3.10? On 3.10 I hit the problem below: training Qwen2 and LLaMA3 works, but the system reports an error, and I suspect it hurts model performance. [image]

There is no check during the gradient conversion, but I don't think it affects the results. I'm not certain; you could finish training and evaluate first.

@exceedzhang
Copy link

感谢!这个错误我查了一下应该只有python3.10才会有,python3.9版本应该不会有这个问题!
image

Yangr116 commented:

Is there a training-speed comparison against the A100?

sweetning0809 (Author) commented:

> Is there a training-speed comparison against the A100?

In my tests and per the official numbers, compute utilization is generally around 50-60%, and the 910B and the A100-80G have roughly comparable peak floating-point throughput, so in theory I would not expect a big difference.

Yangr116 commented:

> In my tests and per the official numbers, compute utilization is generally around 50-60%, and the 910B and the A100-80G have roughly comparable peak floating-point throughput, so in theory I would not expect a big difference.

When I train a llama-style transformer here, training is 4-6x slower. Did you benchmark with a LoRA model?

sweetning0809 (Author) commented:

> When I train a llama-style transformer here, training is 4-6x slower. Did you benchmark with a LoRA model?

Are you using DeepSpeed? It can add a lot of communication time.

Yangr116 commented Jul 15, 2024 via email

sweetning0809 (Author) commented:

> No, I compared single-card training speed. Could you share your 910B configuration and environment?

Try raising the per-device batch size. What MFU can you reach?

glowwormX commented:

@sweetning0809 Hi, what launch parameters did you use for QWEN2-72B on 8 cards? Could you paste the llamafactory config and the deepspeed config?

sweetning0809 commented Jul 17, 2024

> @sweetning0809 Hi, what launch parameters did you use for QWEN2-72B on 8 cards? Could you paste the llamafactory config and the deepspeed config?

Just the example configs shipped in examples/, with DeepSpeed ZeRO stage 3.
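
For reference, a ZeRO-3 config sketch along the lines of LLaMA-Factory's examples/deepspeed/ds_z3_config.json (abridged from memory; check the file in the repo for the exact contents):

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```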

Yangr116 commented:

> Try raising the per-device batch size. What MFU can you reach?

Utilization here is very low, which is strange; training the same code on a V100 gives 90%+ utilization.

[image]

sweetning0809 (Author) commented Jul 20, 2024 via email
