
Latest transformers fails to save checkpoints in multi-node training #1809

Closed
NealRichardRui opened this issue Dec 12, 2023 · 17 comments
Labels
solved This problem has been already solved

Comments

@NealRichardRui

As the title says, single-node multi-GPU fine-tuning fails with the error below (the old version trained fine):
[screenshot: error traceback]

@hiyouga hiyouga added the wontfix This will not be worked on label Dec 12, 2023
@hiyouga hiyouga closed this as not planned Dec 12, 2023
@hiyouga hiyouga removed the wontfix This will not be worked on label Dec 13, 2023
hiyouga (Owner) commented Dec 13, 2023

It's an issue related to transformers 4.36.0
huggingface/transformers#27925

@hiyouga hiyouga added the solved This problem has been already solved label Dec 13, 2023
@hiyouga hiyouga closed this as completed Dec 13, 2023
@hiyouga hiyouga added pending This problem is yet to be addressed and removed solved This problem has been already solved labels Dec 13, 2023
@hiyouga hiyouga reopened this Dec 13, 2023
hiyouga (Owner) commented Dec 13, 2023

Solution 1: install the patched transformers library

pip uninstall transformers
pip install git+https://github.com/hiyouga/transformers.git

Or use a mirror in mainland China:

pip uninstall transformers
pip install git+https://hub.nuaa.cf/hiyouga/transformers.git

Solution 2: use the stable branch
https://github.com/hiyouga/LLaMA-Factory/tree/stable

and install transformers < 4.35
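Before launching a multi-node run, a small pre-flight probe can catch the broken release early. This is only a sketch: the function name is invented here, and its sole assumption (taken from this thread) is that the 4.36.0 release, including its .dev pre-builds, carries the checkpoint-saving bug tracked in huggingface/transformers#27925.

```python
# Hypothetical pre-flight check: warn if the installed transformers release
# is the one this thread identifies as breaking multi-node checkpoint saves.
from importlib.metadata import version, PackageNotFoundError


def transformers_checkpoint_bug_present(ver: str) -> bool:
    """Return True if `ver` is the 4.36.0 release (or a 4.36.0.devN build)
    affected by huggingface/transformers#27925."""
    return ver.split(".dev")[0] == "4.36.0"


try:
    installed = version("transformers")
except PackageNotFoundError:
    installed = None  # transformers not installed in this environment

if installed and transformers_checkpoint_bug_present(installed):
    print(f"transformers {installed} is affected; install a patched build.")
```

Running this at the top of a launch script turns a silent mid-training failure into an immediate, explicit message.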

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Dec 13, 2023
@hiyouga hiyouga closed this as completed Dec 13, 2023
@hiyouga hiyouga changed the title from "更新最新代码后,指令微调qwen-7b报错" (instruction-tuning qwen-7b errors after updating to the latest code) to "Latest transformers fails to save checkpoints in multi-node training" Dec 13, 2023
@qiuxin610

> Solution 1: install the patched transformers library …
> Solution 2: use the stable branch and install transformers < 4.35

Here you need to pin the version explicitly: pip install transformers==4.36.0. The default latest 4.36.0.dev build still errors.

hiyouga (Owner) commented Dec 13, 2023

@qiuxin610 That was caused by some issues in the fork; it has been fixed.

lyk0014 commented Dec 13, 2023

[screenshot: error traceback]
Still not working.

hiyouga (Owner) commented Dec 13, 2023

@lyk0014 Fixed now, please retry.

@KelleyYin

[screenshot: error traceback] Is this the same issue? @hiyouga

hiyouga (Owner) commented Dec 21, 2023

Same issue.

@menghonghan

> Same issue.

Hi, I've already updated to the latest transformers==4.38.2, and multi-node training still fails when saving checkpoints.

@trotsky1997

> Hi, I've already updated to the latest transformers==4.38.2, and multi-node training still fails when saving checkpoints.

Has this been resolved? Same problem here.

hiyouga (Owner) commented Mar 18, 2024

Try installing the latest transformers (4.39.0.dev0).

@trotsky1997

> Try installing the latest transformers (4.39.0.dev0).

The latest version removed a module that LLaMA-Factory imports, which triggers another error:
cannot import name 'top_k_top_p_filtering' from 'transformers'

hiyouga (Owner) commented Mar 18, 2024

@trotsky1997 You also need to install trl from source.
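A defensive import shim along these lines can bridge the gap while versions settle. This is a sketch, not LLaMA-Factory's actual fix: the assumption is that older transformers releases exported top_k_top_p_filtering at the top level and that trl historically carried a copy in trl.core, so the shim tries each location in turn and leaves the name as None if neither library provides it.

```python
# Hedged compatibility sketch: resolve top_k_top_p_filtering from whichever
# library still ships it; the import paths reflect where the helper
# historically lived, not a guaranteed current API.
try:
    # transformers exported the helper at top level before it was removed
    from transformers import top_k_top_p_filtering
except ImportError:
    try:
        # trl carried a copy in trl.core for a time
        from trl.core import top_k_top_p_filtering
    except ImportError:
        # neither library provides it; callers must degrade gracefully
        top_k_top_p_filtering = None
```

Code that depends on the helper can then check `top_k_top_p_filtering is not None` once at startup instead of crashing at import time.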

@trotsky1997

> @trotsky1997 You also need to install trl from source.

Thanks, trying it now. Could I add you on WeChat?

hiyouga (Owner) commented Mar 18, 2024

@trotsky1997 There is an official discussion group.

@trotsky1997

> @trotsky1997 There is an official discussion group.

Got it, found it. Thanks!

@gonggaohan

> @trotsky1997 There is an official discussion group.

Hi, I couldn't find the group card. Could you share it?


8 participants