[Bug]: Qwen2.5-7B-Instruct supports 128K tokens, so why does the config say 32k? If I want to train a model with a context longer than 32k on top of Qwen2.5, what do I need to do? #1134
Replies: 3 comments 2 replies
-
Please first read the README/Modelcard: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct#processing-long-texts
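For context, the "Processing Long Texts" section of that model card extends the usable context with YaRN by adding a `rope_scaling` block to `config.json`, roughly like this (values as shown on the card):

```json
{
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```

Note the card describes this as an inference-time setting; static YaRN scaling applies uniformly regardless of input length, which is why the shipped config defaults to 32k.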
-
@jklj077 That's for inference, isn't it? If I'm doing SFT on top of it, can I just change max_position_embeddings to 128K? My training texts are fairly long.
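A minimal sketch of the change being asked about, operating on the relevant field of `config.json` as a plain dict. The 131072 target is an assumption for a "128K" setup, not official fine-tuning guidance; whether raising it alone is sufficient for SFT beyond 32k is exactly the open question here (see the model card linked above).

```python
import json

# The relevant field as shipped in Qwen2.5-7B-Instruct's config.json.
config = {"max_position_embeddings": 32768}

# Hypothetical edit before long-sequence SFT: raise the position limit
# to 128K tokens (131072 = 128 * 1024). This is an assumption, not
# confirmed guidance from the maintainers.
config["max_position_embeddings"] = 131072

print(json.dumps(config))
```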
-
Model Series
Qwen2.5
What are the models used?
Qwen2.5-7B-Instruct
What is the scenario where the problem happened?
SFT (supervised fine-tuning)
Is this a known issue?
Information about environment
no
Log output
Description
Steps to reproduce
This happens to Qwen2.5-xB-Instruct-xxx and xxx.
The problem can be reproduced with the following steps: