
Flatting Packing / maybe fix #5443 and #5426 #5458

Closed
wants to merge 2 commits

Conversation

@AlongWY (Contributor) commented Sep 17, 2024

What does this PR do?

  1. Support flatting_packing.
  2. Fix the knapsack algorithm, which may cause "Running tokenizer on dataset" to slow down progressively (#5443).
  3. Avoid incorrect truncation of supervised examples (model quality drops noticeably when training SFT with neat_packing, #5426).


@AlongWY AlongWY marked this pull request as draft September 17, 2024 19:00
if total_length >= cutoff_len:
    break

source_len, target_len = infer_seqlen(len(source_ids), len(target_ids), cutoff_len - total_length)
@AlongWY (Contributor, Author) commented Sep 17, 2024

This is what causes the Inst data to be truncated abnormally (#5426). Maybe we should introduce a new parameter to control whether truncation is allowed? My samples are two-round tool calls; if they get truncated, the model only learns to emit the tool_calls output and never the final answer. Moreover, the current truncation can cut the user and assistant content in the middle. With the mistral template, for example, it produces [INST] xxxxxxx while xxxxx[/INST] disappears, which is clearly wrong.

@hiyouga (Owner) commented:

I don't think the problem is here? Non-packing shows the same behavior.

@AlongWY (Contributor, Author) commented Sep 18, 2024

Still, I do think we need a parameter to control this, because in some cases a sample must not be truncated in the middle.

@hiyouga (Owner) commented:

If we don't truncate the prompt, where does the assistant part go?

@AlongWY (Contributor, Author) commented:

Just skip it and drop the sample.

@AlongWY (Contributor, Author) commented:

Added a parameter to control whether truncation is allowed; by default truncation is not allowed.
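
A rough sketch of the drop-instead-of-truncate behaviour discussed in this thread; the function name `fit_example` and the `allow_truncation` flag are illustrative placeholders, not the PR's actual names:

```python
from typing import List, Optional, Tuple


def fit_example(
    source_ids: List[int],
    target_ids: List[int],
    total_length: int,
    cutoff_len: int,
    allow_truncation: bool = False,  # hypothetical name for the new switch
) -> Optional[Tuple[List[int], List[int]]]:
    """Fit one (source, target) pair into the remaining packing budget.

    Returns the (possibly truncated) pair, or None when the example does not fit
    and truncation is disallowed, so the caller can drop it whole and no turn is
    cut in half.
    """
    remaining = cutoff_len - total_length
    if remaining <= 0:
        return None
    if len(source_ids) + len(target_ids) <= remaining:
        return source_ids, target_ids
    if not allow_truncation:
        return None  # drop the whole example instead of cutting it mid-turn
    # crude proportional split standing in for infer_seqlen()
    source_len = min(len(source_ids), max(1, remaining * len(source_ids) // (len(source_ids) + len(target_ids))))
    target_len = min(len(target_ids), remaining - source_len)
    return source_ids[:source_len], target_ids[:target_len]
```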

@AlongWY changed the title from "支持 Mistral 格式的 function call 和 Flatting Packing" (support Mistral-style function call and Flatting Packing) to "Flatting Packing / mistral style function call / maybe fix #5443 and #5426" on Sep 17, 2024
@AlongWY AlongWY marked this pull request as ready for review September 17, 2024 22:13
packed_input_ids.append(batch_input_ids[index])
packed_labels.append(batch_labels[index])
packed_images.append(batch_images[index])
packed_videos.append(batch_videos[index])
@AlongWY (Contributor, Author) commented:

Deferred handling: position ids are not returned here; they are assembled and returned in the collator.
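
As a rough illustration of that deferred step, the collator could look like the sketch below; the feature keys (`packed_input_ids`, `packed_labels`) and the function name are assumptions for illustration, not the PR's exact code:

```python
import torch


def flat_packing_collate(features, pad_token_id=0):
    """Hypothetical sketch: concatenate each sample's packed examples and build
    position_ids that restart at every example boundary (the step deferred to
    the collator in the comment above)."""
    input_ids, labels, position_ids = [], [], []
    for feature in features:
        flat_ids, flat_labels, flat_pos = [], [], []
        for ids, labs in zip(feature["packed_input_ids"], feature["packed_labels"]):
            flat_ids.extend(ids)
            flat_labels.extend(labs)
            flat_pos.extend(range(len(ids)))  # positions restart for each packed example
        input_ids.append(torch.tensor(flat_ids))
        labels.append(torch.tensor(flat_labels))
        position_ids.append(torch.tensor(flat_pos))
    pad = torch.nn.utils.rnn.pad_sequence
    return {
        "input_ids": pad(input_ids, batch_first=True, padding_value=pad_token_id),
        "labels": pad(labels, batch_first=True, padding_value=-100),
        "position_ids": pad(position_ids, batch_first=True, padding_value=0),
    }
```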

data_args.flatting_packing and
(getattr(model.config, "_attn_implementation", None) != "flash_attention_2")
):
    logger.warning("The `flatting_packing` only support `flash_attention_2`! Maybe cause Out of memory!")
@AlongWY (Contributor, Author) commented Sep 17, 2024

Maybe we should force-enable fa2, but by this point it is already too late.

@hiyouga (Owner) commented:

Flat packing shouldn't be strictly tied to fa2; in essence it is just a 4d attention mask.

@AlongWY (Contributor, Author) commented:

It is tied to it: packing-with-FA2 is computed directly through flash-attention, so no 4d attention mask is needed. Even though that is what it amounts to conceptually, fa2 cannot take a 4d attention mask as input; see this transformers pull request for the details.

@hiyouga (Owner) commented:

I know, their implementation ties them together, but in principle sdpa and eager would work just as well.
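
To make the point concrete, here is a small illustration of the two routes being discussed (not code from this PR): the block-diagonal 4d mask that an sdpa/eager path would consume, versus the cumulative sequence lengths that the FA2 varlen kernels rely on. Function names are illustrative.

```python
import torch


def block_diagonal_4d_mask(seq_lens, dtype=torch.float32):
    # Build the block-diagonal causal 4d attention mask an sdpa/eager path would
    # need for one packed row: tokens attend only within their own example.
    total = sum(seq_lens)
    mask = torch.full((1, 1, total, total), torch.finfo(dtype).min, dtype=dtype)
    start = 0
    for length in seq_lens:
        end = start + length
        block = mask[0, 0, start:end, start:end]
        block[torch.tril(torch.ones(length, length, dtype=torch.bool))] = 0.0
        start = end
    return mask


def cu_seqlens(seq_lens):
    # The FA2 varlen path skips the mask entirely and only needs the example
    # boundaries, e.g. [3, 5, 4] -> tensor([0, 3, 8, 12]).
    return torch.cumsum(torch.tensor([0] + list(seq_lens)), dim=0)
```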

@AlongWY (Contributor, Author) commented:

Then that might work too.

@cx9208 commented Sep 18, 2024

May I ask what the difference is between flatting_packing and neat_packing? Looking at the option description alone (Enable sequence packing with flattening), I still don't quite understand.

@AlongWY (Contributor, Author) commented Sep 18, 2024

It implements this packing-with-FA2; in my tests, this approach achieves higher training throughput than neat_packing.

@hiyouga hiyouga added the pending This problem is yet to be addressed label Sep 18, 2024
@hiyouga hiyouga self-requested a review September 18, 2024 02:43
@AlongWY (Contributor, Author) commented Sep 18, 2024

I'm still revising the mistral function call part; I'll submit it later.

@hiyouga (Owner) commented Sep 18, 2024

Could you open another PR for the function call updates?

@AlongWY (Contributor, Author) commented Sep 18, 2024

OK, then shall I reorganize the code?

2. fix knapsack, may cause hiyouga#5443
3. avoid supervised examples wrongly truncation
@AlongWY changed the title from "Flatting Packing / mistral style function call / maybe fix #5443 and #5426" to "Flatting Packing / maybe fix #5443 and #5426" on Sep 18, 2024
@AlongWY (Contributor, Author) commented Sep 18, 2024

It should be a clean commit now; the tool-call PR is #5473.

@muziyongshixin commented:
> It implements this packing-with-FA2; in my tests, this approach achieves higher training throughput than neat_packing.

Has this flatting packing been validated for convergence?

I tried neat_packing and flatting_packing on the same dataset with the same training configuration and found that flatting_packing's initial loss is significantly higher than neat_packing's (2.1 vs 0.9), flatting_packing takes more training steps than neat_packing (10454 vs 9850), and the final trained model is also worse than with neat_packing.

Model: YI-9B, lr=1e-5

@muziyongshixin commented Sep 26, 2024

> > It implements this packing-with-FA2; in my tests, this approach achieves higher training throughput than neat_packing.
>
> Has this flatting packing been validated for convergence?
>
> I tried neat_packing and flatting_packing on the same dataset with the same training configuration and found that flatting_packing's initial loss is significantly higher than neat_packing's (2.1 vs 0.9), flatting_packing takes more training steps than neat_packing (10454 vs 9850), and the final trained model is also worse than with neat_packing.
>
> Model: YI-9B, lr=1e-5

Found the cause of flatten_packing's high initial loss: transformers needs to be upgraded to the latest 4.45.0, with accelerate==0.34.2.
The initial loss is then roughly the same as neat_packing's, around 0.9; the step count drops slightly (10454 -> 10198), and the estimated training time improves slightly (150h -> 131h). I'm not sure where these changes come from.
The results after full training still need to be verified.

@juncaofish commented:
Any updates for this PR?

@Arcmoon-Hu commented:
> > > It implements this packing-with-FA2; in my tests, this approach achieves higher training throughput than neat_packing.
> >
> > Has this flatting packing been validated for convergence?
> >
> > I tried neat_packing and flatting_packing on the same dataset with the same training configuration and found that flatting_packing's initial loss is significantly higher than neat_packing's (2.1 vs 0.9), flatting_packing takes more training steps than neat_packing (10454 vs 9850), and the final trained model is also worse than with neat_packing.
> > Model: YI-9B, lr=1e-5
>
> Found the cause of flatten_packing's high initial loss: transformers needs to be upgraded to the latest 4.45.0, with accelerate==0.34.2. The initial loss is then roughly the same as neat_packing's, around 0.9; the step count drops slightly (10454 -> 10198), and the estimated training time improves slightly (150h -> 131h). I'm not sure where these changes come from. The results after full training still need to be verified.

Has anyone finished those experiments? How do the results compare?

@AlongWY (Contributor, Author) commented Oct 18, 2024

@hiyouga Is there any problem with the current implementation?

@Alwin4Zhang commented:
What is the current status? Does neat_packing + fa2 reach an equivalent training loss? In my tests so far the actual results are quite poor: the model either loops its output endlessly or emits odd text in other languages such as Korean or French, which is clearly introduced when the data is concatenated.

@AlongWY (Contributor, Author) commented Dec 13, 2024

neat_packing's concat does seem to be problematic; I handled it here, but for some reason this has never been merged.

@AlongWY AlongWY deleted the branch hiyouga:main December 25, 2024 06:43
@AlongWY AlongWY closed this Dec 25, 2024
@AlongWY AlongWY deleted the main branch December 25, 2024 06:43
Labels: pending (This problem is yet to be addressed)
7 participants