Description
Hi, when I set fsdp_reshard_after_forward: False, training speed (tokens_per_second_per_gpu) increased by approximately 5-7%. Are there any other configuration options that affect performance, or is there a reference you recommend for tuning them?
In addition, changing gradient_accumulation_steps does not affect the speed. Generally speaking, I would expect a larger value to reduce the frequency of communication and speed up training. The model used in the experiment is Qwen 2.5 3B.
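To make the expectation about gradient_accumulation_steps concrete, here is a toy cost model in Python. It is only a sketch under stated assumptions: the function name, the timing values, and the `sync_every_micro_batch` flag are all hypothetical, compute and communication are assumed not to overlap, and nothing here is measured from the actual run or taken from the training framework. It compares two cases: gradients are synchronized after every micro-batch, or only once per optimizer step.

```python
def tokens_per_second(accum_steps, tokens_per_micro_batch,
                      compute_s, comm_s, sync_every_micro_batch):
    """Toy per-GPU throughput for one optimizer step.

    Assumes no overlap between compute and communication.
    All inputs are illustrative values, not measurements.
    """
    # Number of gradient synchronizations within one optimizer step.
    n_syncs = accum_steps if sync_every_micro_batch else 1
    step_time = accum_steps * compute_s + n_syncs * comm_s
    return accum_steps * tokens_per_micro_batch / step_time


# Hypothetical numbers: 8192 tokens per micro-batch, 1.0 s of compute
# and 0.25 s of communication per synchronization.
for accum in (1, 4, 16):
    every = tokens_per_second(accum, 8192, 1.0, 0.25, True)
    once = tokens_per_second(accum, 8192, 1.0, 0.25, False)
    print(f"accum={accum:2d}  sync every micro-batch: {every:8.1f}  "
          f"sync once per step: {once:8.1f}")
```

In this model, throughput only improves with a larger gradient_accumulation_steps when synchronization happens once per optimizer step. If it happens after every micro-batch, the communication cost scales with the accumulation count and the tokens-per-second figure stays flat, which would match the behavior described above. Whether that is what happens in this setup is an open question, not something the model shows.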