Unable to make the deepspeed zero3 integration work with falcon7b #739
I think it's an issue with the …

A fix is probably required on the …
OK, I think I've traced it to an issue with ZeRO-3 and two instances of `AutoModelForCausalLMWithValueHead`. Here's a minimal repro:

```python
# falcon_zero3_bug.py
from trl import AutoModelForCausalLMWithValueHead

print("Loading CLM")
model_clm = AutoModelForCausalLMWithValueHead.from_pretrained("tiiuae/falcon-rw-1b")
print("Loaded CLM!")

print("Loading VH...")
model_vh = AutoModelForCausalLMWithValueHead.from_pretrained("tiiuae/falcon-rw-1b")
print("Loaded VH!")
```

Command to reproduce the error:

```shell
accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero3.yaml falcon_zero3_bug.py
```

Curiously, there isn't a problem with instantiating two …
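The failure mode is consistent with how ZeRO-3 init works: with `zero3_init_flag: true`, Accelerate wraps model construction in `deepspeed.zero.Init`, which globally patches `torch.nn.Module.__init__` so parameters are partitioned as they are created. A deepspeed-free toy sketch of why a *stacking* global patch can make a second instantiation behave differently (all names below are illustrative, not trl/deepspeed APIs):

```python
# Toy analogy, no deepspeed required: a global patch of a class's __init__,
# applied once per model build. The second application wraps the already-
# wrapped __init__, so state from the first build leaks into the second.
class Model:
    def __init__(self):
        self.params = ["w"]

def partition_on_init(cls):
    # Mimics a zero.Init-style post-init hook that shards parameters.
    orig = cls.__init__
    def patched(self):
        orig(self)
        self.params = [p + ":sharded" for p in self.params]
    cls.__init__ = patched

partition_on_init(Model)
partition_on_init(Model)  # second patch stacks on top of the first
m = Model()
print(m.params)  # -> ['w:sharded:sharded'] — sharded twice
```

This is only an analogy for the double-instantiation symptom, not a claim about the exact code path inside deepspeed.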
@pacman100 which …

Here's my complete env:
Aha, I figured out the root cause: it's coming from the `zero3_init_flag` setting in the accelerate config. Changing it as follows:

```diff
 compute_environment: LOCAL_MACHINE
 debug: false
 deepspeed_config:
   gradient_accumulation_steps: 1
   offload_optimizer_device: none
   offload_param_device: none
-  zero3_init_flag: true
+  zero3_init_flag: false
   zero3_save_16bit_model: true
   zero_stage: 3
 distributed_type: DEEPSPEED
 downcast_bf16: 'no'
 machine_rank: 0
 main_training_function: main
 mixed_precision: 'no'
 num_machines: 1
 num_processes: 8
 rdzv_backend: static
 same_network: true
 tpu_env: []
 tpu_use_cluster: false
 tpu_use_sudo: false
 use_cpu: false
```

and then running the following works:
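For anyone who wants to toggle this without hand-editing the YAML, here's a minimal stdlib-only sketch (the helper name and the assumption that the flag appears as a plain `zero3_init_flag: <bool>` line, as in the config above, are mine — this is not a general YAML editor):

```python
# Sketch: flip zero3_init_flag to false in an accelerate config, preserving
# indentation. Assumes the flag is written on its own "zero3_init_flag: ..."
# line, as in the YAML shown above.
def disable_zero3_init(config_text: str) -> str:
    out = []
    for line in config_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("zero3_init_flag:"):
            indent = line[: len(line) - len(stripped)]
            out.append(indent + "zero3_init_flag: false")
        else:
            out.append(line)
    return "\n".join(out)
```

Usage would be something like `path.write_text(disable_zero3_init(path.read_text()))` with a `pathlib.Path` to the config file.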
I think the solution is to adjust the …

[Figure: comparison plot to GPT-2 runs]
Thank you @lewtun for the deep dive. I was also looking into this, and the weird part is that the minimal code you gave works sometimes and fails at other times.
To make …
Perhaps this is fixed now by #758, but I'm not sure.