Here is my question: I have more than 4 GPUs available to run train.py, but it still runs out of memory. When I checked GPU memory usage, I found that one of the GPUs overflows and triggers the error. How can I solve this?
#23 · Open · z1968357787 opened this issue on Apr 6, 2023 · 0 comments
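Since the question is which GPU overflows and why, here is a minimal diagnostic sketch (the helper name and its call site are my own, not part of the repo's train.py) that prints each rank's memory right before `accelerator.backward(loss)`, so the rank that fills up first is visible in the output:

```python
# Hypothetical helper, not part of the original train.py: call it just before
# accelerator.backward(loss) to see how much GPU memory each rank is holding.
import torch

def log_gpu_memory(rank: int, step: int) -> None:
    allocated = torch.cuda.memory_allocated() / 1024**3    # GiB currently allocated by tensors
    reserved = torch.cuda.memory_reserved() / 1024**3      # GiB held by the caching allocator
    peak = torch.cuda.max_memory_allocated() / 1024**3     # GiB peak allocation so far
    print(f"[rank {rank}] step {step}: allocated={allocated:.2f} GiB, "
          f"reserved={reserved:.2f} GiB, peak={peak:.2f} GiB", flush=True)
```

The full log from the failing run follows; notes on the ZeRO-2 bucket sizes that match the failed 954 MiB allocation come after it.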
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Traceback (most recent call last):
File "", line 1, in
FileNotFoundError: [Errno 2] No such file or directory: '/home/cike/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/551a50efec3acc5a9b94de8ec46d33d0f81919f7/modeling_chatglm.py'
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:15<00:00, 1.92s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:15<00:00, 1.95s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:15<00:00, 1.96s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:15<00:00, 1.97s/it]
trainable_params:22020096 (0.35%), non_trainable_params:6255206400
trainable_params:22020096 (0.35%), non_trainable_params:6255206400
trainable_params:22020096 (0.35%), non_trainable_params:6255206400
trainable_params:22020096 (0.35%), non_trainable_params:6255206400
[2023-04-06 08:33:02,712] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-04-06 08:33:02,747] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-04-06 08:33:02,756] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-04-06 08:33:02,874] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-04-06 08:33:24,442] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-04-06 08:33:24,445] [INFO] [logging.py:93:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-04-06 08:33:24,445] [INFO] [logging.py:93:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-04-06 08:33:24,509] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2023-04-06 08:33:24,509] [INFO] [utils.py:55:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2023-04-06 08:33:24,509] [INFO] [logging.py:93:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2023-04-06 08:33:24,510] [INFO] [stage_1_and_2.py:144:__init__] Reduce bucket size 500,000,000
[2023-04-06 08:33:24,510] [INFO] [stage_1_and_2.py:145:__init__] Allgather bucket size 500,000,000
[2023-04-06 08:33:24,510] [INFO] [stage_1_and_2.py:146:__init__] CPU Offload: False
[2023-04-06 08:33:24,510] [INFO] [stage_1_and_2.py:147:__init__] Round robin gradient partitioning: False
Using /home/cike/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /home/cike/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /home/cike/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /home/cike/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Emitting ninja build file /home/cike/.cache/torch_extensions/py39_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.35059165954589844 seconds
Loading extension module utils...
Time to load utils op: 0.40517687797546387 seconds
Loading extension module utils...
Time to load utils op: 0.40523695945739746 seconds
Loading extension module utils...
Time to load utils op: 0.40430521965026855 seconds
Rank: 3 partition count [4] and sizes[(5505024, False)]
Rank: 2 partition count [4] and sizes[(5505024, False)]
Rank: 0 partition count [4] and sizes[(5505024, False)]
Rank: 1 partition count [4] and sizes[(5505024, False)]
Using /home/cike/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00042438507080078125 seconds
Using /home/cike/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /home/cike/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Time to load utils op: 0.00040459632873535156 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00035071372985839844 seconds
0%| | 0/12241 [00:00<?, ?it/s][2023-04-06 08:33:25,838] [INFO] [utils.py:829:see_memory_usage] Before initializing optimizer states
[2023-04-06 08:33:25,839] [INFO] [utils.py:830:see_memory_usage] MA 11.71 GB Max_MA 11.72 GB CA 11.75 GB Max_CA 12 GB
[2023-04-06 08:33:25,839] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 12.0 GB, percent = 4.8%
[2023-04-06 08:33:26,025] [INFO] [utils.py:829:see_memory_usage] After initializing optimizer states
[2023-04-06 08:33:26,025] [INFO] [utils.py:830:see_memory_usage] MA 11.76 GB Max_MA 11.82 GB CA 11.85 GB Max_CA 12 GB
[2023-04-06 08:33:26,026] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 12.0 GB, percent = 4.8%
[2023-04-06 08:33:26,026] [INFO] [stage_1_and_2.py:520:__init__] optimizer state initialized
[2023-04-06 08:33:26,092] [INFO] [utils.py:829:see_memory_usage] After initializing ZeRO optimizer
[2023-04-06 08:33:26,093] [INFO] [utils.py:830:see_memory_usage] MA 11.76 GB Max_MA 11.76 GB CA 11.85 GB Max_CA 12 GB
[2023-04-06 08:33:26,093] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 12.0 GB, percent = 4.8%
[2023-04-06 08:33:26,094] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2023-04-06 08:33:26,095] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-04-06 08:33:26,095] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2023-04-06 08:33:26,095] [INFO] [logging.py:93:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.999)]
[2023-04-06 08:33:26,096] [INFO] [config.py:1018:print] DeepSpeedEngine configuration:
[2023-04-06 08:33:26,096] [INFO] [config.py:1022:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-04-06 08:33:26,096] [INFO] [config.py:1022:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-04-06 08:33:26,096] [INFO] [config.py:1022:print] amp_enabled .................. False
[2023-04-06 08:33:26,096] [INFO] [config.py:1022:print] amp_params ................... False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] bfloat16_enabled ............. True
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] checkpoint_parallel_write_pipeline False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] checkpoint_tag_validation_enabled True
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] checkpoint_tag_validation_fail False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f15140266a0>
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] communication_data_type ...... None
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] curriculum_enabled_legacy .... False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] curriculum_params_legacy ..... False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] data_efficiency_enabled ...... False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] dataloader_drop_last ......... False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] disable_allgather ............ False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] dump_state ................... False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] dynamic_loss_scale_args ...... None
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] eigenvalue_enabled ........... False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] eigenvalue_gas_boundary_resolution 1
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] eigenvalue_layer_num ......... 0
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] eigenvalue_max_iter .......... 100
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] eigenvalue_stability ......... 1e-06
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] eigenvalue_tol ............... 0.01
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] eigenvalue_verbose ........... False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] elasticity_enabled ........... False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] fp16_auto_cast ............... None
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] fp16_enabled ................. False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] fp16_master_weights_and_gradients False
[2023-04-06 08:33:26,097] [INFO] [config.py:1022:print] global_rank .................. 0
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] grad_accum_dtype ............. None
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] gradient_accumulation_steps .. 8
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] gradient_clipping ............ 0.0
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] gradient_predivide_factor .... 1.0
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] initial_dynamic_scale ........ 1
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] load_universal_checkpoint .... False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] loss_scale ................... 1.0
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] memory_breakdown ............. False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] optimizer_legacy_fusion ...... False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] optimizer_name ............... None
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] optimizer_params ............. None
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] pld_enabled .................. False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] pld_params ................... False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] prescale_gradients ........... False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] scheduler_name ............... None
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] scheduler_params ............. None
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] sparse_attention ............. None
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] sparse_gradients_enabled ..... False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] steps_per_print .............. inf
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] train_batch_size ............. 32
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] train_micro_batch_size_per_gpu 1
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] use_node_local_storage ....... False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] wall_clock_breakdown ......... False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] world_size ................... 4
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] zero_allow_untested_optimizer True
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] zero_enabled ................. True
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] zero_force_ds_cpu_optimizer .. True
[2023-04-06 08:33:26,098] [INFO] [config.py:1022:print] zero_optimization_stage ...... 2
[2023-04-06 08:33:26,099] [INFO] [config.py:1007:print_user_config] json = {
"train_batch_size": 32,
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 8,
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "none"
},
"offload_param": {
"device": "none"
},
"stage3_gather_16bit_weights_on_model_save": false
},
"steps_per_print": inf,
"bf16": {
"enabled": true
},
"fp16": {
"enabled": false
},
"zero_allow_untested_optimizer": true
}
Using /home/cike/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00032019615173339844 seconds
loss: 2.640625: 0%| | 1/12241 [00:04<14:19:20, 4.21s/it]
Traceback (most recent call last):
File "/home/cike/zzp/LoRA/ChatGLM-finetune-LoRA/train.py", line 220, in
accelerator.backward(loss)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/accelerate/accelerator.py", line 1677, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
self.engine.backward(loss, **kwargs)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2008, in backward
self.allreduce_gradients()
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1918, in allreduce_gradients
self.optimizer.overlapping_partition_gradients_reduce_epilogue()
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 834, in overlapping_partition_gradients_reduce_epilogue
self.independent_gradient_partition_epilogue()
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 720, in independent_gradient_partition_epilogue
self.reduce_ipg_grads()
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1287, in reduce_ipg_grads
self.average_tensor(self.ipg_buffer[self.ipg_index])
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1018, in average_tensor
tensor_to_reduce = tensor.to(self.communication_data_type)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 954.00 MiB (GPU 1; 15.90 GiB total capacity; 12.84 GiB already allocated; 927.75 MiB free; 14.17 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
loss: 2.75: 0%| | 1/12241 [00:04<15:35:54, 4.59s/it]
Traceback (most recent call last):
File "/home/cike/zzp/LoRA/ChatGLM-finetune-LoRA/train.py", line 220, in
accelerator.backward(loss)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/accelerate/accelerator.py", line 1677, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
self.engine.backward(loss, **kwargs)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2008, in backward
self.allreduce_gradients()
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1918, in allreduce_gradients
self.optimizer.overlapping_partition_gradients_reduce_epilogue()
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 834, in overlapping_partition_gradients_reduce_epilogue
self.independent_gradient_partition_epilogue()
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 720, in independent_gradient_partition_epilogue
self.reduce_ipg_grads()
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1287, in reduce_ipg_grads
self.average_tensor(self.ipg_buffer[self.ipg_index])
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1018, in average_tensor
tensor_to_reduce = tensor.to(self.communication_data_type)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 954.00 MiB (GPU 3; 15.90 GiB total capacity; 12.88 GiB already allocated; 903.75 MiB free; 14.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 107844 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 107846 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 107845) of binary: /home/cike/anaconda/envs/lora/bin/python
Traceback (most recent call last):
File "/home/cike/anaconda/envs/lora/bin/accelerate", line 8, in
sys.exit(main())
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/accelerate/commands/launch.py", line 908, in launch_command
deepspeed_launcher(args)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/accelerate/commands/launch.py", line 647, in deepspeed_launcher
distrib_run.run(args)
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/cike/anaconda/envs/lora/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
[1]:
time : 2023-04-06_08:33:33
host : 4d9275d5570f
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 107847)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2023-04-06_08:33:33
host : 4d9275d5570f
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 107845)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
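The numbers in the traceback line up with the ZeRO-2 communication buffers rather than with activations: `reduce_bucket_size` and `allgather_bucket_size` are both 500,000,000 elements, and a bf16 copy of a 500,000,000-element bucket is about 954 MiB, the exact allocation that fails inside `average_tensor` on a 16 GiB card that already holds the ~6.3B-parameter model in bf16 (MA 11.7 GB per the `see_memory_usage` lines). Since only ~22M LoRA parameters are trainable, buckets that large gain little. A sketch of reduced settings is below; the values are illustrative and untested on this setup, not a verified fix:

```python
# Illustrative DeepSpeed config overrides (my own guesses, not tested here): shrink the
# ZeRO-2 buckets so each communication buffer is tens of MiB instead of ~1 GiB, and
# optionally push optimizer state to CPU. With Accelerate, the same keys belong in the
# ds_config JSON file that the accelerate config points to.
ds_config_overrides = {
    "zero_optimization": {
        "stage": 2,
        "reduce_bucket_size": 50_000_000,        # 5e7 bf16 elements ≈ 95 MiB per bucket
        "allgather_bucket_size": 50_000_000,
        "contiguous_gradients": True,
        "offload_optimizer": {"device": "cpu"},  # optional: trades GPU memory for PCIe traffic
    },
}
```

If fragmentation is also a factor, the error message's own suggestion applies: launching with something like `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128` (the value is a guess) can reduce fragmentation in the caching allocator.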