[2021-02-04 14:59:44,368] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0]}
[2021-02-04 14:59:44,368] [INFO] [launch.py:86:main] nnodes=1, num_local_procs=1, node_rank=0
[2021-02-04 14:59:44,369] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2021-02-04 14:59:44,369] [INFO] [launch.py:102:main] dist_world_size=1
[2021-02-04 14:59:44,369] [INFO] [launch.py:104:main] Setting CUDA_VISIBLE_DEVICES=0
[2021-02-04 14:59:46,514] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
WARNING:__main__:Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
[INFO|configuration_utils.py:449] 2021-02-04 14:59:47,556 >> loading configuration file https://huggingface.co/allenai/unifiedqa-t5-11b/resolve/main/config.json from cache at /home/pajansen/.cache/huggingface/transformers/860dc660b5b7b0c49f50c4a0ee40c3935cd03d3dfea24e1b10807c87069bcb98.b9d2b0ab4e2b4b5d61c14260d2fc20610056681abdbecfeeca4336997af53ba4
[INFO|configuration_utils.py:485] 2021-02-04 14:59:47,557 >> Model config T5Config {
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 65536,
  "d_kv": 128,
  "d_model": 1024,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 24,
  "num_heads": 128,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to German: "
    },
    "translation_en_to_fr": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to French: "
    },
    "translation_en_to_ro": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to Romanian: "
    }
  },
  "transformers_version": "4.3.0.dev0",
  "use_cache": true,
  "vocab_size": 32128
}

[INFO|configuration_utils.py:449] 2021-02-04 14:59:47,753 >> loading configuration file https://huggingface.co/allenai/unifiedqa-t5-11b/resolve/main/config.json from cache at /home/pajansen/.cache/huggingface/transformers/860dc660b5b7b0c49f50c4a0ee40c3935cd03d3dfea24e1b10807c87069bcb98.b9d2b0ab4e2b4b5d61c14260d2fc20610056681abdbecfeeca4336997af53ba4
[INFO|configuration_utils.py:485] 2021-02-04 14:59:47,754 >> Model config T5Config {
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 65536,
  "d_kv": 128,
  "d_model": 1024,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 24,
  "num_heads": 128,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to German: "
    },
    "translation_en_to_fr": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
"prefix": "translate English to French: " }, "translation_en_to_ro": { "early_stopping": true, "max_length": 300, "num_beams": 4, "prefix": "translate English to Romanian: " } }, "transformers_version": "4.3.0.dev0", "use_cache": true, "vocab_size": 32128 } [INFO|tokenization_utils_base.py:1685] 2021-02-04 14:59:47,754 >> Model name 'allenai/unifiedqa-t5-11b' not found in model shortcut name list (t5-small, t5-base, t5-large, t5-3b, t5-11b). Assuming 'allenai/unifiedqa-t5-11b' is a path, a model identifier, or url to a directory containing tokenizer files. [INFO|tokenization_utils_base.py:1786] 2021-02-04 14:59:48,718 >> loading file https://huggingface.co/allenai/unifiedqa-t5-11b/resolve/main/spiece.model from cache at /home/pajansen/.cache/huggingface/transformers/89fd9a0d451f42bbb7f4ffe2c1406466a9c3fe93359ab4c65f5b6beee105c89f.3b69006860e7b5d0a63ffdddc01ddcd6b7c318a6f4fd793596552c741734c62d [INFO|tokenization_utils_base.py:1786] 2021-02-04 14:59:48,718 >> loading file https://huggingface.co/allenai/unifiedqa-t5-11b/resolve/main/tokenizer.json from cache at None [INFO|tokenization_utils_base.py:1786] 2021-02-04 14:59:48,718 >> loading file https://huggingface.co/allenai/unifiedqa-t5-11b/resolve/main/added_tokens.json from cache at None [INFO|tokenization_utils_base.py:1786] 2021-02-04 14:59:48,719 >> loading file https://huggingface.co/allenai/unifiedqa-t5-11b/resolve/main/special_tokens_map.json from cache at /home/pajansen/.cache/huggingface/transformers/5085c921692f441a9f3c3b90937633216be60968da234e8f839d8015650e9012.c94798918c92ded6aeef2d2f0e666d2cc4145eca1aa6e1336fde07f2e13e2f46 [INFO|tokenization_utils_base.py:1786] 2021-02-04 14:59:48,719 >> loading file https://huggingface.co/allenai/unifiedqa-t5-11b/resolve/main/tokenizer_config.json from cache at /home/pajansen/.cache/huggingface/transformers/f672d5c04f8f1c976619c04cc6e81b1150e00ab27489c9f4b38856e9d54c58b7.024cc07195c0ba0b51d4f80061c6115996ff26233f3d04788855b23cdf13fbd5 [INFO|modeling_utils.py:1027] 2021-02-04 14:59:49,043 >> loading weights file https://huggingface.co/allenai/unifiedqa-t5-11b/resolve/main/pytorch_model.bin from cache at /home/pajansen/.cache/huggingface/transformers/f287abf8c1cc1e83acbfbbd62d800cc2d22888ce0a77c136c5d8d70bc813f706.db0f74942af7194f63ec562faec0fbc0579b31f7892da4ea9e08101b74387616 [INFO|modeling_utils.py:1143] 2021-02-04 15:04:28,197 >> All model checkpoint weights were used when initializing T5ForConditionalGeneration. [INFO|modeling_utils.py:1151] 2021-02-04 15:04:28,197 >> All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at allenai/unifiedqa-t5-11b. If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training. [INFO|trainer.py:348] 2021-02-04 15:04:28,234 >> Using amp fp16 backend /home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/trainer.py:702: FutureWarning: `model_path` is deprecated and will be removed in a future version. Use `resume_from_checkpoint` instead. 
[INFO|integrations.py:311] 2021-02-04 15:04:28,235 >> Keeping the `optimizer` config from ds_config.json intact, ignoring any optimizer-specific cl args
[INFO|integrations.py:344] 2021-02-04 15:04:28,235 >> Keeping the `scheduler` config from ds_config.json intact, ignoring any scheduler-specific cl args
[INFO|integrations.py:389] 2021-02-04 15:04:28,235 >> Keeping the `fp16` config from ds_config.json intact, ignoring any fp16-specific cl args
[2021-02-04 15:04:28,236] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.11+4f1d827, git-hash=4f1d827, git-branch=master
[2021-02-04 15:04:52,904] [INFO] [engine.py:73:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Using /home/pajansen/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/pajansen/.cache/torch_extensions/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.8016488552093506 seconds
Adam Optimizer #0 is created with scalar arithmetic capability.
Config: alpha=0.000030, betas=(0.800000, 0.999000), weight_decay=0.000000, adam_w=1
[2021-02-04 15:04:58,783] [INFO] [engine.py:551:_configure_optimizer] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2021-02-04 15:04:58,783] [INFO] [engine.py:556:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam (
Parameter Group 0
    amsgrad: False
    betas: [0.8, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 3e-05
    weight_decay: 3e-07
)
Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=
[2021-02-04 15:04:58,784] [INFO] [engine.py:672:_configure_zero_optimizer] Creating fp16 ZeRO stage 2 optimizer
Using /home/pajansen/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/pajansen/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.4564859867095947 seconds
[2021-02-04 15:04:59,241] [INFO] [stage2.py:130:__init__] Reduce bucket size 300000000.0
[2021-02-04 15:04:59,241] [INFO] [stage2.py:131:__init__] Allgather bucket size 300000000.0
[2021-02-04 15:04:59,241] [INFO] [stage2.py:132:__init__] CPU Offload: True
group 0 param 0 = 11274422272
[2021-02-04 15:05:54,960] [INFO] [stage2.py:399:__init__] optimizer state initialized
[2021-02-04 15:05:54,961] [INFO] [engine.py:586:_configure_optimizer] DeepSpeed Final Optimizer =
[2021-02-04 15:05:54,961] [INFO] [engine.py:405:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR
[2021-02-04 15:05:54,961] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler =
[2021-02-04 15:05:54,961] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[3e-05], mom=[[0.8, 0.999]]
[2021-02-04 15:05:54,961] [INFO] [config.py:733:print] DeepSpeedEngine configuration:
[2021-02-04 15:05:54,961] [INFO] [config.py:737:print]   activation_checkpointing_config
[2021-02-04 15:05:54,961] [INFO] [config.py:737:print]   allreduce_always_fp32 ........ False
[2021-02-04 15:05:54,961] [INFO] [config.py:737:print]   amp_enabled .................. False
[2021-02-04 15:05:54,961] [INFO] [config.py:737:print]   amp_params ................... False
[2021-02-04 15:05:54,961] [INFO] [config.py:737:print]   checkpoint_tag_validation_enabled True
[2021-02-04 15:05:54,961] [INFO] [config.py:737:print]   checkpoint_tag_validation_fail False
[2021-02-04 15:05:54,961] [INFO] [config.py:737:print]   disable_allgather ............ False
[2021-02-04 15:05:54,961] [INFO] [config.py:737:print]   dump_state ................... False
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   dynamic_loss_scale_args ...... {'init_scale': 4294967296, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   elasticity_enabled ........... False
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   flops_profiler_config ........
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   fp16_enabled ................. True
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   global_rank .................. 0
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   gradient_accumulation_steps .. 1
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   gradient_clipping ............ 1.0
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   gradient_predivide_factor .... 1.0
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   initial_dynamic_scale ........ 4294967296
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   loss_scale ................... 0
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   memory_breakdown ............. False
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   optimizer_legacy_fusion ...... False
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   optimizer_name ............... adamw
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   optimizer_params ............. {'lr': 3e-05, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07}
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   pld_enabled .................. False
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   pld_params ................... False
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   prescale_gradients ........... False
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   scheduler_name ............... WarmupLR
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 3e-05, 'warmup_num_steps': 500}
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   sparse_attention ............. None
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   sparse_gradients_enabled ..... False
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   steps_per_print .............. 2000
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   tensorboard_enabled .......... False
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   tensorboard_job_name ......... DeepSpeedJobName
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   tensorboard_output_path ......
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   train_batch_size ............. 1
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   train_micro_batch_size_per_gpu 1
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   wall_clock_breakdown ......... False
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   world_size ................... 1
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   zero_allow_untested_optimizer True
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   zero_config .................. {
    "allgather_bucket_size": 300000000.0,
    "allgather_partitions": true,
    "contiguous_gradients": true,
    "cpu_offload": true,
    "elastic_checkpoint": true,
    "load_from_fp32_weights": true,
    "overlap_comm": true,
    "reduce_bucket_size": 300000000.0,
    "reduce_scatter": true,
    "stage": 2
}
[2021-02-04 15:05:54,962] [INFO] [config.py:737:print]   zero_enabled ................. True
[2021-02-04 15:05:54,963] [INFO] [config.py:737:print]   zero_optimization_stage ...... 2
[2021-02-04 15:05:54,963] [INFO] [config.py:739:print]   json = {
    "fp16": {
        "enabled": true,
        "hysteresis": 2,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "min_loss_scale": 1
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "optimizer": {
        "params": {
            "betas": [0.8, 0.999],
            "eps": 1e-08,
            "lr": 3e-05,
            "weight_decay": 3e-07
        },
        "type": "AdamW"
    },
    "scheduler": {
        "params": {
            "warmup_max_lr": 3e-05,
            "warmup_min_lr": 0,
            "warmup_num_steps": 500
        },
        "type": "WarmupLR"
    },
    "steps_per_print": 2000,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": false,
    "zero_allow_untested_optimizer": true,
    "zero_optimization": {
        "allgather_bucket_size": 300000000.0,
        "allgather_partitions": true,
        "contiguous_gradients": true,
        "cpu_offload": true,
        "overlap_comm": true,
        "reduce_bucket_size": 300000000.0,
        "reduce_scatter": true,
        "stage": 2
    }
}
Using /home/pajansen/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0005221366882324219 seconds
[INFO|trainer.py:837] 2021-02-04 15:05:54,964 >> ***** Running training *****
[INFO|trainer.py:838] 2021-02-04 15:05:54,964 >>   Num examples = 592
[INFO|trainer.py:839] 2021-02-04 15:05:54,964 >>   Num Epochs = 2
[INFO|trainer.py:840] 2021-02-04 15:05:54,964 >>   Instantaneous batch size per device = 1
parallel, distributed & accumulation) = 1 [INFO|trainer.py:842] 2021-02-04 15:05:54,964 >> Gradient Accumulation steps = 1 [INFO|trainer.py:843] 2021-02-04 15:05:54,964 >> Total optimization steps = 1184 0%| | 0/1184 [00:00 main() File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 161, in main sigkill_handler(signal.SIGTERM, None) # not coming back File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd) subprocess.CalledProcessError: Command '['/home/pajansen/anaconda3/envs/transformers-feb4-2020/bin/python', '-u', './finetune_trainer.py', '--local_rank=0', '--model_name_or_path', 'allenai/unifiedqa-t5-11b', '--output_dir', 'output_dir_compexpl-feb4-epoch2-uqa-11b-wholetree-rev', '--adam_eps', '1e-06', '--data_dir', '/home/pajansen/github/compositional-expl/data/feb4-initialtest-q693/wholetree-rev/', '--do_eval', '--do_predict', '--do_train', '--evaluation_strategy=steps', '--freeze_embeds', '--label_smoothing', '0.1', '--learning_rate', '3e-5', '--logging_first_step', '--logging_steps', '1000', '--max_source_length', '128', '--max_target_length', '128', '--num_train_epochs', '2', '--overwrite_output_dir', '--per_device_eval_batch_size', '1', '--per_device_train_batch_size', '1', '--predict_with_generate', '--sortish_sampler', '--test_max_target_length', '128', '--val_max_target_length', '128', '--warmup_steps', '5', '--deepspeed', 'ds_config.json', '--fp16']' died with . Command being timed: "deepspeed --num_gpus=1 ./finetune_trainer.py --model_name_or_path allenai/unifiedqa-t5-11b --output_dir output_dir_compexpl-feb4-epoch2-uqa-11b-wholetree-rev --adam_eps 1e-06 --data_dir /home/pajansen/github/compositional-expl/data/feb4-initialtest-q693/wholetree-rev/ --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 2 --overwrite_output_dir --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --predict_with_generate --sortish_sampler --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --deepspeed ds_config.json --fp16" User time (seconds): 1152.16 System time (seconds): 746.75 Percent of CPU this job got: 396% Elapsed (wall clock) time (h:mm:ss or m:ss): 7:58.47 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 233292336 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 0 Minor (reclaiming a frame) page faults: 108071918 Voluntary context switches: 38621 Involuntary context switches: 588867 Swaps: 0 File system inputs: 0 File system outputs: 48 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0