food101_12.txt (Food101 ViT FSDP training log, from lessw2020/transformer_framework)
******* loading model args.model='vitsmart'
******* loading model args.model='vitsmart'
******* loading model args.model='vitsmart'
******* loading model args.model='vitsmart'
--> World Size = 4
--> Device_count = 4
--> running with these defaults train_config(seed=2022, verbose=True, total_steps_to_run=None, print_memory_summary=False, num_epochs=12, model_weights_bf16=False, use_mixed_precision=False, use_low_precision_gradient_policy=False, use_tf32=False, optimizer='AnyPrecision', ap_use_kahan_summation=False, sharding_strategy=<ShardingStrategy.FULL_SHARD: 1>, print_sharding_plan=False, run_profiler=False, profile_folder='fsdp/profile_tracing', log_every=1, num_workers_dataloader=2, batch_size_training=68, fsdp_activation_checkpointing=False, run_validation=True, memory_report=True, nccl_debug_handler=True, distributed_debug=True, use_non_recursive_wrapping=False, use_tp=False, image_size=224, use_synthetic_data=False, use_pokemon_dataset=False, use_beans_dataset=False, save_model_checkpoint=False, load_model_checkpoint=False, checkpoint_max_save_count=2, save_optimizer=False, load_optimizer=False, optimizer_checkpoint_file='Adam-vit--1.pt', checkpoint_model_filename='vit--1.pt')
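For reference, the defaults dumped above correspond to a config object roughly like the sketch below. This is a minimal sketch using an illustrative subset of the fields shown in the printout; the field names and values come from the log line, but the dataclass layout itself is an assumption about how the repo's train_config is defined.

from dataclasses import dataclass
from torch.distributed.fsdp import ShardingStrategy

@dataclass
class train_config:
    # subset of the fields printed in the defaults line above
    seed: int = 2022
    num_epochs: int = 12
    batch_size_training: int = 68
    use_mixed_precision: bool = False
    optimizer: str = "AnyPrecision"
    ap_use_kahan_summation: bool = False
    sharding_strategy: ShardingStrategy = ShardingStrategy.FULL_SHARD
    fsdp_activation_checkpointing: bool = False
    run_validation: bool = True
    log_every: int = 1
    num_workers_dataloader: int = 2
    image_size: int = 224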
clearing gpu cache for all ranks
--> running with torch dist debug set to detail
--> total memory per gpu (GB) = 22.0626
policy is None
--> Prepping vit_relpos_base_patch16_rpn_224 model ...
stats is ready....? _stats=defaultdict(<class 'list'>, {}), local_rank=0, rank=0
--> vit_relpos_base_patch16_rpn_224 built.
built model with 85.722869M params
--> Warning - bf16 support not available. Using fp32
backward prefetch set to None
sharding set to ShardingStrategy.FULL_SHARD
--> Batch Size = 68
local rank 0 init time = 1.5426582659999895
memory stats reset, ready to track
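Given the settings logged above (policy is None, bf16 unavailable so the run falls back to fp32, backward prefetch None, FULL_SHARD sharding), the model wrap presumably looks roughly like the following. This is a hedged sketch, not the repo's exact code; the timm model name is taken from the log, and the launch/init details are assumptions.

import torch
import torch.distributed as dist
import timm
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

# one process per GPU, e.g. launched via torchrun --nproc_per_node=4
dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# capability check behind the "bf16 support not available" warning above
bf16_ready = torch.version.cuda and torch.cuda.is_bf16_supported() and dist.is_nccl_available()
mp_policy = None  # this run fell back to fp32
if bf16_ready:
    mp_policy = MixedPrecision(param_dtype=torch.bfloat16,
                               reduce_dtype=torch.bfloat16,
                               buffer_dtype=torch.bfloat16)

# model named in the log; Food101 has 101 classes
model = timm.create_model("vit_relpos_base_patch16_rpn_224", num_classes=101)

sharded_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # "sharding set to ShardingStrategy.FULL_SHARD"
    auto_wrap_policy=None,                          # "policy is None"
    mixed_precision=mp_policy,                      # None here: fp32 fallback
    backward_prefetch=None,                         # "backward prefetch set to None"
    device_id=torch.cuda.current_device(),
)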
Running with AnyPrecision Optimizer, momo=torch.float32, var = torch.float32, kahan summation = False
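The AnyPrecision line above, with fp32 momentum and variance and Kahan summation off, suggests an optimizer construction along these lines. Assumptions: the optimizer is the AnyPrecisionAdamW from torchdistx (the repo may instead vendor its own copy), and the learning rate is a placeholder since it is not shown in the log.

import torch
from torchdistx.optimizers import AnyPrecisionAdamW  # assumption: torchdistx build of AnyPrecision

optimizer = AnyPrecisionAdamW(
    sharded_model.parameters(),
    lr=4e-4,                        # placeholder; the actual lr is not printed in this log
    momentum_dtype=torch.float32,   # "momo=torch.float32"
    variance_dtype=torch.float32,   # "var = torch.float32"
    use_kahan_summation=False,      # "kahan summation = False"
)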
Epoch: 1 starting...
step: 1: time taken for the last 1 steps is 5.56264149499998, loss is 4.556710243225098
step: 2: time taken for the last 1 steps is 0.26052744599996913, loss is 4.726100444793701
step: 3: time taken for the last 1 steps is 0.2623039449999851, loss is 4.740593910217285
step: 4: time taken for the last 1 steps is 0.25435064299995247, loss is 4.724956512451172
step: 5: time taken for the last 1 steps is 0.25499816000001374, loss is 4.608603477478027
step: 6: time taken for the last 1 steps is 0.2577160680000361, loss is 4.639563083648682
step: 7: time taken for the last 1 steps is 0.26127692399995794, loss is 4.620149612426758
step: 8: time taken for the last 1 steps is 0.24955331299997852, loss is 4.645939826965332
step: 9: time taken for the last 1 steps is 0.2596619569999348, loss is 4.646860599517822
step: 10: time taken for the last 1 steps is 0.25984457000004113, loss is 4.529321670532227
step: 11: time taken for the last 1 steps is 0.25722897200000716, loss is 4.625144958496094
step: 12: time taken for the last 1 steps is 0.250460291999957, loss is 4.520712375640869
step: 13: time taken for the last 1 steps is 0.2587341760000754, loss is 4.676128387451172
step: 14: time taken for the last 1 steps is 0.2602442829999063, loss is 4.481432914733887
step: 15: time taken for the last 1 steps is 0.25809503999994376, loss is 4.547496795654297
step: 16: time taken for the last 1 steps is 0.2514484820000007, loss is 4.657937526702881
step: 17: time taken for the last 1 steps is 0.2547555359999478, loss is 4.442592620849609
step: 18: time taken for the last 1 steps is 0.25838225300003614, loss is 4.521785736083984
step: 19: time taken for the last 1 steps is 0.2572551509999812, loss is 4.626273155212402
step: 20: time taken for the last 1 steps is 0.24741281000001436, loss is 4.532219409942627
step: 21: time taken for the last 1 steps is 0.2491831090000005, loss is 4.586917877197266
step: 22: time taken for the last 1 steps is 0.25791516799995406, loss is 4.631921291351318
step: 23: time taken for the last 1 steps is 0.2584072229999492, loss is 4.854366302490234
step: 24: time taken for the last 1 steps is 0.2600603800000272, loss is 4.5460991859436035
step: 25: time taken for the last 1 steps is 0.25610477999998693, loss is 4.5151166915893555
step: 26: time taken for the last 1 steps is 0.252611872999978, loss is 4.580170631408691
step: 27: time taken for the last 1 steps is 0.25130365000006805, loss is 4.560575485229492
step: 28: time taken for the last 1 steps is 0.2552491110000119, loss is 4.456767559051514
step: 29: time taken for the last 1 steps is 0.25457419300005313, loss is 4.499762535095215
step: 30: time taken for the last 1 steps is 0.25539462300002924, loss is 4.523378849029541
step: 31: time taken for the last 1 steps is 0.261071601000026, loss is 4.533198356628418
step: 32: time taken for the last 1 steps is 0.25383001700004115, loss is 4.610552787780762
step: 33: time taken for the last 1 steps is 0.251812176000044, loss is 4.555049896240234
step: 34: time taken for the last 1 steps is 0.2539758379999739, loss is 4.64456033706665
step: 35: time taken for the last 1 steps is 0.25775054700000055, loss is 4.410024166107178
step: 36: time taken for the last 1 steps is 0.26112936100003026, loss is 4.489560127258301
step: 37: time taken for the last 1 steps is 0.2572888920000196, loss is 4.469703674316406
step: 38: time taken for the last 1 steps is 0.26080332800006545, loss is 4.491578102111816
step: 39: time taken for the last 1 steps is 0.2587886480000634, loss is 4.50607967376709
step: 40: time taken for the last 1 steps is 0.26139379400001417, loss is 4.486103057861328
step: 41: time taken for the last 1 steps is 0.2530396180000025, loss is 4.702195644378662
step: 42: time taken for the last 1 steps is 0.25594204800006537, loss is 4.504721641540527
step: 43: time taken for the last 1 steps is 0.25631997200002843, loss is 4.42141056060791
step: 44: time taken for the last 1 steps is 0.2520109379998985, loss is 4.46730899810791
step: 45: time taken for the last 1 steps is 0.25859651600001143, loss is 4.464085102081299
step: 46: time taken for the last 1 steps is 0.26063370600002145, loss is 4.499287128448486
step: 47: time taken for the last 1 steps is 0.2636300570000003, loss is 4.5415873527526855
step: 48: time taken for the last 1 steps is 0.2627756279999858, loss is 4.5396623611450195
step: 49: time taken for the last 1 steps is 0.25401422800007367, loss is 4.574820518493652
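The per-step lines above follow the pattern of a timing wrapper with log_every=1 from the config dump. Below is a minimal sketch of a loop that would emit output in this format; the loop body is an assumption for illustration, not the repo's actual training function.

import time
import torch
import torch.nn.functional as F

def train_one_epoch(model, optimizer, loader, device, log_every=1):
    model.train()
    t0 = time.perf_counter()
    for step, (images, labels) in enumerate(loader, start=1):
        images, labels = images.to(device), labels.to(device)
        loss = F.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step % log_every == 0:
            t1 = time.perf_counter()
            print(f"step: {step}: time taken for the last {log_every} steps is {t1 - t0}, loss is {loss.item()}")
            t0 = t1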