GPU memory cost #19

Open
VLadImirluren opened this issue Jul 27, 2024 · 10 comments

Comments

@VLadImirluren

Could you share the memory cost for the different stages and datasets?

I am running into a problem where training the fine VAE on an 80GB GPU with batch_size=1 goes OOM at epoch 0.

@VLadImirluren
Author

I met this problem:
[screenshot of the error]

I know you have written code to skip OOM batches:
[screenshot of the OOM-skipping code]

But it only prints a warning: there is no error, no failure, and no termination, yet the GPU memory is not released and training does not move forward...

How should I deal with this problem?
(Other than handling it by hand. I am reproducing the results shown in the paper, so loading your checkpoint is not a solution either.)
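
For reference, this is roughly the pattern I would expect the skip logic to need: a minimal sketch assuming a PyTorch Lightning training_step, with a hypothetical compute_loss helper, not the actual XCube code:

```python
import gc
import torch
import pytorch_lightning as pl

class FineVAEModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        try:
            # compute_loss is a hypothetical stand-in for the real forward + loss
            return self.compute_loss(batch)
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise
            # Only warning is not enough: tensors created before the failure
            # still hold GPU memory until they are garbage collected and the
            # caching allocator gives its blocks back.
            print(f"OOM on batch {batch_idx}, skipping it")
            del batch
            gc.collect()
            torch.cuda.empty_cache()
            return None  # Lightning skips the optimizer step for this batch
```

Without explicit cleanup like this, the next batch starts with the GPU already full, which would explain the stall I am seeing.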

@VLadImirluren
Author

Fine-tuning from your ckpt still hits a RuntimeError:

nohup: ignoring input
2024-07-27 14:39:26.122 | INFO | __main__:<module>:171 - This is train_auto.py! Please note that you should use 300 instead of 300.0 for resuming.
git root error: Cmd('git') failed due to: exit code(128)
cmdline: git rev-parse --show-toplevel
stderr: 'fatal: detected dubious ownership in repository at '/mnt/pfs/users/dengken/code/XCube'
To add an exception for this directory, call:

git config --global --add safe.directory /mnt/pfs/users/dengken/code/XCube'

git root error: Cmd('git') failed due to: exit code(128)
cmdline: git rev-parse --show-toplevel
stderr: 'fatal: detected dubious ownership in repository at '/mnt/pfs/users/dengken/code/XCube'
To add an exception for this directory, call:

git config --global --add safe.directory /mnt/pfs/users/dengken/code/XCube'

wandb: Currently logged in as: 13532152291 (13532152291-sun-yat-sen-university). Use wandb login --relogin to force relogin
wandb: - Waiting for wandb.init()...
wandb: \ Waiting for wandb.init()...
wandb: Tracking run with wandb version 0.17.3
wandb: Run data is saved locally in ../wandb/wandb/run-20240727_143927-afca2fj3
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run chair_VAE_sparse/512_to_128-kld-1.0
wandb: ⭐️ View project at https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet
wandb: 🚀 View run at https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet/runs/afca2fj3
[rank: 0] Global seed set to 0
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2024-07-27 14:39:42.125 | INFO | xcube.modules.autoencoding.sunet:init:241 - latent dim: 8
[rank: 0] Global seed set to 0
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1

distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

dk-process-data-master-0:84258:84258 [0] NCCL INFO Bootstrap : Using eth0:172.16.28.236<0>
dk-process-data-master-0:84258:84258 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
dk-process-data-master-0:84258:84258 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
dk-process-data-master-0:84258:84258 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
dk-process-data-master-0:84258:84258 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
dk-process-data-master-0:84258:84258 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda12.3
Restoring states from the checkpoint path at /mnt/pfs/users/dengken/code/XCube/checkpoints/chair_download/fine_vae/last.ckpt
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1452: UserWarning: Be aware that when using ckpt_path, callbacks used to create the checkpoint need to be provided during Trainer instantiation. Please add the following callbacks: ["ModelCheckpoint{'monitor': 'val_step', 'mode': 'max', 'every_n_train_steps': 5000, 'every_n_epochs': 0, 'train_time_interval': None}"].
rank_zero_warn(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type                | Params
--------------------------------------------
0 | encoder | Encoder             | 1.2 K
1 | unet    | StructPredictionNet | 3.8 M
2 | loss    | Loss                | 0
--------------------------------------------
3.8 M     Trainable params
0         Non-trainable params
3.8 M     Total params
15.203    Total estimated model params size (MB)
Restored all states from the checkpoint file at /mnt/pfs/users/dengken/code/XCube/checkpoints/chair_download/fine_vae/last.ckpt

======= MODEL HYPER-PARAMETERS ======= <<<<
exec: null
include: null
test_set_shuffle: false
batch_size: 1
accumulate_grad_batches: 32
visualize: false
name: shapenet/chair_VAE_sparse
model: autoencoder
tree_depth: 3
voxel_size:
- 0.0025
- 0.0025
- 0.0025
resolution: 512
use_fvdb_loader: true
use_hash_tree: true
use_input_normal: true
use_input_semantic: false
use_input_intensity: false
cut_ratio: 16
kl_weight: 1.0
normalize_kld: true
enable_anneal: false
kl_weight_min: 1.0e-07
kl_weight_max: 1.0
anneal_star_iter: 0
anneal_end_iter: 70000
supervision:
  structure_weight: 20.0
  normal_weight: 300.0
  color_weight: 0.0
  semantic_weight: 0.0
optimizer: Adam
learning_rate:
  init: 0.0001
  decay_mult: 0.7
  decay_step: 50000
  clip: 1.0e-06
weight_decay: 0.0
grad_clip: 0.5
network:
  encoder:
    c_dim: 32
  unet:
    target: StructPredictionNet
    params:
      in_channels: 32
      num_blocks: 3
      f_maps: 32
      neck_dense_type: UNCHANGED
      neck_bound:
      - 64
      - 64
      - 64
      num_res_blocks: 1
      use_residual: false
      order: gcr
      is_add_dec: false
      use_attention: false
      use_checkpoint: false
_shapenet_path: ../data/shapenet/
_shapenet_categories:
- '03001627'
_shapenet_custom_name: shapenet
train_dataset: ShapeNetDataset
train_val_num_workers: 0
train_kwargs:
  onet_base_path: ../data/shapenet/
  resolution: 512
  categories:
  - '03001627'
  custom_name: shapenet
  split: train
  random_seed: 0
val_dataset: ShapeNetDataset
val_kwargs:
  onet_base_path: ../data/shapenet/
  resolution: 512
  categories:
  - '03001627'
  custom_name: shapenet
  split: val
  random_seed: fixed
test_dataset: ShapeNetDataset
test_num_workers: 0
test_kwargs:
  onet_base_path: ../data/shapenet/
  resolution: 512
  categories:
  - '03001627'
  custom_name: shapenet
  split: test
  random_seed: fixed
remain_h: false
pretrained_weight: null
use_input_color: false
with_color_branch: false
with_normal_branch: true
with_semantic_branch: false

====================================== <<<<

Sanity Checking: 0it [00:00, ?it/s]/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 128 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
dk-process-data-master-0:84258:86092 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
dk-process-data-master-0:84258:86092 [0] NCCL INFO P2P plugin IBext
dk-process-data-master-0:84258:86092 [0] NCCL INFO NET/IB : No device found.
dk-process-data-master-0:84258:86092 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
dk-process-data-master-0:84258:86092 [0] NCCL INFO NET/IB : No device found.
dk-process-data-master-0:84258:86092 [0] NCCL INFO NET/Socket : Using [0]eth0:172.16.28.236<0>
dk-process-data-master-0:84258:86092 [0] NCCL INFO Using non-device net plugin version 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Using network Socket
dk-process-data-master-0:84258:86092 [0] NCCL INFO comm 0x561eb43b3e00 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId ad000 commId 0x68b3dc29606196e0 - Init START
dk-process-data-master-0:84258:86092 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
dk-process-data-master-0:84258:86092 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff,00000000
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 00/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 01/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 02/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 03/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 04/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 05/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 06/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 07/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 08/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 09/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 10/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 11/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 12/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 13/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 14/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 15/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 16/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 17/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 18/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 19/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 20/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 21/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 22/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 23/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 24/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 25/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 26/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 27/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 28/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 29/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 30/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 31/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
dk-process-data-master-0:84258:86092 [0] NCCL INFO P2P Chunksize set to 131072
dk-process-data-master-0:84258:86092 [0] NCCL INFO Connected all rings
dk-process-data-master-0:84258:86092 [0] NCCL INFO Connected all trees
dk-process-data-master-0:84258:86092 [0] NCCL INFO 32 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
dk-process-data-master-0:84258:86092 [0] NCCL INFO comm 0x561eb43b3e00 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId ad000 commId 0x68b3dc29606196e0 - Init COMPLETE

Sanity Checking: 0%| | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/utilities/data.py:84: UserWarning: Trying to infer the batch_size from an ambiguous collection. The batch size we found is 1016724. To avoid any miscalculations, use self.log(..., batch_size=batch_size).
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:232: UserWarning: You called self.log('val_step', ...) in your validation_step but the value needs to be floating point. Converting it to torch.float32.
warning_cache.warn(

Sanity Checking DataLoader 0: 50%|█████ | 1/2 [00:02<00:02, 2.27s/it]/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/utilities/data.py:84: UserWarning: Trying to infer the batch_size from an ambiguous collection. The batch size we found is 507080. To avoid any miscalculations, use self.log(..., batch_size=batch_size).
warning_cache.warn(

Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:02<00:00, 1.32s/it]/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_metric/struct-acc-2', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_metric/struct-acc-1', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_metric/struct-acc-0', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_loss/struct-2', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_loss/struct-1', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_loss/struct-0', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_loss/normal', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_loss/kld', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_loss/mu-0', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_loss/logvar-0', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_loss/kld-true-0', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_loss/kld-total-0', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_loss', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_step', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(

/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 128 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(

Training: 594it [00:00, ?it/s]
Training: 0%| | 0/6271 [00:00<00:00, -20590219.64it/s]
Epoch 100: 0%| | 0/6271 [00:00<?, ?it/s] /root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:232: UserWarning: You called self.log('val_step', ...) in your training_step but the value needs to be floating point. Converting it to torch.float32.
warning_cache.warn(

Epoch 100: 0%| | 1/6271 [00:01<3:04:46, 1.77s/it]
Epoch 100: 0%| | 1/6271 [00:01<3:04:54, 1.77s/it, loss=18.8, v_num=2fj3]
Epoch 100: 0%| | 2/6271 [00:03<3:27:35, 1.99s/it, loss=18.8, v_num=2fj3]
Epoch 100: 0%| | 2/6271 [00:03<3:27:39, 1.99s/it, loss=26.3, v_num=2fj3]
Epoch 100: 0%| | 3/6271 [00:04<2:45:00, 1.58s/it, loss=26.3, v_num=2fj3]
Epoch 100: 0%| | 3/6271 [00:04<2:45:02, 1.58s/it, loss=23.6, v_num=2fj3]
Epoch 100: 0%| | 4/6271 [00:05<2:14:35, 1.29s/it, loss=23.6, v_num=2fj3]
Epoch 100: 0%| | 4/6271 [00:05<2:14:37, 1.29s/it, loss=23.9, v_num=2fj3]
Epoch 100: 0%| | 5/6271 [00:05<1:56:01, 1.11s/it, loss=23.9, v_num=2fj3]
Epoch 100: 0%| | 5/6271 [00:05<1:56:03, 1.11s/it, loss=23.5, v_num=2fj3]
Epoch 100: 0%| | 6/6271 [00:07<2:12:47, 1.27s/it, loss=23.5, v_num=2fj3]
Epoch 100: 0%| | 6/6271 [00:07<2:12:48, 1.27s/it, loss=24.2, v_num=2fj3]
Epoch 100: 0%| | 7/6271 [00:08<2:06:06, 1.21s/it, loss=24.2, v_num=2fj3]
Epoch 100: 0%| | 7/6271 [00:08<2:06:07, 1.21s/it, loss=24.2, v_num=2fj3]
Epoch 100: 0%| | 8/6271 [00:08<1:56:14, 1.11s/it, loss=24.2, v_num=2fj3]
Epoch 100: 0%| | 8/6271 [00:08<1:56:15, 1.11s/it, loss=23.1, v_num=2fj3]
Epoch 100: 0%| | 9/6271 [00:09<1:52:52, 1.08s/it, loss=23.1, v_num=2fj3]
Epoch 100: 0%| | 9/6271 [00:09<1:52:53, 1.08s/it, loss=23, v_num=2fj3]
Epoch 100: 0%| | 10/6271 [00:10<1:52:31, 1.08s/it, loss=23, v_num=2fj3]
Epoch 100: 0%| | 10/6271 [00:10<1:52:31, 1.08s/it, loss=22.9, v_num=2fj3]
Epoch 100: 0%| | 11/6271 [00:11<1:48:02, 1.04s/it, loss=22.9, v_num=2fj3]
Epoch 100: 0%| | 11/6271 [00:11<1:48:02, 1.04s/it, loss=22.5, v_num=2fj3]
Epoch 100: 0%| | 12/6271 [00:11<1:42:54, 1.01it/s, loss=22.5, v_num=2fj3]
Epoch 100: 0%| | 12/6271 [00:11<1:42:54, 1.01it/s, loss=23, v_num=2fj3]
Epoch 100: 0%| | 13/6271 [00:12<1:39:30, 1.05it/s, loss=23, v_num=2fj3]
Epoch 100: 0%| | 13/6271 [00:12<1:39:30, 1.05it/s, loss=22.7, v_num=2fj3]
Epoch 100: 0%| | 14/6271 [00:13<1:41:52, 1.02it/s, loss=22.7, v_num=2fj3]
Epoch 100: 0%| | 14/6271 [00:13<1:41:52, 1.02it/s, loss=22.8, v_num=2fj3]
Epoch 100: 0%| | 15/6271 [00:14<1:38:20, 1.06it/s, loss=22.8, v_num=2fj3]
Epoch 100: 0%| | 15/6271 [00:14<1:38:21, 1.06it/s, loss=22.2, v_num=2fj3]
Epoch 100: 0%| | 16/6271 [00:15<1:42:53, 1.01it/s, loss=22.2, v_num=2fj3]
Epoch 100: 0%| | 16/6271 [00:15<1:42:53, 1.01it/s, loss=23.2, v_num=2fj3]
Epoch 100: 0%| | 17/6271 [00:16<1:40:30, 1.04it/s, loss=23.2, v_num=2fj3]
Epoch 100: 0%| | 17/6271 [00:16<1:40:30, 1.04it/s, loss=22.7, v_num=2fj3]
Epoch 100: 0%| | 18/6271 [00:18<1:45:25, 1.01s/it, loss=22.7, v_num=2fj3]
Epoch 100: 0%| | 18/6271 [00:18<1:45:25, 1.01s/it, loss=23.6, v_num=2fj3]
Epoch 100: 0%| | 19/6271 [00:19<1:47:14, 1.03s/it, loss=23.6, v_num=2fj3]
Epoch 100: 0%| | 19/6271 [00:19<1:47:15, 1.03s/it, loss=23.5, v_num=2fj3]
Epoch 100: 0%| | 20/6271 [00:20<1:44:26, 1.00s/it, loss=23.5, v_num=2fj3]
Epoch 100: 0%| | 20/6271 [00:20<1:44:26, 1.00s/it, loss=23.4, v_num=2fj3]
Epoch 100: 0%| | 21/6271 [00:21<1:44:16, 1.00s/it, loss=23.4, v_num=2fj3]
Epoch 100: 0%| | 21/6271 [00:21<1:44:17, 1.00s/it, loss=23.3, v_num=2fj3]
Epoch 100: 0%| | 22/6271 [00:22<1:45:58, 1.02s/it, loss=23.3, v_num=2fj3]
Epoch 100: 0%| | 22/6271 [00:22<1:45:59, 1.02s/it, loss=22.7, v_num=2fj3]
Epoch 100: 0%| | 23/6271 [00:22<1:43:30, 1.01it/s, loss=22.7, v_num=2fj3]
Epoch 100: 0%| | 23/6271 [00:22<1:43:30, 1.01it/s, loss=22.8, v_num=2fj3]
Epoch 100: 0%| | 24/6271 [00:24<1:44:31, 1.00s/it, loss=22.8, v_num=2fj3]
Epoch 100: 0%| | 24/6271 [00:24<1:44:31, 1.00s/it, loss=23.1, v_num=2fj3]
Epoch 100: 0%| | 25/6271 [00:24<1:42:20, 1.02it/s, loss=23.1, v_num=2fj3]
Epoch 100: 0%| | 25/6271 [00:24<1:42:21, 1.02it/s, loss=22.9, v_num=2fj3]
Epoch 100: 0%| | 26/6271 [00:25<1:40:29, 1.04it/s, loss=22.9, v_num=2fj3]
Epoch 100: 0%| | 26/6271 [00:25<1:40:30, 1.04it/s, loss=22.6, v_num=2fj3]
Epoch 100: 0%| | 27/6271 [00:25<1:38:16, 1.06it/s, loss=22.6, v_num=2fj3]
Epoch 100: 0%| | 27/6271 [00:25<1:38:16, 1.06it/s, loss=22.5, v_num=2fj3]
Epoch 100: 0%| | 28/6271 [00:26<1:36:59, 1.07it/s, loss=22.5, v_num=2fj3]
Epoch 100: 0%| | 28/6271 [00:26<1:36:59, 1.07it/s, loss=22.6, v_num=2fj3]
Epoch 100: 0%| | 29/6271 [00:26<1:35:29, 1.09it/s, loss=22.6, v_num=2fj3]
Epoch 100: 0%| | 29/6271 [00:26<1:35:30, 1.09it/s, loss=22.2, v_num=2fj3]
Epoch 100: 0%| | 30/6271 [00:27<1:34:08, 1.10it/s, loss=22.2, v_num=2fj3]
Epoch 100: 0%| | 30/6271 [00:27<1:34:08, 1.10it/s, loss=22.3, v_num=2fj3]
Epoch 100: 0%| | 31/6271 [00:29<1:37:20, 1.07it/s, loss=22.3, v_num=2fj3]
Epoch 100: 0%| | 31/6271 [00:29<1:37:20, 1.07it/s, loss=22.8, v_num=2fj3][rank0]:[W reducer.cpp:1360] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Traceback (most recent call last):
File "/mnt/pfs/users/dengken/code/XCube/train.py", line 407, in
trainer.fit(net_model, ckpt_path=last_ckpt_path)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
self._run_train()
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
self.fit_loop.run()
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 213, in advance
batch_output = self.batch_loop.run(kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 202, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 249, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 370, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1742, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 280, in optimizer_step
optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 234, in optimizer_step
return self.precision_plugin.optimizer_step(
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 119, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
return wrapped(*args, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/torch/optim/optimizer.py", line 385, in wrapper
out = func(*args, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
ret = func(self, *args, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/torch/optim/adamw.py", line 187, in step
adamw(
File "/root/miniconda3/lib/python3.9/site-packages/torch/optim/adamw.py", line 339, in adamw
func(
File "/root/miniconda3/lib/python3.9/site-packages/torch/optim/adamw.py", line 549, in _multi_tensor_adamw
torch._foreach_lerp_(device_exp_avgs, device_grads, 1 - beta1)
RuntimeError: The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1
Training Finished. Best path = ../wandb/xcube-shapenet/afca2fj3/checkpoints/epoch=000100-step=000029700.ckpt
wandb: - 0.014 MB of 0.014 MB uploaded
wandb: \ 0.019 MB of 0.042 MB uploaded
wandb: | 0.036 MB of 0.042 MB uploaded
wandb: 🚀 View run chair_VAE_sparse/512_to_128-kld-1.0 at: https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet/runs/afca2fj3
wandb: ⭐️ View project at: https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ../wandb/wandb/run-20240727_143927-afca2fj3/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with wandb.require("core")! See https://wandb.me/wandb-core for more information.
Exception ignored in: <function tqdm.__del__ at 0x7f5980fb2ca0>
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/tqdm/std.py", line 1152, in __del__
File "/root/miniconda3/lib/python3.9/site-packages/tqdm/std.py", line 1306, in close
File "/root/miniconda3/lib/python3.9/site-packages/tqdm/std.py", line 1499, in display
File "/root/miniconda3/lib/python3.9/site-packages/tqdm/std.py", line 1155, in __str__
File "/root/miniconda3/lib/python3.9/site-packages/tqdm/std.py", line 1457, in format_dict
TypeError: cannot unpack non-iterable NoneType object
dk-process-data-master-0:84258:86123 [0] NCCL INFO [Service thread] Connection closed by localRank 0
dk-process-data-master-0:84258:84258 [0] NCCL INFO comm 0x561eb43b3e00 rank 0 nranks 1 cudaDev 0 busId ad000 - Abort COMPLETE

@xrenaa
Collaborator

xrenaa commented Jul 29, 2024

Hi, could you watch your GPU memory usage with "watch nvidia-smi"?
Note that the first error is an illegal memory access, while the second error is "The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1".
By the way, I am also using an 80GB A100. If the problem is truly an OOM issue, I would suggest removing the sample that triggers the OOM.
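
If it helps, you can also log the peak CUDA memory from inside the training loop. A minimal sketch, assuming a LightningModule (not code from this repo):

```python
import torch

def log_peak_cuda_memory(pl_module, prefix: str = "mem"):
    """Log peak allocated/reserved CUDA memory (GiB) and reset the counters.

    `pl_module` is assumed to be a LightningModule, so self.log is available.
    """
    peak_alloc = torch.cuda.max_memory_allocated() / 2**30
    peak_reserved = torch.cuda.max_memory_reserved() / 2**30
    pl_module.log(f"{prefix}/peak_alloc_gib", peak_alloc)
    pl_module.log(f"{prefix}/peak_reserved_gib", peak_reserved)
    torch.cuda.reset_peak_memory_stats()
```

Calling this at the end of training_step should make it easy to see which batches get close to the 80GB limit.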

@VLadImirluren
Author

Hi, could you watch your GPU memory usage with "watch nvidia-smi"? Note that the first error is an illegal memory access, while the second error is "The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1". By the way, I am also using an 80GB A100. If the problem is truly an OOM issue, I would suggest removing the sample that triggers the OOM.

I always monitor with "watch -n 0.1 nvidia-smi".

For the first error, I have hit it many times. After it occurs, the program prints
"[rank0]:[W CUDAGuardImpl.h:115] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
Aborted (core dumped)",
"watch -n 0.1 nvidia-smi" freezes at
[screenshot of the frozen nvidia-smi output]
and the GPU memory is not released, so I need to run "pkill -f python" to free it.
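
One thing I plan to try, to get a more precise stack trace for the illegal memory access, is forcing synchronous kernel launches (a standard CUDA/PyTorch debugging setting, not something specific to XCube):

```python
import os

# Must be set before CUDA is initialised (i.e. before the first torch.cuda call),
# so either export it in the shell or put it at the very top of train.py.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

With asynchronous launches, the "illegal memory access" is usually reported far from the kernel that actually failed.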

For the second problem, I don't know the reason, because I just load the ckpt and fine-tune it (no code changed at all).
The ckpt is the one you provided, so I have no way to tell whether it has any problem for fine-tuning.

I am not sure it is an OOM problem, because "[rank0]:[W CUDAGuardImpl.h:115] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)" gives no information about it, and there is no log file either.
I believe you have met this problem before; could you please give me some guidance or hints about it?

Lastly, I will try the solution you suggested: removing the sample that triggers the OOM.
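
For that, a sketch of how I could drop known-bad samples without touching the dataset class itself (the indices below are hypothetical):

```python
from torch.utils.data import Subset

# indices of samples that previously triggered OOM (hypothetical values)
bad_indices = {1234, 5678}

def drop_bad_samples(dataset):
    keep = [i for i in range(len(dataset)) if i not in bad_indices]
    return Subset(dataset, keep)

# e.g. train_dataset = drop_bad_samples(ShapeNetDataset(...)) before building the DataLoader
```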

Thanks for your reply

@VLadImirluren
Author

Hi, could you watch your GPU memory usage with "watch nvidia-smi"? Note that the first error is an illegal memory access, while the second error is "The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1". By the way, I am also using an 80GB A100. If the problem is truly an OOM issue, I would suggest removing the sample that triggers the OOM.

The link to the ShapeNet dataset is empty.
Please at least release the data list...

@xrenaa
Collaborator

xrenaa commented Jul 31, 2024

Could you try https://drive.google.com/file/d/1PQmSomS1B7UR7wNuqp5RtgkdXo7stKzG/view?usp=sharing?
@VLadImirluren
Author

Hi, could you watch your GPU memory usage with "watch nvidia-smi"? Note that the first error is an illegal memory access, while the second error is "The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1". By the way, I am also using an 80GB A100. If the problem is truly an OOM issue, I would suggest removing the sample that triggers the OOM.

I just tried your suggestion.
It does not fix the problem, because the issue is not tied to any particular sample; it is about the whole effective batch (batch_size * gradient_accumulation).
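
To make the terminology concrete: with batch_size=1 and accumulate_grad_batches=32, one optimizer step covers 32 micro-batches. Roughly (a simplified sketch of gradient accumulation, not Lightning's actual code):

```python
accumulate_grad_batches = 32  # from the config; batch_size = 1

for i, batch in enumerate(train_loader):
    loss = model(batch) / accumulate_grad_batches
    loss.backward()                      # gradients accumulate in the .grad buffers
    if (i + 1) % accumulate_grad_batches == 0:
        optimizer.step()                 # one optimizer step per 32 samples
        optimizer.zero_grad()
```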

@VLadImirluren
Author

Could you try https://drive.google.com/file/d/1PQmSomS1B7UR7wNuqp5RtgkdXo7stKzG/view?usp=sharing?

I have sent an access request; please check it. Thanks.

@LeoDarcy

LeoDarcy commented Oct 9, 2024

Hi, could you watch your GPU memory usage with "watch nvidia-smi"? Note that the first error is an illegal memory access, while the second error is "The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1". By the way, I am also using an 80GB A100. If the problem is truly an OOM issue, I would suggest removing the sample that triggers the OOM.

I just tried your suggestion. It does not fix the problem, because the issue is not tied to any particular sample; it is about the whole effective batch (batch_size * gradient_accumulation).

Hi, I have the same problem. Have you solved it? I have tried reducing the batch size and removing some samples, but it doesn't work. It occurs in the second epoch.

@VLadImirluren
Author

Hi, could you watch your GPU memory usage with "watch nvidia-smi"? Note that the first error is an illegal memory access, while the second error is "The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1". By the way, I am also using an 80GB A100. If the problem is truly an OOM issue, I would suggest removing the sample that triggers the OOM.

I just tried your suggestion. It does not fix the problem, because the issue is not tied to any particular sample; it is about the whole effective batch (batch_size * gradient_accumulation).

Hi, I have the same problem. Have you solved it? I have tried reducing the batch size and removing some samples, but it doesn't work. It occurs in the second epoch.

NO!
