GPU memory cost #19

Open
VLadImirluren opened this issue Jul 27, 2024 · 10 comments

Comments

@VLadImirluren

Could you share the memory cost for the different stages and datasets?

I am running into a problem where training the fine VAE on an 80GB GPU with batch_size=1 goes OOM at epoch 0.

@VLadImirluren
Author

I met this problem:
[screenshot of the error]

I know you have written code to skip OOM batches:
[screenshot of the OOM-skipping code]

But it only prints a warning: there is no error, no failure, and no termination, yet the GPU memory is not released and training does not move forward...

How should I deal with this problem?
(Other than handling it by hand. I am reproducing the results shown in the paper, so loading your checkpoint is not a solution either.)
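
For reference, this is roughly the pattern I would expect the skip logic to need: a minimal sketch assuming a PyTorch Lightning training_step, with a hypothetical compute_loss helper, not the actual XCube code:

```python
import gc
import torch
import pytorch_lightning as pl

class FineVAEModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        try:
            # compute_loss is a hypothetical stand-in for the real forward + loss
            return self.compute_loss(batch)
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise
            # Only warning is not enough: tensors created before the failure
            # still hold GPU memory until they are garbage collected and the
            # caching allocator gives its blocks back.
            print(f"OOM on batch {batch_idx}, skipping it")
            del batch
            gc.collect()
            torch.cuda.empty_cache()
            return None  # Lightning skips the optimizer step for this batch
```

Without explicit cleanup like this, the next batch starts with the GPU already full, which would explain the stall I am seeing.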

@VLadImirluren
Author

Fine-tuning from your ckpt still hits a RuntimeError:

nohup: ignoring input
2024-07-27 14:39:26.122 | INFO | __main__:<module>:171 - This is train_auto.py! Please note that you should use 300 instead of 300.0 for resuming.
git root error: Cmd('git') failed due to: exit code(128)
cmdline: git rev-parse --show-toplevel
stderr: 'fatal: detected dubious ownership in repository at '/mnt/pfs/users/dengken/code/XCube'
To add an exception for this directory, call:

git config --global --add safe.directory /mnt/pfs/users/dengken/code/XCube'

git root error: Cmd('git') failed due to: exit code(128)
cmdline: git rev-parse --show-toplevel
stderr: 'fatal: detected dubious ownership in repository at '/mnt/pfs/users/dengken/code/XCube'
To add an exception for this directory, call:

git config --global --add safe.directory /mnt/pfs/users/dengken/code/XCube'

wandb: Currently logged in as: 13532152291 (13532152291-sun-yat-sen-university). Use wandb login --relogin to force relogin
wandb: - Waiting for wandb.init()...
wandb: \ Waiting for wandb.init()...
wandb: Tracking run with wandb version 0.17.3
wandb: Run data is saved locally in ../wandb/wandb/run-20240727_143927-afca2fj3
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run chair_VAE_sparse/512_to_128-kld-1.0
wandb: ⭐️ View project at https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet
wandb: 🚀 View run at https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet/runs/afca2fj3
[rank: 0] Global seed set to 0
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2024-07-27 14:39:42.125 | INFO | xcube.modules.autoencoding.sunet:init:241 - latent dim: 8
[rank: 0] Global seed set to 0
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1

distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

dk-process-data-master-0:84258:84258 [0] NCCL INFO Bootstrap : Using eth0:172.16.28.236<0>
dk-process-data-master-0:84258:84258 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
dk-process-data-master-0:84258:84258 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
dk-process-data-master-0:84258:84258 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
dk-process-data-master-0:84258:84258 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
dk-process-data-master-0:84258:84258 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda12.3
Restoring states from the checkpoint path at /mnt/pfs/users/dengken/code/XCube/checkpoints/chair_download/fine_vae/last.ckpt
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1452: UserWarning: Be aware that when using ckpt_path, callbacks used to create the checkpoint need to be provided during Trainer instantiation. Please add the following callbacks: ["ModelCheckpoint{'monitor': 'val_step', 'mode': 'max', 'every_n_train_steps': 5000, 'every_n_epochs': 0, 'train_time_interval': None}"].
rank_zero_warn(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type                | Params
--------------------------------------------
0 | encoder | Encoder             | 1.2 K
1 | unet    | StructPredictionNet | 3.8 M
2 | loss    | Loss                | 0
--------------------------------------------
3.8 M     Trainable params
0         Non-trainable params
3.8 M     Total params
15.203    Total estimated model params size (MB)
Restored all states from the checkpoint file at /mnt/pfs/users/dengken/code/XCube/checkpoints/chair_download/fine_vae/last.ckpt

======= MODEL HYPER-PARAMETERS ======= <<<<
exec: null
include: null
test_set_shuffle: false
batch_size: 1
accumulate_grad_batches: 32
visualize: false
name: shapenet/chair_VAE_sparse
model: autoencoder
tree_depth: 3
voxel_size:
- 0.0025
- 0.0025
- 0.0025
resolution: 512
use_fvdb_loader: true
use_hash_tree: true
use_input_normal: true
use_input_semantic: false
use_input_intensity: false
cut_ratio: 16
kl_weight: 1.0
normalize_kld: true
enable_anneal: false
kl_weight_min: 1.0e-07
kl_weight_max: 1.0
anneal_star_iter: 0
anneal_end_iter: 70000
supervision:
  structure_weight: 20.0
  normal_weight: 300.0
  color_weight: 0.0
  semantic_weight: 0.0
optimizer: Adam
learning_rate:
  init: 0.0001
  decay_mult: 0.7
  decay_step: 50000
  clip: 1.0e-06
weight_decay: 0.0
grad_clip: 0.5
network:
  encoder:
    c_dim: 32
  unet:
    target: StructPredictionNet
    params:
      in_channels: 32
      num_blocks: 3
      f_maps: 32
      neck_dense_type: UNCHANGED
      neck_bound:
      - 64
      - 64
      - 64
      num_res_blocks: 1
      use_residual: false
      order: gcr
      is_add_dec: false
      use_attention: false
      use_checkpoint: false
_shapenet_path: ../data/shapenet/
_shapenet_categories:
- '03001627'
_shapenet_custom_name: shapenet
train_dataset: ShapeNetDataset
train_val_num_workers: 0
train_kwargs:
  onet_base_path: ../data/shapenet/
  resolution: 512
  categories:
  - '03001627'
  custom_name: shapenet
  split: train
  random_seed: 0
val_dataset: ShapeNetDataset
val_kwargs:
  onet_base_path: ../data/shapenet/
  resolution: 512
  categories:
  - '03001627'
  custom_name: shapenet
  split: val
  random_seed: fixed
test_dataset: ShapeNetDataset
test_num_workers: 0
test_kwargs:
  onet_base_path: ../data/shapenet/
  resolution: 512
  categories:
  - '03001627'
  custom_name: shapenet
  split: test
  random_seed: fixed
remain_h: false
pretrained_weight: null
use_input_color: false
with_color_branch: false
with_normal_branch: true
with_semantic_branch: false

====================================== <<<<

Sanity Checking: 0it [00:00, ?it/s]/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 128 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
dk-process-data-master-0:84258:86092 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
dk-process-data-master-0:84258:86092 [0] NCCL INFO P2P plugin IBext
dk-process-data-master-0:84258:86092 [0] NCCL INFO NET/IB : No device found.
dk-process-data-master-0:84258:86092 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
dk-process-data-master-0:84258:86092 [0] NCCL INFO NET/IB : No device found.
dk-process-data-master-0:84258:86092 [0] NCCL INFO NET/Socket : Using [0]eth0:172.16.28.236<0>
dk-process-data-master-0:84258:86092 [0] NCCL INFO Using non-device net plugin version 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Using network Socket
dk-process-data-master-0:84258:86092 [0] NCCL INFO comm 0x561eb43b3e00 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId ad000 commId 0x68b3dc29606196e0 - Init START
dk-process-data-master-0:84258:86092 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
dk-process-data-master-0:84258:86092 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff,00000000
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 00/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 01/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 02/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 03/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 04/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 05/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 06/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 07/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 08/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 09/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 10/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 11/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 12/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 13/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 14/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 15/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 16/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 17/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 18/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 19/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 20/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 21/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 22/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 23/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 24/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 25/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 26/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 27/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 28/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 29/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 30/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Channel 31/32 : 0
dk-process-data-master-0:84258:86092 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
dk-process-data-master-0:84258:86092 [0] NCCL INFO P2P Chunksize set to 131072
dk-process-data-master-0:84258:86092 [0] NCCL INFO Connected all rings
dk-process-data-master-0:84258:86092 [0] NCCL INFO Connected all trees
dk-process-data-master-0:84258:86092 [0] NCCL INFO 32 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
dk-process-data-master-0:84258:86092 [0] NCCL INFO comm 0x561eb43b3e00 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId ad000 commId 0x68b3dc29606196e0 - Init COMPLETE

Sanity Checking: 0%| | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/utilities/data.py:84: UserWarning: Trying to infer the batch_size from an ambiguous collection. The batch size we found is 1016724. To avoid any miscalculations, use self.log(..., batch_size=batch_size).
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:232: UserWarning: You called self.log('val_step', ...) in your validation_step but the value needs to be floating point. Converting it to torch.float32.
warning_cache.warn(

Sanity Checking DataLoader 0: 50%|█████ | 1/2 [00:02<00:02, 2.27s/it]/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/utilities/data.py:84: UserWarning: Trying to infer the batch_size from an ambiguous collection. The batch size we found is 507080. To avoid any miscalculations, use self.log(..., batch_size=batch_size).
warning_cache.warn(

Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:02<00:00, 1.32s/it]/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_metric/struct-acc-2', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_metric/struct-acc-1', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_metric/struct-acc-0', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_loss/struct-2', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_loss/struct-1', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_loss/struct-0', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_loss/normal', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_loss/kld', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_loss/mu-0', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_loss/logvar-0', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_loss/kld-true-0', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_loss/kld-total-0', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_loss', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use self.log('val_step', ..., sync_dist=True) when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(

/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 128 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(

Training: 594it [00:00, ?it/s]
Training: 0%| | 0/6271 [00:00<00:00, -20590219.64it/s]
Epoch 100: 0%| | 0/6271 [00:00<?, ?it/s] /root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:232: UserWarning: You called self.log('val_step', ...) in your training_step but the value needs to be floating point. Converting it to torch.float32.
warning_cache.warn(

Epoch 100: 0%| | 1/6271 [00:01<3:04:46, 1.77s/it]
Epoch 100: 0%| | 1/6271 [00:01<3:04:54, 1.77s/it, loss=18.8, v_num=2fj3]
Epoch 100: 0%| | 2/6271 [00:03<3:27:35, 1.99s/it, loss=18.8, v_num=2fj3]
Epoch 100: 0%| | 2/6271 [00:03<3:27:39, 1.99s/it, loss=26.3, v_num=2fj3]
Epoch 100: 0%| | 3/6271 [00:04<2:45:00, 1.58s/it, loss=26.3, v_num=2fj3]
Epoch 100: 0%| | 3/6271 [00:04<2:45:02, 1.58s/it, loss=23.6, v_num=2fj3]
Epoch 100: 0%| | 4/6271 [00:05<2:14:35, 1.29s/it, loss=23.6, v_num=2fj3]
Epoch 100: 0%| | 4/6271 [00:05<2:14:37, 1.29s/it, loss=23.9, v_num=2fj3]
Epoch 100: 0%| | 5/6271 [00:05<1:56:01, 1.11s/it, loss=23.9, v_num=2fj3]
Epoch 100: 0%| | 5/6271 [00:05<1:56:03, 1.11s/it, loss=23.5, v_num=2fj3]
Epoch 100: 0%| | 6/6271 [00:07<2:12:47, 1.27s/it, loss=23.5, v_num=2fj3]
Epoch 100: 0%| | 6/6271 [00:07<2:12:48, 1.27s/it, loss=24.2, v_num=2fj3]
Epoch 100: 0%| | 7/6271 [00:08<2:06:06, 1.21s/it, loss=24.2, v_num=2fj3]
Epoch 100: 0%| | 7/6271 [00:08<2:06:07, 1.21s/it, loss=24.2, v_num=2fj3]
Epoch 100: 0%| | 8/6271 [00:08<1:56:14, 1.11s/it, loss=24.2, v_num=2fj3]
Epoch 100: 0%| | 8/6271 [00:08<1:56:15, 1.11s/it, loss=23.1, v_num=2fj3]
Epoch 100: 0%| | 9/6271 [00:09<1:52:52, 1.08s/it, loss=23.1, v_num=2fj3]
Epoch 100: 0%| | 9/6271 [00:09<1:52:53, 1.08s/it, loss=23, v_num=2fj3]
Epoch 100: 0%| | 10/6271 [00:10<1:52:31, 1.08s/it, loss=23, v_num=2fj3]
Epoch 100: 0%| | 10/6271 [00:10<1:52:31, 1.08s/it, loss=22.9, v_num=2fj3]
Epoch 100: 0%| | 11/6271 [00:11<1:48:02, 1.04s/it, loss=22.9, v_num=2fj3]
Epoch 100: 0%| | 11/6271 [00:11<1:48:02, 1.04s/it, loss=22.5, v_num=2fj3]
Epoch 100: 0%| | 12/6271 [00:11<1:42:54, 1.01it/s, loss=22.5, v_num=2fj3]
Epoch 100: 0%| | 12/6271 [00:11<1:42:54, 1.01it/s, loss=23, v_num=2fj3]
Epoch 100: 0%| | 13/6271 [00:12<1:39:30, 1.05it/s, loss=23, v_num=2fj3]
Epoch 100: 0%| | 13/6271 [00:12<1:39:30, 1.05it/s, loss=22.7, v_num=2fj3]
Epoch 100: 0%| | 14/6271 [00:13<1:41:52, 1.02it/s, loss=22.7, v_num=2fj3]
Epoch 100: 0%| | 14/6271 [00:13<1:41:52, 1.02it/s, loss=22.8, v_num=2fj3]
Epoch 100: 0%| | 15/6271 [00:14<1:38:20, 1.06it/s, loss=22.8, v_num=2fj3]
Epoch 100: 0%| | 15/6271 [00:14<1:38:21, 1.06it/s, loss=22.2, v_num=2fj3]
Epoch 100: 0%| | 16/6271 [00:15<1:42:53, 1.01it/s, loss=22.2, v_num=2fj3]
Epoch 100: 0%| | 16/6271 [00:15<1:42:53, 1.01it/s, loss=23.2, v_num=2fj3]
Epoch 100: 0%| | 17/6271 [00:16<1:40:30, 1.04it/s, loss=23.2, v_num=2fj3]
Epoch 100: 0%| | 17/6271 [00:16<1:40:30, 1.04it/s, loss=22.7, v_num=2fj3]
Epoch 100: 0%| | 18/6271 [00:18<1:45:25, 1.01s/it, loss=22.7, v_num=2fj3]
Epoch 100: 0%| | 18/6271 [00:18<1:45:25, 1.01s/it, loss=23.6, v_num=2fj3]
Epoch 100: 0%| | 19/6271 [00:19<1:47:14, 1.03s/it, loss=23.6, v_num=2fj3]
Epoch 100: 0%| | 19/6271 [00:19<1:47:15, 1.03s/it, loss=23.5, v_num=2fj3]
Epoch 100: 0%| | 20/6271 [00:20<1:44:26, 1.00s/it, loss=23.5, v_num=2fj3]
Epoch 100: 0%| | 20/6271 [00:20<1:44:26, 1.00s/it, loss=23.4, v_num=2fj3]
Epoch 100: 0%| | 21/6271 [00:21<1:44:16, 1.00s/it, loss=23.4, v_num=2fj3]
Epoch 100: 0%| | 21/6271 [00:21<1:44:17, 1.00s/it, loss=23.3, v_num=2fj3]
Epoch 100: 0%| | 22/6271 [00:22<1:45:58, 1.02s/it, loss=23.3, v_num=2fj3]
Epoch 100: 0%| | 22/6271 [00:22<1:45:59, 1.02s/it, loss=22.7, v_num=2fj3]
Epoch 100: 0%| | 23/6271 [00:22<1:43:30, 1.01it/s, loss=22.7, v_num=2fj3]
Epoch 100: 0%| | 23/6271 [00:22<1:43:30, 1.01it/s, loss=22.8, v_num=2fj3]
Epoch 100: 0%| | 24/6271 [00:24<1:44:31, 1.00s/it, loss=22.8, v_num=2fj3]
Epoch 100: 0%| | 24/6271 [00:24<1:44:31, 1.00s/it, loss=23.1, v_num=2fj3]
Epoch 100: 0%| | 25/6271 [00:24<1:42:20, 1.02it/s, loss=23.1, v_num=2fj3]
Epoch 100: 0%| | 25/6271 [00:24<1:42:21, 1.02it/s, loss=22.9, v_num=2fj3]
Epoch 100: 0%| | 26/6271 [00:25<1:40:29, 1.04it/s, loss=22.9, v_num=2fj3]
Epoch 100: 0%| | 26/6271 [00:25<1:40:30, 1.04it/s, loss=22.6, v_num=2fj3]
Epoch 100: 0%| | 27/6271 [00:25<1:38:16, 1.06it/s, loss=22.6, v_num=2fj3]
Epoch 100: 0%| | 27/6271 [00:25<1:38:16, 1.06it/s, loss=22.5, v_num=2fj3]
Epoch 100: 0%| | 28/6271 [00:26<1:36:59, 1.07it/s, loss=22.5, v_num=2fj3]
Epoch 100: 0%| | 28/6271 [00:26<1:36:59, 1.07it/s, loss=22.6, v_num=2fj3]
Epoch 100: 0%| | 29/6271 [00:26<1:35:29, 1.09it/s, loss=22.6, v_num=2fj3]
Epoch 100: 0%| | 29/6271 [00:26<1:35:30, 1.09it/s, loss=22.2, v_num=2fj3]
Epoch 100: 0%| | 30/6271 [00:27<1:34:08, 1.10it/s, loss=22.2, v_num=2fj3]
Epoch 100: 0%| | 30/6271 [00:27<1:34:08, 1.10it/s, loss=22.3, v_num=2fj3]
Epoch 100: 0%| | 31/6271 [00:29<1:37:20, 1.07it/s, loss=22.3, v_num=2fj3]
Epoch 100: 0%| | 31/6271 [00:29<1:37:20, 1.07it/s, loss=22.8, v_num=2fj3][rank0]:[W reducer.cpp:1360] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Traceback (most recent call last):
File "/mnt/pfs/users/dengken/code/XCube/train.py", line 407, in
trainer.fit(net_model, ckpt_path=last_ckpt_path)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
self._run_train()
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
self.fit_loop.run()
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 213, in advance
batch_output = self.batch_loop.run(kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 202, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 249, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 370, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1742, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 280, in optimizer_step
optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 234, in optimizer_step
return self.precision_plugin.optimizer_step(
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 119, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
return wrapped(*args, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/torch/optim/optimizer.py", line 385, in wrapper
out = func(*args, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
ret = func(self, *args, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/torch/optim/adamw.py", line 187, in step
adamw(
File "/root/miniconda3/lib/python3.9/site-packages/torch/optim/adamw.py", line 339, in adamw
func(
File "/root/miniconda3/lib/python3.9/site-packages/torch/optim/adamw.py", line 549, in _multi_tensor_adamw
torch._foreach_lerp_(device_exp_avgs, device_grads, 1 - beta1)
RuntimeError: The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1
Training Finished. Best path = ../wandb/xcube-shapenet/afca2fj3/checkpoints/epoch=000100-step=000029700.ckpt
wandb: - 0.014 MB of 0.014 MB uploaded
wandb: \ 0.019 MB of 0.042 MB uploaded
wandb: | 0.036 MB of 0.042 MB uploaded
wandb: 🚀 View run chair_VAE_sparse/512_to_128-kld-1.0 at: https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet/runs/afca2fj3
wandb: ⭐️ View project at: https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ../wandb/wandb/run-20240727_143927-afca2fj3/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with wandb.require("core")! See https://wandb.me/wandb-core for more information.
Exception ignored in: <function tqdm.__del__ at 0x7f5980fb2ca0>
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/tqdm/std.py", line 1152, in __del__
File "/root/miniconda3/lib/python3.9/site-packages/tqdm/std.py", line 1306, in close
File "/root/miniconda3/lib/python3.9/site-packages/tqdm/std.py", line 1499, in display
File "/root/miniconda3/lib/python3.9/site-packages/tqdm/std.py", line 1155, in __str__
File "/root/miniconda3/lib/python3.9/site-packages/tqdm/std.py", line 1457, in format_dict
TypeError: cannot unpack non-iterable NoneType object
dk-process-data-master-0:84258:86123 [0] NCCL INFO [Service thread] Connection closed by localRank 0
dk-process-data-master-0:84258:84258 [0] NCCL INFO comm 0x561eb43b3e00 rank 0 nranks 1 cudaDev 0 busId ad000 - Abort COMPLETE

@xrenaa
Collaborator

xrenaa commented Jul 29, 2024

Hi, could you watch your GPU memory usage with "watch nvidia-smi"?
Note that the first error is an illegal memory access, while the second error is "The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1".
By the way, I am also using an 80GB A100. If the problem is truly an OOM issue, I would suggest removing the sample that triggers the OOM.
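
If it helps, you can also log the peak CUDA memory from inside the training loop. A minimal sketch, assuming a LightningModule (not code from this repo):

```python
import torch

def log_peak_cuda_memory(pl_module, prefix: str = "mem"):
    """Log peak allocated/reserved CUDA memory (GiB) and reset the counters.

    `pl_module` is assumed to be a LightningModule, so self.log is available.
    """
    peak_alloc = torch.cuda.max_memory_allocated() / 2**30
    peak_reserved = torch.cuda.max_memory_reserved() / 2**30
    pl_module.log(f"{prefix}/peak_alloc_gib", peak_alloc)
    pl_module.log(f"{prefix}/peak_reserved_gib", peak_reserved)
    torch.cuda.reset_peak_memory_stats()
```

Calling this at the end of training_step should make it easy to see which batches get close to the 80GB limit.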

@VLadImirluren
Author

Hi, could you watch your GPU memory usage with "watch nvidia-smi"? Note that the first error is an illegal memory access, while the second error is "The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1". By the way, I am also using an 80GB A100. If the problem is truly an OOM issue, I would suggest removing the sample that triggers the OOM.

I always monitor with "watch -n 0.1 nvidia-smi".

For the first error, I have hit it many times. After it occurs, the program prints
"[rank0]:[W CUDAGuardImpl.h:115] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
Aborted (core dumped)",
"watch -n 0.1 nvidia-smi" freezes at
[screenshot of the frozen nvidia-smi output]
and the GPU memory is not released, so I need to run "pkill -f python" to free it.
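
One thing I plan to try, to get a more precise stack trace for the illegal memory access, is forcing synchronous kernel launches (a standard CUDA/PyTorch debugging setting, not something specific to XCube):

```python
import os

# Must be set before CUDA is initialised (i.e. before the first torch.cuda call),
# so either export it in the shell or put it at the very top of train.py.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

With asynchronous launches, the "illegal memory access" is usually reported far from the kernel that actually failed.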

For the second problem, I don't know the reason, because I just load the ckpt and fine-tune it (no code changed at all).
The ckpt is the one you provided, so I have no way to tell whether it has any problem for fine-tuning.

I am not sure it is an OOM problem, because "[rank0]:[W CUDAGuardImpl.h:115] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)" gives no information about it, and there is no log file either.
I believe you have met this problem before; could you please give me some guidance or hints about it?

Lastly, I will try the solution you suggested: removing the sample that triggers the OOM.
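
For that, a sketch of how I could drop known-bad samples without touching the dataset class itself (the indices below are hypothetical):

```python
from torch.utils.data import Subset

# indices of samples that previously triggered OOM (hypothetical values)
bad_indices = {1234, 5678}

def drop_bad_samples(dataset):
    keep = [i for i in range(len(dataset)) if i not in bad_indices]
    return Subset(dataset, keep)

# e.g. train_dataset = drop_bad_samples(ShapeNetDataset(...)) before building the DataLoader
```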

Thanks for your reply

@VLadImirluren
Author

Hi, could you watch your GPU memory usage with "watch nvidia-smi"? Note that the first error is an illegal memory access, while the second error is "The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1". By the way, I am also using an 80GB A100. If the problem is truly an OOM issue, I would suggest removing the sample that triggers the OOM.

The link to the ShapeNet dataset is empty.
Please at least release the data list...

@xrenaa
Collaborator

xrenaa commented Jul 31, 2024

Could you try https://drive.google.com/file/d/1PQmSomS1B7UR7wNuqp5RtgkdXo7stKzG/view?usp=sharing?
@VLadImirluren
Author

Hi, could you watch your GPU memory usage with "watch nvidia-smi"? Note that the first error is an illegal memory access, while the second error is "The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1". By the way, I am also using an 80GB A100. If the problem is truly an OOM issue, I would suggest removing the sample that triggers the OOM.

I just tried your suggestion.
It does not fix the problem, because the issue is not tied to any particular sample; it is about the whole effective batch (batch_size * gradient_accumulation).
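
To make the terminology concrete: with batch_size=1 and accumulate_grad_batches=32, one optimizer step covers 32 micro-batches. Roughly (a simplified sketch of gradient accumulation, not Lightning's actual code):

```python
accumulate_grad_batches = 32  # from the config; batch_size = 1

for i, batch in enumerate(train_loader):
    loss = model(batch) / accumulate_grad_batches
    loss.backward()                      # gradients accumulate in the .grad buffers
    if (i + 1) % accumulate_grad_batches == 0:
        optimizer.step()                 # one optimizer step per 32 samples
        optimizer.zero_grad()
```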

@VLadImirluren
Author

Could you try https://drive.google.com/file/d/1PQmSomS1B7UR7wNuqp5RtgkdXo7stKzG/view?usp=sharing?

I have sent an access request; please check it. Thanks.

@LeoDarcy

LeoDarcy commented Oct 9, 2024

Hi, could you watch your GPU memory usage with "watch nvidia-smi"? Note that the first error is an illegal memory access, while the second error is "The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1". By the way, I am also using an 80GB A100. If the problem is truly an OOM issue, I would suggest removing the sample that triggers the OOM.

I just tried your suggestion. It does not fix the problem, because the issue is not tied to any particular sample; it is about the whole effective batch (batch_size * gradient_accumulation).

Hi, I have the same problem. Have you solved it? I have tried reducing the batch size and removing some samples, but it doesn't work. It occurs in the second epoch.

@VLadImirluren
Author

Hi, could you watch your GPU memory usage with "watch nvidia-smi"? Note that the first error is an illegal memory access, while the second error is "The size of tensor a (32) must match the size of tensor b (36) at non-singleton dimension 1". By the way, I am also using an 80GB A100. If the problem is truly an OOM issue, I would suggest removing the sample that triggers the OOM.

I just tried your suggestion. It does not fix the problem, because the issue is not tied to any particular sample; it is about the whole effective batch (batch_size * gradient_accumulation).

Hi, I have the same problem. Have you solved it? I have tried reducing the batch size and removing some samples, but it doesn't work. It occurs in the second epoch.

NO!
