Launch ddp on 8 devices, but only run on the first gpu #16236

Closed
superhero-7 opened this issue Jan 4, 2023 · 18 comments · Fixed by #18137
Labels: strategy: ddp DistributedDataParallel

@superhero-7

superhero-7 commented Jan 4, 2023

Bug description

I train the model like this; my code is below:

trainer_kwargs["accelerator"] = 'gpu'
trainer_kwargs["devices"] = 8
trainer_kwargs["strategy"] = "ddp"
trainer = Trainer.from_argparse_args(trainer_config, **trainer_kwargs)
trainer.fit(model, data)

It runs fine and doesn't throw any error. But it doesn't run on 8 GPUs; instead, it runs only on the first GPU.
And it initializes only one MEMBER, like this:
1672801542612

I am so confused, because the progress bar looks entirely right. The length of my dataset is 1198099, and the progress bar shows 37457 steps per epoch. I set the batch size to 4, so 4 * 8 * 37457 = 1198624, which is almost equal to 1198099.
image
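The arithmetic above can be checked with a short sketch. The function names are hypothetical, and the numbers are the ones quoted in this report. The progress-bar total may also count validation batches, so the match is only approximate.

```python
def samples_covered(steps_per_epoch: int, batch_size: int, world_size: int) -> int:
    """Samples consumed per epoch if every rank runs `steps_per_epoch` batches."""
    return steps_per_epoch * batch_size * world_size


def implied_world_size(dataset_len: int, steps_per_epoch: int, batch_size: int) -> float:
    """Number of ranks the step count is consistent with, given the dataset length."""
    return dataset_len / (steps_per_epoch * batch_size)


if __name__ == "__main__":
    # Values quoted in the report: 37457 steps, batch size 4, 8 devices.
    print(samples_covered(37457, 4, 8))  # 1198624
    # Close to 8, i.e. the step count looks like the data is split 8 ways.
    print(implied_world_size(1198099, 37457, 4))
```

A ratio close to 8 suggests the dataloader is sharding for 8 ranks even though only one GPU shows activity.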

But the problem is that nvidia-smi shows only the first GPU running, as below:
image

I don't understand why this happens. I hope someone can help me, thanks a lot!

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0): I tried the latest and 1.7.3 and got the same problem
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 1.10): 1.12.1 cuda 11.3
#- Python version (e.g., 3.9): 3.8.5
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration: A100*8
#- How you installed Lightning(`conda`, `pip`, source): pip install pytorch_lightning==1.7.3
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @justusschock @awaelchli

@superhero-7 superhero-7 added the needs triage Waiting to be triaged by maintainers label Jan 4, 2023
@dolortaste

got the same problem

@superhero-7
Author

superhero-7 commented Jan 4, 2023

got the same problem

try

unset KUBERNETES_PORT

It works for me... I spent one night and one morning on it... TT
The same problem is reported here:
#5254
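For anyone who cannot change the shell environment, the same workaround can be sketched in Python. `drop_env_vars` is a hypothetical helper, not a Lightning API. It assumes the variable has to be removed before the Trainer is constructed, which this thread does not confirm.

```python
import os
from typing import Iterable, List, MutableMapping


def drop_env_vars(names: Iterable[str],
                  environ: MutableMapping[str, str] = os.environ) -> List[str]:
    """Remove the given variables from `environ`; return the names that were present."""
    removed = []
    for name in names:
        if environ.pop(name, None) is not None:
            removed.append(name)
    return removed


if __name__ == "__main__":
    # In-process equivalent of `unset KUBERNETES_PORT`.
    # Assumption: run this at the top of the script, before creating the Trainer.
    print("removed:", drop_env_vars(["KUBERNETES_PORT"]))
```

This only changes the environment of the current Python process and any children it launches afterwards.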

@dolortaste

unset KUBERNETES_PORT

Solved.. Thx

@awaelchli
Contributor

@superhero-7 Unfortunately I don't know how the KUBERNETES_PORT relates to this problem here, or even how it solved it. Does that mean this issue is closed, or are there still some open questions?

@superhero-7
Author

@superhero-7 Unfortunately I don't know how the KUBERNETES_PORT relates to this problem here, or even how it solved it. Does that mean this issue is closed, or are there still some open questions?

Our machines are managed by k8s. I suppose there may be some conflict over the GLOBAL RANK environment between the k8s settings and the pytorch_lightning ddp settings?

@Borda Borda added strategy: ddp DistributedDataParallel and removed needs triage Waiting to be triaged by maintainers labels Jan 9, 2023
@magehrig

magehrig commented Jan 18, 2023

I got the same issue but on a SLURM cluster. I have access to two SLURM clusters. Interestingly, on one cluster PL DDP works fine but on the second one, I experience this issue. Since I don't use K8s, unset KUBERNETES_PORT does not solve the issue.

I guess it would be really hard to reproduce this. Any pointers to what I could try?

@awaelchli
Contributor

You could try printing os.environ at the beginning of the script and comparing it between the two nodes. See if any env variables are set that shouldn't be, or if any are missing. You could also post the printout here if you like (but redact any sensitive information) so we can take a look.

Since you are using SLURM, make sure to follow exactly the instructions here.
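A minimal sketch of the os.environ comparison suggested above. `dump_env` and `diff_env` are hypothetical helper names. It assumes each machine writes its environment to a JSON file that is then loaded and compared in one place.

```python
import json
import os
from typing import Dict, Mapping, Tuple


def dump_env(path: str) -> None:
    """Write the current process environment to `path` as sorted JSON."""
    with open(path, "w") as fh:
        json.dump(dict(os.environ), fh, indent=1, sort_keys=True)


def diff_env(a: Mapping[str, str], b: Mapping[str, str]
             ) -> Tuple[Dict[str, str], Dict[str, str], Dict[str, Tuple[str, str]]]:
    """Return (only in a, only in b, present in both but with different values)."""
    only_a = {k: a[k] for k in a.keys() - b.keys()}
    only_b = {k: b[k] for k in b.keys() - a.keys()}
    changed = {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]}
    return only_a, only_b, changed


if __name__ == "__main__":
    # Call once per machine, e.g. at the top of the training script.
    dump_env("env_dump.json")
```

Loading both dumps with `json.load` and passing them to `diff_env` narrows the printout to the variables that differ. Redact sensitive values before posting.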

@magehrig

magehrig commented Jan 18, 2023

@awaelchli Great idea! I believe I followed the instructions correctly.
Since I use two different (SLURM) clusters, the sbatch scripts differ slightly, but the rest is the same.

For this test, I use two GPUs on a single node.

First sbatch script for the server on which there are no issues:

#!/usr/bin/env bash
#SBATCH --parsable
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=16G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:2
#SBATCH --output=/some/path/%j.out

module load nccl
source /some/path/conda.sh
conda activate myenv
srun python myscript.py ...
conda deactivate

Second sbatch script for the server where I observe the described issue:

#!/usr/bin/env bash
#SBATCH --parsable
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=16G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=3
#SBATCH --gpus=rtx_3090:2
#SBATCH --output=/some/path/%j.out

module load nccl
source /some/path/conda.sh
conda activate myenv
srun python myscript.py ...
conda deactivate
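The two sbatch headers above can be compared mechanically. The sketch below uses hypothetical helper names and handles only the long `--key=value` and bare `--flag` forms used in these scripts. It reports which directives differ. Whether any of those differences explains the behaviour is not established in this thread.

```python
from typing import Dict, Optional, Tuple


def sbatch_options(script: str) -> Dict[str, str]:
    """Collect `#SBATCH --key=value` (or bare `--flag`) directives from a batch script."""
    options = {}
    for line in script.splitlines():
        line = line.strip()
        if not line.startswith("#SBATCH"):
            continue
        directive = line[len("#SBATCH"):].strip()
        key, _, value = directive.partition("=")
        options[key] = value
    return options


def differing_options(a: Dict[str, str], b: Dict[str, str]
                      ) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each directive that differs to its (value in a, value in b); None if absent."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    working = "#SBATCH --ntasks-per-node=2\n#SBATCH --cpus-per-task=2\n#SBATCH --gres=gpu:2"
    failing = "#SBATCH --ntasks-per-node=2\n#SBATCH --cpus-per-task=3\n#SBATCH --gpus=rtx_3090:2"
    for key, values in differing_options(sbatch_options(working), sbatch_options(failing)).items():
        print(key, values)
```

For the scripts in this comment, the differing directives are `--cpus-per-task`, and `--gres` versus `--gpus` for the GPU request.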

Now, the os.environ output on the server where I observe no issues:

{'ACLOCAL_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/share/aclocal',
 'BASH_ENV': '/cluster/lmod-8.6.5/lmod/lmod/init/bash',
 'BASH_FUNC_ml%%': '() {  eval $($LMOD_DIR/ml_cmd "$@")\n}',
 'BASH_FUNC_module%%': '() {  eval $($LMOD_CMD bash "$@") && eval '
                       '$(${LMOD_SETTARG_CMD:-:} -s sh)\n'
                       '}',
 'CMAKE_PREFIX_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy',
 'CONDA_DEFAULT_ENV': 'rnn-st',
 'CONDA_EXE': '/data/user/programs/mambaforge/bin/conda',
 'CONDA_MKL_INTERFACE_LAYER_BACKUP': '',
 'CONDA_PREFIX': '/data/user/programs/mambaforge/envs/rnn-st',
 'CONDA_PROMPT_MODIFIER': '(rnn-st) ',
 'CONDA_PYTHON_EXE': '/data/user/programs/mambaforge/bin/python',
 'CONDA_SHLVL': '1',
 'CPATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/include',
 'CRC32C_SW_MODE': 'auto',
 'CUDA_DEVICE_ORDER': 'PCI_BUS_ID',
 'CUDA_HOME': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh',
 'CUDA_VISIBLE_DEVICES': '0,1',
 'DBUS_SESSION_BUS_ADDRESS': 'unix:path=/run/user/891944109/bus',
 'DISPLAY': 'u20-login-1:16.0',
 'ENVIRONMENT': 'BATCH',
 'GPU_DEVICE_ORDINAL': '0,1',
 'HOME': '/home/user',
 'HOSTNAME': 'u20-computeibmgpu-vesta7',
 'LANG': 'C.UTF-8',
 'LC_ADDRESS': 'de_CH.UTF-8',
 'LC_IDENTIFICATION': 'de_CH.UTF-8',
 'LC_MEASUREMENT': 'de_CH.UTF-8',
 'LC_MONETARY': 'de_CH.UTF-8',
 'LC_NAME': 'de_CH.UTF-8',
 'LC_NUMERIC': 'de_CH.UTF-8',
 'LC_PAPER': 'de_CH.UTF-8',
 'LC_TELEPHONE': 'de_CH.UTF-8',
 'LC_TIME': 'de_CH.UTF-8',
 'LD_LIBRARY_PATH': '/data/user/programs/mambaforge/envs/rnn-st/lib/python3.9/site-packages/cv2/../../lib64:/cluster/munge-0.5.14/lib:/cluster/slurm-20-11-8-1/lib:/cluster/pmix-4.1.2/lib:/cluster/libevent-2.1.12/lib',
 'LESS': '-R',
 'LESSCLOSE': '/usr/bin/lesspipe %s %s',
 'LESSOPEN': '| /usr/bin/lesspipe %s',
 'LIBRARY_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s/lib:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/lib:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/lib:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh/lib64:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/lib:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/lib:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/lib:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/lib',
 'LMOD_CMD': '/cluster/lmod-8.6.5/lmod/lmod/libexec/lmod',
 'LMOD_COLORIZE': 'yes',
 'LMOD_DIR': '/cluster/lmod-8.6.5/lmod/lmod/libexec',
 'LMOD_FAMILY_GRES': 'v100',
 'LMOD_FAMILY_GRES_VERSION': 'false',
 'LMOD_FAMILY_RESOURCE': 'multigpu',
 'LMOD_FAMILY_RESOURCE_VERSION': 'false',
 'LMOD_FULL_SETTARG_SUPPORT': 'no',
 'LMOD_MODULERCFILE': '/apps/etc/modules/.modulerc.lua',
 'LMOD_PACKAGE_PATH': '/cluster/lmod-8.6.5',
 'LMOD_PKG': '/cluster/lmod-8.6.5/lmod/lmod',
 'LMOD_PREPEND_BLOCK': 'normal',
 'LMOD_ROOT': '/cluster/lmod-8.6.5/lmod',
 'LMOD_SETTARG_CMD': ':',
 'LMOD_SETTARG_FULL_SUPPORT': 'no',
 'LMOD_VERSION': '8.6.5',
 'LMOD_arch': 'x86_64',
 'LMOD_sys': 'Linux',
 'LOADEDMODULES': 'v100:multigpu:libiconv/1.16-pdflaob:xz/5.2.5-mhrz5su:zlib/1.2.12-j4b6zeg:libxml2/2.9.12-koohqap:cuda/11.4.4-ldlywt5:libnl/3.3.0-qtnpjoa:rdma-core/41.0-hquyri7:nccl/2.11.4-1',
 'LOGNAME': 'user',
 'LSCOLORS': 'Gxfxcxdxbxegedabagacad',
 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:',
 'MANPATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/share/man:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/share/man:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/share/man:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/share/man:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/share/man:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/share/man:/cluster/lmod-8.6.5/lmod/lmod/share/man::/var/cfengine/share/man',
 'MKL_INTERFACE_LAYER': 'LP64,GNU',
 'MKL_NUM_THREADS': '1',
 'MODULEPATH': '/apps/etc/modules/multigpu:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core:/apps/etc/modules/system:/apps/etc/modules/containers:/apps/etc/modules/default:/apps/etc/modules/flavors',
 'MODULEPATH_ROOT': '/apps/etc/modules',
 'MODULESHOME': '/cluster/lmod-8.6.5/lmod/lmod',
 'MOTD_SHOWN': 'pam',
 'NUMEXPR_NUM_THREADS': '1',
 'OLDPWD': '/home/user',
 'OMP_NUM_THREADS': '1',
 'OPENBLAS_NUM_THREADS': '1',
 'OPENCV_OPENCL_RUNTIME': 'disabled',
 'PAGER': 'less',
 'PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/bin:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/bin:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh/bin:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/bin:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/bin:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/bin:/data/user/programs/mambaforge/envs/rnn-st/bin:/data/user/programs/mambaforge/condabin:/cluster/slurm-20-11-8-1/bin:/cluster/slurm-20-11-8-1/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/var/cfengine/bin:/usr/local/go/bin',
 'PKG_CONFIG_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s/lib/pkgconfig:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/lib/pkgconfig:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/lib/pkgconfig:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/lib/pkgconfig:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/lib/pkgconfig:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/lib/pkgconfig',
 'PMI_FD': '13',
 'PMI_JOBID': '72348.0',
 'PMI_RANK': '1',
 'PMI_SIZE': '2',
 'PWD': '/data/user/code/rnn-st/scripts/slurm',
 'PYTORCH_NVML_BASED_CUDA_CHECK': '1',
 'QT_QPA_FONTDIR': '/data/user/programs/mambaforge/envs/rnn-st/lib/python3.9/site-packages/cv2/qt/fonts',
 'QT_QPA_PLATFORM_PLUGIN_PATH': '/data/user/programs/mambaforge/envs/rnn-st/lib/python3.9/site-packages/cv2/qt/plugins',
 'ROCR_VISIBLE_DEVICES': '0,1',
 'SACCT_FORMAT': 'jobid%-6,jobname,maxrss,maxvmsize,alloccpus,elapsed%12,state,exitcode%6',
 'SALLOC_CONSTRAINT': 'MULTIGPU',
 'SBATCH_CONSTRAINT': 'MULTIGPU',
 'SHELL': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zsh-5.9-cghjsxkx626zdkbcilxi3tk3nshivvo6/bin/zsh',
 'SHLVL': '3',
 'SLURMD_NODENAME': 'u20-computeibmgpu-vesta7',
 'SLURM_CLUSTER_NAME': 'cluster',
 'SLURM_CONF': '/cluster/slurm-20-11-8-1/etc/slurm.conf',
 'SLURM_CONSTRAINT': 'MULTIGPU',
 'SLURM_CPUS_ON_NODE': '4',
 'SLURM_CPUS_PER_TASK': '2',
 'SLURM_CPU_BIND': 'quiet,mask_cpu:0x020000020000,0x400000400000',
 'SLURM_CPU_BIND_LIST': '0x020000020000,0x400000400000',
 'SLURM_CPU_BIND_TYPE': 'mask_cpu:',
 'SLURM_CPU_BIND_VERBOSE': 'quiet',
 'SLURM_DISTRIBUTION': 'block',
 'SLURM_GTIDS': '0,1',
 'SLURM_JOBID': '72348',
 'SLURM_JOB_ACCOUNT': 'something',
 'SLURM_JOB_CPUS_PER_NODE': '4',
 'SLURM_JOB_CPUS_PER_NODE_PACK_GROUP_0': '4',
 'SLURM_JOB_GID': '891944109',
 'SLURM_JOB_GPUS': '5,6',
 'SLURM_JOB_ID': '72348',
 'SLURM_JOB_NAME': 'train.job',
 'SLURM_JOB_NODELIST': 'u20-computeibmgpu-vesta7',
 'SLURM_JOB_NUM_NODES': '1',
 'SLURM_JOB_PARTITION': 'standard',
 'SLURM_JOB_QOS': 'normal',
 'SLURM_JOB_UID': '891944109',
 'SLURM_JOB_USER': 'user',
 'SLURM_LAUNCH_NODE_IPADDR': '10.129.48.36',
 'SLURM_LOCALID': '1',
 'SLURM_MEM_PER_CPU': '16384',
 'SLURM_MPI_TYPE': 'pmi2',
 'SLURM_NNODES': '1',
 'SLURM_NODEID': '0',
 'SLURM_NODELIST': 'u20-computeibmgpu-vesta7',
 'SLURM_NODE_ALIASES': '(null)',
 'SLURM_NPROCS': '2',
 'SLURM_NTASKS': '2',
 'SLURM_NTASKS_PER_NODE': '2',
 'SLURM_PRIO_PROCESS': '0',
 'SLURM_PROCID': '1',
 'SLURM_SRUN_COMM_HOST': '10.129.48.36',
 'SLURM_SRUN_COMM_PORT': '39247',
 'SLURM_STEPID': '0',
 'SLURM_STEP_GPUS': '5,6',
 'SLURM_STEP_ID': '0',
 'SLURM_STEP_LAUNCHER_PORT': '39247',
 'SLURM_STEP_NODELIST': 'u20-computeibmgpu-vesta7',
 'SLURM_STEP_NUM_NODES': '1',
 'SLURM_STEP_NUM_TASKS': '2',
 'SLURM_STEP_RESV_PORTS': '12585-12587',
 'SLURM_STEP_TASKS_PER_NODE': '2',
 'SLURM_SUBMIT_DIR': '/data/user/code/rnn-st/scripts/slurm',
 'SLURM_SUBMIT_HOST': 'u20-computeibmgpu-vesta7',
 'SLURM_TASKS_PER_NODE': '2',
 'SLURM_TASK_PID': '577439',
 'SLURM_TOPOLOGY_ADDR': 'u20-computeibmgpu-vesta7',
 'SLURM_TOPOLOGY_ADDR_PATTERN': 'node',
 'SLURM_UMASK': '0002',
 'SLURM_WORKING_CLUSTER': 'cluster:u20-controller.hydra:6817:9216:109',
 'SPACK_ROOT': '/apps',
 'SQUEUE_FORMAT': '%8i %7u %12T %.3C %.6m %.12M %20e %R',
 'SRUN_DEBUG': '3',
 'SSH_AGENT_PID': '20309',
 'SSH_AUTH_SOCK': '/tmp/ssh-TLK7wup2nTiA/agent.20307',
 'SSH_CLIENT': '195.176.113.242 35588 22',
 'SSH_CONNECTION': '195.176.113.235 32866 172.16.0.75 22',
 'SSH_TTY': '/dev/pts/0',
 'TERM': 'tmux-256color',
 'TMPDIR': '/data/user/tmp/72348',
 'TMUX': '/tmp//tmux-891944109/default,132182,0',
 'TMUX_PANE': '%41',
 'TMUX_PLUGIN_MANAGER_PATH': '/home/user/.tmux/plugins/',
 'USER': 'user',
 'VECLIB_MAXIMUM_THREADS': '1',
 'WANDB_REQUIRE_SERVICE': 'True',
 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop',
 'XDG_RUNTIME_DIR': '/run/user/891944109',
 'XDG_SESSION_CLASS': 'user',
 'XDG_SESSION_ID': '511',
 'XDG_SESSION_TYPE': 'tty',
 'ZSH': '/home/user/.myconfig/zsh/oh-my-zsh',
 'ZSH_TMUX_CONFIG': '/home/user/.tmux.conf',
 'ZSH_TMUX_TERM': 'screen-256color',
 '_': '/cluster/slurm-20-11-8-1/bin/srun',
 '_CE_CONDA': '',
 '_CE_M': '',
 '_LMFILES_': '/apps/etc/modules/flavors/v100.lua:/apps/etc/modules/flavors/multigpu.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/libiconv/1.16-pdflaob.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/xz/5.2.5-mhrz5su.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/zlib/1.2.12-j4b6zeg.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/libxml2/2.9.12-koohqap.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/cuda/11.4.4-ldlywt5.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/libnl/3.3.0-qtnpjoa.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/rdma-core/41.0-hquyri7.lua:/apps/etc/modules/multigpu/nccl/2.11.4-1.lua',
 '_ModuleTable001_': 'X01vZHVsZVRhYmxlXyA9IHsKTVR2ZXJzaW9uID0gMywKY19yZWJ1aWxkVGltZSA9IDcyMDAuMCwKY19zaG9ydFRpbWUgPSAwLjM5MTMyNDk5Njk0ODI0LApkZXB0aFQgPSB7fSwKZmFtaWx5ID0gewpncmVzID0gInYxMDAiLApyZXNvdXJjZSA9ICJtdWx0aWdwdSIsCn0sCm1UID0gewpjdWRhID0gewpmbiA9ICIvYXBwcy9zaGFyZS9zcGFjay9sbW9kL2xpbnV4LXVidW50dTIwLjA0LXg4Nl82NC9Db3JlL2N1ZGEvMTEuNC40LWxkbHl3dDUubHVhIiwKZnVsbE5hbWUgPSAiY3VkYS8xMS40LjQtbGRseXd0NSIsCmxvYWRPcmRlciA9IDcsCnByb3BUID0ge30sCnN0YWNrRGVwdGggPSAwLApzdGF0dXMgPSAiYWN0aXZlIiwKdXNlck5hbWUgPSAiY3VkYS8xMS40LjQtbGRseXd0NSIs',
 '_ModuleTable002_': 'CndWID0gIjAwMDAwMDAxMS4wMDAwMDAwMDQuMDAwMDAwMDA0LipsZGx5d3QuMDAwMDAwMDA1Lip6ZmluYWwiLAp9LApsaWJpY29udiA9IHsKZm4gPSAiL2FwcHMvc2hhcmUvc3BhY2svbG1vZC9saW51eC11YnVudHUyMC4wNC14ODZfNjQvQ29yZS9saWJpY29udi8xLjE2LXBkZmxhb2IubHVhIiwKZnVsbE5hbWUgPSAibGliaWNvbnYvMS4xNi1wZGZsYW9iIiwKbG9hZE9yZGVyID0gMywKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN0YXR1cyA9ICJhY3RpdmUiLAp1c2VyTmFtZSA9ICJsaWJpY29udi8xLjE2LXBkZmxhb2IiLAp3ViA9ICIwMDAwMDAwMDEuMDAwMDAwMDE2LipkZmxhb2IuKnpmaW5hbCIsCn0sCmxpYm5sID0gewpmbiA9ICIvYXBwcy9zaGFyZS9zcGFjay9s',
 '_ModuleTable003_': 'bW9kL2xpbnV4LXVidW50dTIwLjA0LXg4Nl82NC9Db3JlL2xpYm5sLzMuMy4wLXF0bnBqb2EubHVhIiwKZnVsbE5hbWUgPSAibGlibmwvMy4zLjAtcXRucGpvYSIsCmxvYWRPcmRlciA9IDgsCnByb3BUID0ge30sCnN0YWNrRGVwdGggPSAwLApzdGF0dXMgPSAiYWN0aXZlIiwKdXNlck5hbWUgPSAibGlibmwvMy4zLjAtcXRucGpvYSIsCndWID0gIjAwMDAwMDAwMy4wMDAwMDAwMDMuKnF0bnBqb2EuKnpmaW5hbCIsCn0sCmxpYnhtbDIgPSB7CmZuID0gIi9hcHBzL3NoYXJlL3NwYWNrL2xtb2QvbGludXgtdWJ1bnR1MjAuMDQteDg2XzY0L0NvcmUvbGlieG1sMi8yLjkuMTIta29vaHFhcC5sdWEiLApmdWxsTmFtZSA9ICJsaWJ4bWwyLzIuOS4xMi1rb29ocWFwIiwKbG9hZE9yZGVy',
 '_ModuleTable004_': 'ID0gNiwKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN0YXR1cyA9ICJhY3RpdmUiLAp1c2VyTmFtZSA9ICJsaWJ4bWwyLzIuOS4xMi1rb29ocWFwIiwKd1YgPSAiMDAwMDAwMDAyLjAwMDAwMDAwOS4wMDAwMDAwMTIuKmtvb2hxYXAuKnpmaW5hbCIsCn0sCm11bHRpZ3B1ID0gewpmbiA9ICIvYXBwcy9ldGMvbW9kdWxlcy9mbGF2b3JzL211bHRpZ3B1Lmx1YSIsCmZ1bGxOYW1lID0gIm11bHRpZ3B1IiwKbG9hZE9yZGVyID0gMiwKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN0YXR1cyA9ICJhY3RpdmUiLAp1c2VyTmFtZSA9ICJtdWx0aWdwdSIsCndWID0gIk0uKnpmaW5hbCIsCn0sCm5jY2wgPSB7CmZuID0gIi9hcHBzL2V0Yy9tb2R1bGVzL211bHRpZ3B1L25jY2wv',
 '_ModuleTable005_': 'Mi4xMS40LTEubHVhIiwKZnVsbE5hbWUgPSAibmNjbC8yLjExLjQtMSIsCmxvYWRPcmRlciA9IDEwLApwcm9wVCA9IHt9LApzdGFja0RlcHRoID0gMCwKc3RhdHVzID0gImFjdGl2ZSIsCnVzZXJOYW1lID0gIm5jY2wiLAp3ViA9ICIwMDAwMDAwMDIuMDAwMDAwMDExLjAwMDAwMDAwNC4qemZpbmFsLS4wMDAwMDAwMDEuKnpmaW5hbCIsCn0sClsicmRtYS1jb3JlIl0gPSB7CmZuID0gIi9hcHBzL3NoYXJlL3NwYWNrL2xtb2QvbGludXgtdWJ1bnR1MjAuMDQteDg2XzY0L0NvcmUvcmRtYS1jb3JlLzQxLjAtaHF1eXJpNy5sdWEiLApmdWxsTmFtZSA9ICJyZG1hLWNvcmUvNDEuMC1ocXV5cmk3IiwKbG9hZE9yZGVyID0gOSwKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN0YXR1',
 '_ModuleTable006_': 'cyA9ICJhY3RpdmUiLAp1c2VyTmFtZSA9ICJyZG1hLWNvcmUvNDEuMC1ocXV5cmk3IiwKd1YgPSAiMDAwMDAwMDQxLipocXV5cmkuMDAwMDAwMDA3Lip6ZmluYWwiLAp9LAp2MTAwID0gewpmbiA9ICIvYXBwcy9ldGMvbW9kdWxlcy9mbGF2b3JzL3YxMDAubHVhIiwKZnVsbE5hbWUgPSAidjEwMCIsCmxvYWRPcmRlciA9IDEsCnByb3BUID0ge30sCnN0YWNrRGVwdGggPSAwLApzdGF0dXMgPSAiYWN0aXZlIiwKdXNlck5hbWUgPSAidjEwMCIsCndWID0gIk0uKnpmaW5hbCIsCn0sCnh6ID0gewpmbiA9ICIvYXBwcy9zaGFyZS9zcGFjay9sbW9kL2xpbnV4LXVidW50dTIwLjA0LXg4Nl82NC9Db3JlL3h6LzUuMi41LW1ocno1c3UubHVhIiwKZnVsbE5hbWUgPSAieHovNS4yLjUtbWhy',
 '_ModuleTable007_': 'ejVzdSIsCmxvYWRPcmRlciA9IDQsCnByb3BUID0ge30sCnN0YWNrRGVwdGggPSAwLApzdGF0dXMgPSAiYWN0aXZlIiwKdXNlck5hbWUgPSAieHovNS4yLjUtbWhyejVzdSIsCndWID0gIjAwMDAwMDAwNS4wMDAwMDAwMDIuMDAwMDAwMDA1LiptaHJ6LjAwMDAwMDAwNS4qc3UuKnpmaW5hbCIsCn0sCnpsaWIgPSB7CmZuID0gIi9hcHBzL3NoYXJlL3NwYWNrL2xtb2QvbGludXgtdWJ1bnR1MjAuMDQteDg2XzY0L0NvcmUvemxpYi8xLjIuMTItajRiNnplZy5sdWEiLApmdWxsTmFtZSA9ICJ6bGliLzEuMi4xMi1qNGI2emVnIiwKbG9hZE9yZGVyID0gNSwKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN0YXR1cyA9ICJhY3RpdmUiLAp1c2VyTmFtZSA9ICJ6bGliLzEuMi4xMi1q',
 '_ModuleTable008_': 'NGI2emVnIiwKd1YgPSAiMDAwMDAwMDAxLjAwMDAwMDAwMi4wMDAwMDAwMTIuKmouMDAwMDAwMDA0LipiLjAwMDAwMDAwNi4qemVnLip6ZmluYWwiLAp9LAp9LAptcGF0aEEgPSB7CiIvYXBwcy9ldGMvbW9kdWxlcy9tdWx0aWdwdSIKLCAiL2FwcHMvc2hhcmUvc3BhY2svbG1vZC9saW51eC11YnVudHUyMC4wNC14ODZfNjQvQ29yZSIKLCAiL2FwcHMvZXRjL21vZHVsZXMvc3lzdGVtIiwgIi9hcHBzL2V0Yy9tb2R1bGVzL2NvbnRhaW5lcnMiCiwgIi9hcHBzL2V0Yy9tb2R1bGVzL2RlZmF1bHQiLCAiL2FwcHMvZXRjL21vZHVsZXMvZmxhdm9ycyIsCn0sCnN5c3RlbUJhc2VNUEFUSCA9ICIvYXBwcy9zaGFyZS9zcGFjay9sbW9kL2xpbnV4LXVidW50dTIwLjA0LXg4Nl82NC9Db3Jl',
 '_ModuleTable009_': 'Oi9hcHBzL2V0Yy9tb2R1bGVzL3N5c3RlbTovYXBwcy9ldGMvbW9kdWxlcy9jb250YWluZXJzOi9hcHBzL2V0Yy9tb2R1bGVzL2RlZmF1bHQ6L2FwcHMvZXRjL21vZHVsZXMvZmxhdm9ycyIsCn0K',
 '_ModuleTable_Sz_': '9',
 '_ZSH_TMUX_FIXED_CONFIG': '/home/user/.myconfig/zsh/oh-my-zsh/plugins/tmux/tmux.extra.conf',
 '__LMOD_REF_COUNT_ACLOCAL_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/share/aclocal:2',
 '__LMOD_REF_COUNT_CMAKE_PREFIX_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy:2',
 '__LMOD_REF_COUNT_CPATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/include:1',
 '__LMOD_REF_COUNT_LIBRARY_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s/lib:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/lib:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/lib:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh/lib64:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/lib:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/lib:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/lib:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/lib:1',
 '__LMOD_REF_COUNT_LOADEDMODULES': 'v100:1;multigpu:1;libiconv/1.16-pdflaob:1;xz/5.2.5-mhrz5su:1;zlib/1.2.12-j4b6zeg:1;libxml2/2.9.12-koohqap:1;cuda/11.4.4-ldlywt5:1;libnl/3.3.0-qtnpjoa:1;rdma-core/41.0-hquyri7:1;nccl/2.11.4-1:1',
 '__LMOD_REF_COUNT_MANPATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/share/man:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/share/man:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/share/man:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/share/man:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/share/man:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/share/man:2;/cluster/lmod-8.6.5/lmod/lmod/share/man:1;/var/cfengine/share/man:1',
 '__LMOD_REF_COUNT_MODULEPATH': '/apps/etc/modules/multigpu:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core:1;/apps/etc/modules/system:1;/apps/etc/modules/containers:1;/apps/etc/modules/default:1;/apps/etc/modules/flavors:1',
 '__LMOD_REF_COUNT_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/bin:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/bin:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh/bin:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/bin:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/bin:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/bin:2;/data/user/programs/mambaforge/envs/rnn-st/bin:1;/data/user/programs/mambaforge/condabin:1;/cluster/slurm-20-11-8-1/bin:1;/cluster/slurm-20-11-8-1/sbin:1;/usr/local/sbin:1;/usr/local/bin:1;/usr/sbin:1;/usr/bin:1;/sbin:1;/bin:1;/usr/games:1;/usr/local/games:1;/snap/bin:1;/var/cfengine/bin:1;/usr/local/go/bin:3',
 '__LMOD_REF_COUNT_PKG_CONFIG_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s/lib/pkgconfig:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/lib/pkgconfig:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/lib/pkgconfig:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/lib/pkgconfig:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/lib/pkgconfig:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/lib/pkgconfig:2',
 '__LMOD_REF_COUNT__LMFILES_': '/apps/etc/modules/flavors/v100.lua:1;/apps/etc/modules/flavors/multigpu.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/libiconv/1.16-pdflaob.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/xz/5.2.5-mhrz5su.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/zlib/1.2.12-j4b6zeg.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/libxml2/2.9.12-koohqap.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/cuda/11.4.4-ldlywt5.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/libnl/3.3.0-qtnpjoa.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/rdma-core/41.0-hquyri7.lua:1;/apps/etc/modules/multigpu/nccl/2.11.4-1.lua:1',
 '__LMOD_SET_FPATH': '1',
 'ftp_proxy': 'http://wtp.hydra:8080',
 'http_proxy': 'http://wtp.hydra:8080',
 'https_proxy': 'http://wtp.hydra:8080',
 'no_proxy': 'localhost,127.0.0.1,10.129.60.84,.hydra,.int,',
 'tmux_version': '3.0'}
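The dump above is from one of the two tasks launched by srun. A small sketch of reading the rank layout from it follows. The class and function names are hypothetical. It assumes the usual srun convention that `SLURM_NTASKS`, `SLURM_PROCID`, `SLURM_LOCALID` and `SLURM_NODEID` hold the world size, global rank, local rank and node rank. Whether the Lightning version used here reads exactly these variables is not confirmed in this thread.

```python
from typing import Mapping, NamedTuple


class SlurmRankInfo(NamedTuple):
    world_size: int
    global_rank: int
    local_rank: int
    node_rank: int


def slurm_rank_info(environ: Mapping[str, str]) -> SlurmRankInfo:
    """Read the per-task rank layout from SLURM variables (raises KeyError if one is missing)."""
    return SlurmRankInfo(
        world_size=int(environ["SLURM_NTASKS"]),
        global_rank=int(environ["SLURM_PROCID"]),
        local_rank=int(environ["SLURM_LOCALID"]),
        node_rank=int(environ["SLURM_NODEID"]),
    )


if __name__ == "__main__":
    # Values copied from the dump of the working cluster above.
    dumped = {"SLURM_NTASKS": "2", "SLURM_PROCID": "1",
              "SLURM_LOCALID": "1", "SLURM_NODEID": "0"}
    print(slurm_rank_info(dumped))
```

Running the same check on every task of the failing cluster would show whether each process sees a distinct rank and the expected world size of 2.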

The os.environ output on the server where I observe the described issue:

{'BASH_ENV': '/cluster/apps/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod/init/bash',
 'BASH_FUNC_ml%%': '() {  eval $($LMOD_DIR/ml_cmd "$@")\n}',
 'BASH_FUNC_ml()': '() {  eval $($LMOD_DIR/ml_cmd "$@")\n}',
 'BASH_FUNC_module%%': '() {  eval $($LMOD_CMD bash "$@") && eval '
                       '$(${LMOD_SETTARG_CMD:-:} -s sh)\n'
                       '}',
 'BASH_FUNC_module()': '() {  eval $($LMOD_CMD bash "$@") && eval '
                       '$(${LMOD_SETTARG_CMD:-:} -s sh)\n'
                       '}',
 'CC': '/usr/bin/gcc',
 'CMAKE_PREFIX_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3:/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf:/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa',
 'CONDA_DEFAULT_ENV': 'rnn-st',
 'CONDA_EXE': '/cluster/project/lab/me/programs/mambaforge/bin/conda',
 'CONDA_MKL_INTERFACE_LAYER_BACKUP': '',
 'CONDA_PREFIX': '/cluster/project/lab/me/programs/mambaforge/envs/rnn-st',
 'CONDA_PROMPT_MODIFIER': '(rnn-st) ',
 'CONDA_PYTHON_EXE': '/cluster/project/lab/me/programs/mambaforge/bin/python',
 'CONDA_SHLVL': '1',
 'CONSUL_HTTP_ADDR': 'unix:///var/run/consul/http.sock',
 'CPATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/include',
 'CPP': '/usr/bin/cpp',
 'CRC32C_SW_MODE': 'auto',
 'CUDA_DEVICE_ORDER': 'PCI_BUS_ID',
 'CUDA_VISIBLE_DEVICES': '0,1',
 'CXX': '/usr/bin/g++',
 'DISPLAY': 'localhost:11.0',
 'ENVIRONMENT': 'BATCH',
 'F77': '/usr/bin/gfortran',
 'F90': '/usr/bin/gfortran',
 'FC': '/usr/bin/gfortran',
 'HISTCONTROL': 'ignoredups',
 'HISTSIZE': '50000',
 'HOME': '/cluster/home/user',
 'HOSTNAME': 'eu-g4-015',
 'I_MPI_PMI_LIBRARY': '/cluster/apps/slurm/lib/libpmi2.so',
 'LANG': 'en_US.UTF-8',
 'LD_LIBRARY_PATH': '/cluster/project/lab/me/programs/mambaforge/envs/rnn-st/lib/python3.9/site-packages/cv2/../../lib64:/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/lib:/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/lib:/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/lib::',
 'LESS': '-R',
 'LESSOPEN': '||/usr/bin/lesspipe.sh %s',
 'LIBGL_ALWAYS_INDIRECT': '1',
 'LIBRARY_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/lib:/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/lib',
 'LMOD_CMD': '/cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod/libexec/lmod',
 'LMOD_DIR': '/cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod/libexec',
 'LMOD_FAMILY_COMPILER': 'gcc',
 'LMOD_FAMILY_COMPILER_VERSION': '4.8.5',
 'LMOD_PKG': '/cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod',
 'LMOD_SETTARG_FULL_SUPPORT': 'no',
 'LMOD_SYSTEM_DEFAULT_MODULES': 'StdEnv:gcc/4.8.5',
 'LMOD_VERSION': '7.7.13',
 'LMOD_sys': 'Linux',
 'LOADEDMODULES': 'StdEnv:gcc/4.8.5:zsh/5.8:tmux/3.2a:proxy:nccl/2.11.4-1',
 'LOGNAME': 'user',
 'LSCOLORS': 'Gxfxcxdxbxegedabagacad',
 'LSF_BINDIR': '/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/bin',
 'LSF_ENVDIR': '/cluster/apps/lsf/conf',
 'LSF_LIBDIR': '/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/lib',
 'LSF_SERVERDIR': '/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/etc',
 'LS_COLORS': 'rs=0:di=38;5;27:ln=38;5;51:mh=44;38;5;15:pi=40;38;5;11:so=38;5;13:do=38;5;5:bd=48;5;232;38;5;11:cd=48;5;232;38;5;3:or=48;5;232;38;5;9:mi=05;48;5;232;38;5;15:su=48;5;196;38;5;15:sg=48;5;11;38;5;16:ca=48;5;196;38;5;226:tw=48;5;10;38;5;16:ow=48;5;10;38;5;21:st=48;5;21;38;5;15:ex=38;5;34:*.tar=38;5;9:*.tgz=38;5;9:*.arc=38;5;9:*.arj=38;5;9:*.taz=38;5;9:*.lha=38;5;9:*.lz4=38;5;9:*.lzh=38;5;9:*.lzma=38;5;9:*.tlz=38;5;9:*.txz=38;5;9:*.tzo=38;5;9:*.t7z=38;5;9:*.zip=38;5;9:*.z=38;5;9:*.Z=38;5;9:*.dz=38;5;9:*.gz=38;5;9:*.lrz=38;5;9:*.lz=38;5;9:*.lzo=38;5;9:*.xz=38;5;9:*.bz2=38;5;9:*.bz=38;5;9:*.tbz=38;5;9:*.tbz2=38;5;9:*.tz=38;5;9:*.deb=38;5;9:*.rpm=38;5;9:*.jar=38;5;9:*.war=38;5;9:*.ear=38;5;9:*.sar=38;5;9:*.rar=38;5;9:*.alz=38;5;9:*.ace=38;5;9:*.zoo=38;5;9:*.cpio=38;5;9:*.7z=38;5;9:*.rz=38;5;9:*.cab=38;5;9:*.jpg=38;5;13:*.jpeg=38;5;13:*.gif=38;5;13:*.bmp=38;5;13:*.pbm=38;5;13:*.pgm=38;5;13:*.ppm=38;5;13:*.tga=38;5;13:*.xbm=38;5;13:*.xpm=38;5;13:*.tif=38;5;13:*.tiff=38;5;13:*.png=38;5;13:*.svg=38;5;13:*.svgz=38;5;13:*.mng=38;5;13:*.pcx=38;5;13:*.mov=38;5;13:*.mpg=38;5;13:*.mpeg=38;5;13:*.m2v=38;5;13:*.mkv=38;5;13:*.webm=38;5;13:*.ogm=38;5;13:*.mp4=38;5;13:*.m4v=38;5;13:*.mp4v=38;5;13:*.vob=38;5;13:*.qt=38;5;13:*.nuv=38;5;13:*.wmv=38;5;13:*.asf=38;5;13:*.rm=38;5;13:*.rmvb=38;5;13:*.flc=38;5;13:*.avi=38;5;13:*.fli=38;5;13:*.flv=38;5;13:*.gl=38;5;13:*.dl=38;5;13:*.xcf=38;5;13:*.xwd=38;5;13:*.yuv=38;5;13:*.cgm=38;5;13:*.emf=38;5;13:*.axv=38;5;13:*.anx=38;5;13:*.ogv=38;5;13:*.ogx=38;5;13:*.aac=38;5;45:*.au=38;5;45:*.flac=38;5;45:*.mid=38;5;45:*.midi=38;5;45:*.mka=38;5;45:*.mp3=38;5;45:*.mpc=38;5;45:*.ogg=38;5;45:*.ra=38;5;45:*.wav=38;5;45:*.axa=38;5;45:*.oga=38;5;45:*.spx=38;5;45:*.xspf=38;5;45:',
 'MAIL': '/var/spool/mail/user',
 'MANPATH': '/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf/share/man:/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/share/man:/cluster/apps/sfos/share/man/man1:/cluster/apps/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod/share/man:/cluster/apps/lsf/10.1/man::',
 'MKL_INTERFACE_LAYER': 'LP64,GNU',
 'MKL_NUM_THREADS': '1',
 'MODULEPATH': '/cluster/apps/lmodules/Compiler/gcc/4.8.5:/cluster/apps/lmodules/Linux:/cluster/apps/lmodules/Core',
 'MODULEPATH_ROOT': '/cluster/apps/lmodules',
 'MODULESHOME': '/cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod',
 'NCCL_ROOT': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3',
 'NUMEXPR_NUM_THREADS': '1',
 'OLDPWD': '/cluster/project/lab/me/code/rnn-st',
 'OMP_NUM_THREADS': '1',
 'OPENBLAS_NUM_THREADS': '1',
 'OPENCV_OPENCL_RUNTIME': 'disabled',
 'PAGER': 'less',
 'PATH': '/cluster/project/lab/me/programs/mambaforge/envs/rnn-st/bin:/cluster/project/lab/me/programs/mambaforge/condabin:/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf/bin:/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/bin:/cluster/apps/local:/cluster/apps/sfos/bin:/cluster/apps/slurm/bin:/usr/lib64/qt-3.3/bin:/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/cluster/home/user/.local/bin:/cluster/home/user/bin:/usr/local/go/bin:/usr/local/go/bin',
 'PKG_CONFIG_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/lib/pkgconfig',
 'PMI_FD': '10',
 'PMI_JOBID': '7099018.0',
 'PMI_RANK': '0',
 'PMI_SIZE': '1',
 'PWD': '/cluster/project/lab/me/code/rnn-st/scripts/slurm',
 'PYTORCH_NVML_BASED_CUDA_CHECK': '1',
 'QTDIR': '/usr/lib64/qt-3.3',
 'QTINC': '/usr/lib64/qt-3.3/include',
 'QTLIB': '/usr/lib64/qt-3.3/lib',
 'QT_GRAPHICSSYSTEM_CHECKED': '1',
 'QT_QPA_FONTDIR': '/cluster/project/lab/me/programs/mambaforge/envs/rnn-st/lib/python3.9/site-packages/cv2/qt/fonts',
 'QT_QPA_PLATFORM_PLUGIN_PATH': '/cluster/project/lab/me/programs/mambaforge/envs/rnn-st/lib/python3.9/site-packages/cv2/qt/plugins',
 'SCRATCH': '/cluster/scratch/user',
 'SHELL': '/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/bin/zsh',
 'SHLVL': '3',
 'SHOST': 'eu-login-41',
 'SLURMD_NODENAME': 'eu-g4-015',
 'SLURM_CLUSTER_NAME': 'cluster',
 'SLURM_CONF': '/cluster/slurm/adm/etc/slurm.conf',
 'SLURM_CPUS_ON_NODE': '3',
 'SLURM_CPUS_PER_TASK': '3',
 'SLURM_CPU_BIND_LIST': '0x0000000000000000000000000000001C',
 'SLURM_CPU_BIND_TYPE': 'mask_cpu:',
 'SLURM_CPU_BIND_VERBOSE': 'quiet',
 'SLURM_CPU_BIND': 'quiet,mask_cpu:0x0000000000000000000000000000001C',
 'SLURM_DISTRIBUTION': 'cyclic',
 'SLURM_GPUS': 'nvidia_geforce_rtx_3090:2',
 'SLURM_GPUS_ON_NODE': '2',
 'SLURM_GTIDS': '0',
 'SLURM_JOBID': '7099018',
 'SLURM_JOB_ACCOUNT': 'gpuhe/es_scara',
 'SLURM_JOB_CPUS_PER_NODE': '3',
 'SLURM_JOB_CPUS_PER_NODE_PACK_GROUP_0': '3',
 'SLURM_JOB_GID': '476131',
 'SLURM_JOB_GPUS': '2,3',
 'SLURM_JOB_ID': '7099018',
 'SLURM_JOB_NAME': 'train.job',
 'SLURM_JOB_NODELIST': 'eu-g4-015',
 'SLURM_JOB_NUM_NODES': '1',
 'SLURM_JOB_PARTITION': 'gpuhe.120h',
 'SLURM_JOB_QOS': 'es_scara/gpuhe',
 'SLURM_JOB_UID': '575154',
 'SLURM_JOB_USER': 'user',
 'SLURM_LAUNCH_NODE_IPADDR': '10.205.100.15',
 'SLURM_LOCALID': '0',
 'SLURM_MEM_PER_CPU': '32768',
 'SLURM_MPI_TYPE': 'pmi2',
 'SLURM_NNODES': '1',
 'SLURM_NODEID': '0',
 'SLURM_NODELIST': 'eu-g4-015',
 'SLURM_NODE_ALIASES': '(null)',
 'SLURM_NPROCS': '1',
 'SLURM_NTASKS': '1',
 'SLURM_NTASKS_PER_NODE': '2',
 'SLURM_PRIO_PROCESS': '0',
 'SLURM_PROCID': '0',
 'SLURM_SCRIPT_CONTEXT': 'prolog_task',
 'SLURM_SRUN_COMM_HOST': '10.205.100.15',
 'SLURM_SRUN_COMM_PORT': '40015',
 'SLURM_STEPID': '0',
 'SLURM_STEP_GPUS': '2,3',
 'SLURM_STEP_ID': '0',
 'SLURM_STEP_LAUNCHER_PORT': '40015',
 'SLURM_STEP_NODELIST': 'eu-g4-015',
 'SLURM_STEP_NUM_NODES': '1',
 'SLURM_STEP_NUM_TASKS': '1',
 'SLURM_STEP_TASKS_PER_NODE': '1',
 'SLURM_SUBMIT_DIR': '/cluster/project/lab/me/code/rnn-st/scripts/slurm',
 'SLURM_SUBMIT_HOST': 'eu-login-41',
 'SLURM_TASKS_PER_NODE': '1',
 'SLURM_TASK_PID': '122949',
 'SLURM_TOPOLOGY_ADDR': '.cluster_gpuhe.eu-g4-015',
 'SLURM_TOPOLOGY_ADDR_PATTERN': 'switch.switch.node',
 'SLURM_UMASK': '0027',
 'SLURM_WORKING_CLUSTER': 'cluster:10.205.212.30:6817:9728:109',
 'SRUN_DEBUG': '3',
 'SSH_AGENT_PID': '12110',
 'SSH_AUTH_SOCK': '/tmp/ssh-kaM5RUgK0O4U/agent.12108',
 'SSH_CLIENT': '10.6.209.217 33612 22',
 'SSH_CONNECTION': '10.6.208.201 54404 129.132.93.116 22',
 'SSH_TTY': '/dev/pts/8',
 'TERM': 'tmux-256color',
 'TERM_PROGRAM': 'tmux',
 'TERM_PROGRAM_VERSION': '3.2a',
 'TMOUT': '86400',
 'TMPDIR': '/scratch/tmp.7099018.user',
 'TMUX': '/tmp/tmux-575154/default,12176,0',
 'TMUX_PANE': '%15',
 'TMUX_PLUGIN_MANAGER_PATH': '/cluster/home/user/.tmux/plugins/',
 'TMUX_ROOT': '/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf',
 'USER': 'user',
 'VECLIB_MAXIMUM_THREADS': '1',
 'WANDB_REQUIRE_SERVICE': 'True',
 'XDG_RUNTIME_DIR': '/run/user/575154',
 'XDG_SESSION_ID': '4695',
 'XML_CATALOG_FILES': '/cluster/project/lab/me/programs/mambaforge/envs/rnn-st/etc/xml/catalog',
 'ZSH': '/cluster/home/user/.myconfig/zsh/oh-my-zsh',
 'ZSH_ROOT': '/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa',
 'ZSH_TMUX_CONFIG': '/cluster/home/user/.tmux.conf',
 'ZSH_TMUX_TERM': 'screen-256color',
 '_': '/cluster/apps/slurm/bin/srun',
 '_CE_CONDA': '',
 '_CE_M': '',
 '_LMFILES_': '/cluster/apps/lmodules/Core/StdEnv.lua:/cluster/apps/lmodules/Core/gcc/4.8.5.lua:/cluster/apps/lmodules/Compiler/gcc/4.8.5/zsh/5.8.lua:/cluster/apps/lmodules/Compiler/gcc/4.8.5/tmux/3.2a.lua:/cluster/apps/lmodules/Core/proxy.lua:/cluster/apps/lmodules/Compiler/gcc/4.8.5/nccl/2.11.4-1.lua',
 '_ModuleTable001_': 'X01vZHVsZVRhYmxlXz17WyJNVHZlcnNpb24iXT0zLFsiY19yZWJ1aWxkVGltZSJdPWZhbHNlLFsiY19zaG9ydFRpbWUiXT1mYWxzZSxkZXB0aFQ9e30sZmFtaWx5PXtbImNvbXBpbGVyIl09ImdjYyIsfSxtVD17U3RkRW52PXtbImZuIl09Ii9jbHVzdGVyL2FwcHMvbG1vZHVsZXMvQ29yZS9TdGRFbnYubHVhIixbImZ1bGxOYW1lIl09IlN0ZEVudiIsWyJsb2FkT3JkZXIiXT0xLHByb3BUPXt9LFsic3RhY2tEZXB0aCJdPTAsWyJzdGF0dXMiXT0iYWN0aXZlIixbInVzZXJOYW1lIl09IlN0ZEVudiIsfSxldGhfcHJveHk9e1siZm4iXT0iL2NsdXN0ZXIvYXBwcy9sbW9kdWxlcy9Db3JlL2V0aF9wcm94eS5sdWEiLFsiZnVsbE5hbWUiXT0iZXRoX3Byb3h5IixbImxvYWRPcmRlciJd',
 '_ModuleTable002_': 'PTUscHJvcFQ9e30sWyJzdGFja0RlcHRoIl09MCxbInN0YXR1cyJdPSJhY3RpdmUiLFsidXNlck5hbWUiXT0iZXRoX3Byb3h5Iix9LGdjYz17WyJmbiJdPSIvY2x1c3Rlci9hcHBzL2xtb2R1bGVzL0NvcmUvZ2NjLzQuOC41Lmx1YSIsWyJmdWxsTmFtZSJdPSJnY2MvNC44LjUiLFsibG9hZE9yZGVyIl09Mixwcm9wVD17fSxbInN0YWNrRGVwdGgiXT0wLFsic3RhdHVzIl09ImFjdGl2ZSIsWyJ1c2VyTmFtZSJdPSJnY2MvNC44LjUiLH0sbmNjbD17WyJmbiJdPSIvY2x1c3Rlci9hcHBzL2xtb2R1bGVzL0NvbXBpbGVyL2djYy80LjguNS9uY2NsLzIuMTEuNC0xLmx1YSIsWyJmdWxsTmFtZSJdPSJuY2NsLzIuMTEuNC0xIixbImxvYWRPcmRlciJdPTYscHJvcFQ9e30sWyJzdGFja0Rl',
 '_ModuleTable003_': 'cHRoIl09MCxbInN0YXR1cyJdPSJhY3RpdmUiLFsidXNlck5hbWUiXT0ibmNjbCIsfSx0bXV4PXtbImZuIl09Ii9jbHVzdGVyL2FwcHMvbG1vZHVsZXMvQ29tcGlsZXIvZ2NjLzQuOC41L3RtdXgvMy4yYS5sdWEiLFsiZnVsbE5hbWUiXT0idG11eC8zLjJhIixbImxvYWRPcmRlciJdPTQscHJvcFQ9e30sWyJzdGFja0RlcHRoIl09MCxbInN0YXR1cyJdPSJhY3RpdmUiLFsidXNlck5hbWUiXT0idG11eCIsfSx6c2g9e1siZm4iXT0iL2NsdXN0ZXIvYXBwcy9sbW9kdWxlcy9Db21waWxlci9nY2MvNC44LjUvenNoLzUuOC5sdWEiLFsiZnVsbE5hbWUiXT0ienNoLzUuOCIsWyJsb2FkT3JkZXIiXT0zLHByb3BUPXt9LFsic3RhY2tEZXB0aCJdPTAsWyJzdGF0dXMiXT0iYWN0aXZlIixb',
 '_ModuleTable004_': 'InVzZXJOYW1lIl09InpzaCIsfSx9LG1wYXRoQT17Ii9jbHVzdGVyL2FwcHMvbG1vZHVsZXMvQ29tcGlsZXIvZ2NjLzQuOC41IiwiL2NsdXN0ZXIvYXBwcy9sbW9kdWxlcy9MaW51eCIsIi9jbHVzdGVyL2FwcHMvbG1vZHVsZXMvQ29yZSIsfSxbInN5c3RlbUJhc2VNUEFUSCJdPSIvY2x1c3Rlci9hcHBzL2xtb2R1bGVzL0xpbnV4Oi9jbHVzdGVyL2FwcHMvbG1vZHVsZXMvQ29yZSIsfQ==',
 '_ModuleTable_Sz_': '4',
 '_ZSH_TMUX_FIXED_CONFIG': '/cluster/home/user/.myconfig/zsh/oh-my-zsh/plugins/tmux/tmux.extra.conf',
 '__Init_Default_Modules': '1',
 '__LMOD_REF_COUNT_CMAKE_PREFIX_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3:1;/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf:1;/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa:1',
 '__LMOD_REF_COUNT_CPATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/include:1',
 '__LMOD_REF_COUNT_LD_LIBRARY_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/lib:1;/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/lib:1;/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/lib:1',
 '__LMOD_REF_COUNT_LIBRARY_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/lib:1;/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/lib:1',
 '__LMOD_REF_COUNT_LOADEDMODULES': 'StdEnv:1;gcc/4.8.5:1;zsh/5.8:1;tmux/3.2a:1;proxy:1;nccl/2.11.4-1:1',
 '__LMOD_REF_COUNT_MANPATH': '/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf/share/man:1;/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/share/man:1;/cluster/apps/sfos/share/man/man1:1;/cluster/apps/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod/share/man:1;/cluster/apps/lsf/10.1/man:1',
 '__LMOD_REF_COUNT_PATH': '/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf/bin:1;/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/bin:1;/cluster/apps/local:2;/cluster/apps/sfos/bin:1;/cluster/apps/slurm/bin:1;/usr/lib64/qt-3.3/bin:1;/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/bin:1;/usr/local/bin:1;/usr/bin:1;/usr/local/sbin:1;/usr/sbin:1;/cluster/home/user/.local/bin:1;/cluster/home/user/bin:1',
 '__LMOD_REF_COUNT_PKG_CONFIG_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/lib/pkgconfig:1',
 '__LMOD_REF_COUNT__LMFILES_': '/cluster/apps/lmodules/Core/StdEnv.lua:1;/cluster/apps/lmodules/Core/gcc/4.8.5.lua:1;/cluster/apps/lmodules/Compiler/gcc/4.8.5/zsh/5.8.lua:1;/cluster/apps/lmodules/Compiler/gcc/4.8.5/tmux/3.2a.lua:1;/cluster/apps/lmodules/Core/proxy.lua:1;/cluster/apps/lmodules/Compiler/gcc/4.8.5/nccl/2.11.4-1.lua:1',
 'ftp_proxy': 'http://blabla:3128',
 'http_proxy': 'http://blabla:3128',
 'https_proxy': 'http://blabla:3128',
 'no_proxy': 'api.wandb.ai,app.neptune.ai',
 'tmux_version': '3.2',
 'xml_catalog_files_libxslt': ''}

From a quick scan, SLURM_NTASKS is 2 (as expected) on the working server but 1 on the problematic server. I don't know yet why this is the case, because the only place I specify a task count is the sbatch script, where it is set to 2. Just my first observation so far.
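
A comparison like this is easier to repeat if only the task-related Slurm variables of the two dumps are diffed. The following is a minimal sketch, assuming both os.environ outputs have been saved as plain dicts; the helper name diff_env and the trimmed sample dicts are illustrative and not the full dumps from this thread.

```python
def diff_env(working, problematic, prefix="SLURM_N"):
    """Return {name: (working_value, problematic_value)} for variables that differ.

    Only names starting with `prefix` are compared; a variable that is
    missing on one side is reported as None.
    """
    names = {k for k in (*working, *problematic) if k.startswith(prefix)}
    return {
        name: (working.get(name), problematic.get(name))
        for name in sorted(names)
        if working.get(name) != problematic.get(name)
    }


# Hypothetical excerpts, trimmed to three variables each for illustration.
working_env = {"SLURM_NNODES": "1", "SLURM_NTASKS": "2", "SLURM_NTASKS_PER_NODE": "2"}
problematic_env = {"SLURM_NNODES": "1", "SLURM_NTASKS": "1", "SLURM_NTASKS_PER_NODE": "2"}

print(diff_env(working_env, problematic_env))  # {'SLURM_NTASKS': ('2', '1')}
```

Passing prefix="SLURM_" would widen the comparison to every Slurm variable, at the cost of noise from job IDs and ports that always differ.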

@magehrig

I found a workaround. Strangely, when I additionally set ntasks=NUM_GPUS, DDP works as expected. In this case, on the problematic cluster I get SLURM_NTASKS=NUM_GPUS and the script runs correctly. So the augmented sbatch script is:

#!/usr/bin/env bash
#SBATCH --parsable
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=16G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=3
#SBATCH --gpus=rtx_3090:2
#SBATCH --output=/some/path/%j.out

module load nccl
source /some/path/conda.sh
conda activate myenv
srun python myscript.py ...
conda deactivate

No idea why ntasks-per-node is not sufficient.

@magehrig

Got a response from cluster support. Apparently they still need to configure Slurm so that:
#tasks = #nodes * #ntasks-per-node.

TL;DR: it is a Slurm config issue, not PL related.
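
Until the cluster is reconfigured, the relation above can be checked from inside the job before training starts. This is a rough sketch, assuming SLURM_NNODES, SLURM_NTASKS_PER_NODE and SLURM_NTASKS are all set and hold plain integers (as they do in the dump earlier in this thread); the function name check_slurm_task_count is made up for illustration.

```python
import os


def check_slurm_task_count(env=None):
    """Compare SLURM_NTASKS with SLURM_NNODES * SLURM_NTASKS_PER_NODE.

    Returns (ok, expected, actual). Raises KeyError if a variable is missing.
    """
    env = os.environ if env is None else env
    expected = int(env["SLURM_NNODES"]) * int(env["SLURM_NTASKS_PER_NODE"])
    actual = int(env["SLURM_NTASKS"])
    return actual == expected, expected, actual


# Values taken from the problematic server's dump earlier in this thread.
ok, expected, actual = check_slurm_task_count(
    {"SLURM_NNODES": "1", "SLURM_NTASKS_PER_NODE": "2", "SLURM_NTASKS": "1"}
)
print(ok, expected, actual)  # False 2 1
```

A mismatch here suggests adding an explicit --ntasks to the sbatch script, as in the workaround above.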

@zhjohnchan

For SLURM users (using the interactive mode), this could be an issue.

Try to downgrade the pytorch-lightning: pip install pytorch_lightning==1.7.7.

@felix-ky

got the same problem

try

unset KUBERNETES_PORT

it works for me... I spent one night and one morning on it... TT The same problem is discussed in #5254

many thanks, it really works!!!
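
If changing the shell environment is inconvenient, the same effect can probably be had from inside the training script. A minimal sketch, assuming the variable only needs to be gone before the Trainer is constructed (an assumption, not something confirmed in this thread); the helper name unset_env_var is illustrative.

```python
import os


def unset_env_var(name, env=None):
    """Remove `name` from the environment mapping; return its old value or None."""
    env = os.environ if env is None else env
    return env.pop(name, None)


# Assumption: this runs at the top of the script, before the Trainer is built.
previous = unset_env_var("KUBERNETES_PORT")
print("KUBERNETES_PORT was", "not set" if previous is None else "removed")
```

This only affects the current Python process and the child processes it starts afterwards; the parent shell keeps its value.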

@awaelchli
Contributor

@superhero-7 Were you able to resolve the issue on your end? I couldn't figure out whether this is an issue with Lightning or not.

@jasonkena

For SLURM users (using the interactive mode), this could be an issue.

Try to downgrade the pytorch-lightning: pip install pytorch_lightning==1.7.7.

I ran into the same issue. Seeing #5225 (comment) and the docs, I solved it by adding os.environ["SLURM_JOB_NAME"]="bash" to my script.
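
For anyone unsure where that line goes, here is a sketch of the placement. It assumes the variable has to be set before the Trainer is created, and the commented-out Trainer arguments are simply copied from the original report.

```python
import os

# Assumption: this must run before the Trainer is created, so it sits at the
# very top of the training script. "bash" marks the job as an interactive one.
os.environ["SLURM_JOB_NAME"] = "bash"

# ... build the model and the Trainer afterwards, for example:
# trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp")
# trainer.fit(model, data)
```

Note that this overrides the job name only inside the Python process; the name Slurm shows in the queue is unchanged.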

@awaelchli
Contributor

@jasonkena That'll work yes. Here is the proper docs link for this. The other users who commented here had an issue with the kubernetes environment variable and I fixed this in the linked PR: #18137

@Master-cai

@awaelchli Thanks for your work! I'm using a Kubernetes environment, and unset KUBERNETES_PORT works for me when I only use one node. However, I need to use multiple nodes, so I can't unset KUBERNETES_PORT.
Which version includes this patch? I'm using pytorch-lightning 1.9.0; is there a quick solution that doesn't require upgrading pytorch-lightning?

@Master-cai

reply to myself:
Following PR #18137, I manually modified the two lines in the source code, and this works for me.

@phrasenmaeher

For those using SLURM, don't forget to use srun python ... instead of plain python ... to start your job (taking into account the previous settings, of course).
