Training executes the first epoch but then stops by itself, how come? #13

Open
vinevix opened this issue Jan 10, 2023 · 9 comments

Comments

vinevix commented Jan 10, 2023

No description provided.

@buxiangzhiren (Owner)
Can you show more details (printed logs)?

vinevix commented Jan 10, 2023

These are the output logs:
RANK and WORLD_SIZE in environ: 0/4
RANK and WORLD_SIZE in environ: 1/4
RANK and WORLD_SIZE in environ: 3/4
RANK and WORLD_SIZE in environ: 2/4
Train both prefix and GPT
196,220,948 total parameters
No decay params: ['bos_embedding', 'gpt.transformer.h.0.ln_1.linear.bias', 'gpt.transformer.h.0.attn.c_attn.bias', 'gpt.transformer.h.0.attn.c_proj.bias', 'gpt.transformer.h.0.ln_2.linear.bias', 'gpt.transformer.h.0.crossattention.c_attn.bias', 'gpt.transformer.h.0.crossattention.q_attn.bias', 'gpt.transformer.h.0.crossattention.c_proj.bias', 'gpt.transformer.h.0.ln_cross_attn.linear.bias', 'gpt.transformer.h.0.mlp.c_fc.bias', 'gpt.transformer.h.0.mlp.c_proj.bias', 'gpt.transformer.h.1.ln_1.linear.bias', 'gpt.transformer.h.1.attn.c_attn.bias', 'gpt.transformer.h.1.attn.c_proj.bias', 'gpt.transformer.h.1.ln_2.linear.bias', 'gpt.transformer.h.1.crossattention.c_attn.bias', 'gpt.transformer.h.1.crossattention.q_attn.bias', 'gpt.transformer.h.1.crossattention.c_proj.bias', 'gpt.transformer.h.1.ln_cross_attn.linear.bias', 'gpt.transformer.h.1.mlp.c_fc.bias', 'gpt.transformer.h.1.mlp.c_proj.bias', 'gpt.transformer.h.2.ln_1.linear.bias', 'gpt.transformer.h.2.attn.c_attn.bias', 'gpt.transformer.h.2.attn.c_proj.bias', 'gpt.transformer.h.2.ln_2.linear.bias', 'gpt.transformer.h.2.crossattention.c_attn.bias', 'gpt.transformer.h.2.crossattention.q_attn.bias', 'gpt.transformer.h.2.crossattention.c_proj.bias', 'gpt.transformer.h.2.ln_cross_attn.linear.bias', 'gpt.transformer.h.2.mlp.c_fc.bias', 'gpt.transformer.h.2.mlp.c_proj.bias', 'gpt.transformer.h.3.ln_1.linear.bias', 'gpt.transformer.h.3.attn.c_attn.bias', 'gpt.transformer.h.3.attn.c_proj.bias', 'gpt.transformer.h.3.ln_2.linear.bias', 'gpt.transformer.h.3.crossattention.c_attn.bias', 'gpt.transformer.h.3.crossattention.q_attn.bias', 'gpt.transformer.h.3.crossattention.c_proj.bias', 'gpt.transformer.h.3.ln_cross_attn.linear.bias', 'gpt.transformer.h.3.mlp.c_fc.bias', 'gpt.transformer.h.3.mlp.c_proj.bias', 'gpt.transformer.h.4.ln_1.linear.bias', 'gpt.transformer.h.4.attn.c_attn.bias', 'gpt.transformer.h.4.attn.c_proj.bias', 'gpt.transformer.h.4.ln_2.linear.bias', 'gpt.transformer.h.4.crossattention.c_attn.bias', 
'gpt.transformer.h.4.crossattention.q_attn.bias', 'gpt.transformer.h.4.crossattention.c_proj.bias', 'gpt.transformer.h.4.ln_cross_attn.linear.bias', 'gpt.transformer.h.4.mlp.c_fc.bias', 'gpt.transformer.h.4.mlp.c_proj.bias', 'gpt.transformer.h.5.ln_1.linear.bias', 'gpt.transformer.h.5.attn.c_attn.bias', 'gpt.transformer.h.5.attn.c_proj.bias', 'gpt.transformer.h.5.ln_2.linear.bias', 'gpt.transformer.h.5.crossattention.c_attn.bias', 'gpt.transformer.h.5.crossattention.q_attn.bias', 'gpt.transformer.h.5.crossattention.c_proj.bias', 'gpt.transformer.h.5.ln_cross_attn.linear.bias', 'gpt.transformer.h.5.mlp.c_fc.bias', 'gpt.transformer.h.5.mlp.c_proj.bias', 'gpt.transformer.h.6.ln_1.linear.bias', 'gpt.transformer.h.6.attn.c_attn.bias', 'gpt.transformer.h.6.attn.c_proj.bias', 'gpt.transformer.h.6.ln_2.linear.bias', 'gpt.transformer.h.6.crossattention.c_attn.bias', 'gpt.transformer.h.6.crossattention.q_attn.bias', 'gpt.transformer.h.6.crossattention.c_proj.bias', 'gpt.transformer.h.6.ln_cross_attn.linear.bias', 'gpt.transformer.h.6.mlp.c_fc.bias', 'gpt.transformer.h.6.mlp.c_proj.bias', 'gpt.transformer.h.7.ln_1.linear.bias', 'gpt.transformer.h.7.attn.c_attn.bias', 'gpt.transformer.h.7.attn.c_proj.bias', 'gpt.transformer.h.7.ln_2.linear.bias', 'gpt.transformer.h.7.crossattention.c_attn.bias', 'gpt.transformer.h.7.crossattention.q_attn.bias', 'gpt.transformer.h.7.crossattention.c_proj.bias', 'gpt.transformer.h.7.ln_cross_attn.linear.bias', 'gpt.transformer.h.7.mlp.c_fc.bias', 'gpt.transformer.h.7.mlp.c_proj.bias', 'gpt.transformer.h.8.ln_1.linear.bias', 'gpt.transformer.h.8.attn.c_attn.bias', 'gpt.transformer.h.8.attn.c_proj.bias', 'gpt.transformer.h.8.ln_2.linear.bias', 'gpt.transformer.h.8.crossattention.c_attn.bias', 'gpt.transformer.h.8.crossattention.q_attn.bias', 'gpt.transformer.h.8.crossattention.c_proj.bias', 'gpt.transformer.h.8.ln_cross_attn.linear.bias', 'gpt.transformer.h.8.mlp.c_fc.bias', 'gpt.transformer.h.8.mlp.c_proj.bias', 
'gpt.transformer.h.9.ln_1.linear.bias', 'gpt.transformer.h.9.attn.c_attn.bias', 'gpt.transformer.h.9.attn.c_proj.bias', 'gpt.transformer.h.9.ln_2.linear.bias', 'gpt.transformer.h.9.crossattention.c_attn.bias', 'gpt.transformer.h.9.crossattention.q_attn.bias', 'gpt.transformer.h.9.crossattention.c_proj.bias', 'gpt.transformer.h.9.ln_cross_attn.linear.bias', 'gpt.transformer.h.9.mlp.c_fc.bias', 'gpt.transformer.h.9.mlp.c_proj.bias', 'gpt.transformer.h.10.ln_1.linear.bias', 'gpt.transformer.h.10.attn.c_attn.bias', 'gpt.transformer.h.10.attn.c_proj.bias', 'gpt.transformer.h.10.ln_2.linear.bias', 'gpt.transformer.h.10.crossattention.c_attn.bias', 'gpt.transformer.h.10.crossattention.q_attn.bias', 'gpt.transformer.h.10.crossattention.c_proj.bias', 'gpt.transformer.h.10.ln_cross_attn.linear.bias', 'gpt.transformer.h.10.mlp.c_fc.bias', 'gpt.transformer.h.10.mlp.c_proj.bias', 'gpt.transformer.h.11.ln_1.linear.bias', 'gpt.transformer.h.11.attn.c_attn.bias', 'gpt.transformer.h.11.attn.c_proj.bias', 'gpt.transformer.h.11.ln_2.linear.bias', 'gpt.transformer.h.11.crossattention.c_attn.bias', 'gpt.transformer.h.11.crossattention.q_attn.bias', 'gpt.transformer.h.11.crossattention.c_proj.bias', 'gpt.transformer.h.11.ln_cross_attn.linear.bias', 'gpt.transformer.h.11.mlp.c_fc.bias', 'gpt.transformer.h.11.mlp.c_proj.bias', 'gpt.transformer.ln_f.weight', 'gpt.transformer.ln_f.bias', 'len_head.model.0.bias', 'len_head.model.2.bias', 'clip_project.model.0.bias', 'clip_project.model.2.bias']
Has decay params: ['pad_embedding', 'gpt.transformer.wte.weight', 'gpt.transformer.wpe.weight', 'gpt.transformer.h.0.ln_1.linear.weight', 'gpt.transformer.h.0.attn.c_attn.weight', 'gpt.transformer.h.0.attn.c_proj.weight', 'gpt.transformer.h.0.ln_2.linear.weight', 'gpt.transformer.h.0.crossattention.c_attn.weight', 'gpt.transformer.h.0.crossattention.q_attn.weight', 'gpt.transformer.h.0.crossattention.c_proj.weight', 'gpt.transformer.h.0.ln_cross_attn.linear.weight', 'gpt.transformer.h.0.mlp.c_fc.weight', 'gpt.transformer.h.0.mlp.c_proj.weight', 'gpt.transformer.h.1.ln_1.linear.weight', 'gpt.transformer.h.1.attn.c_attn.weight', 'gpt.transformer.h.1.attn.c_proj.weight', 'gpt.transformer.h.1.ln_2.linear.weight', 'gpt.transformer.h.1.crossattention.c_attn.weight', 'gpt.transformer.h.1.crossattention.q_attn.weight', 'gpt.transformer.h.1.crossattention.c_proj.weight', 'gpt.transformer.h.1.ln_cross_attn.linear.weight', 'gpt.transformer.h.1.mlp.c_fc.weight', 'gpt.transformer.h.1.mlp.c_proj.weight', 'gpt.transformer.h.2.ln_1.linear.weight', 'gpt.transformer.h.2.attn.c_attn.weight', 'gpt.transformer.h.2.attn.c_proj.weight', 'gpt.transformer.h.2.ln_2.linear.weight', 'gpt.transformer.h.2.crossattention.c_attn.weight', 'gpt.transformer.h.2.crossattention.q_attn.weight', 'gpt.transformer.h.2.crossattention.c_proj.weight', 'gpt.transformer.h.2.ln_cross_attn.linear.weight', 'gpt.transformer.h.2.mlp.c_fc.weight', 'gpt.transformer.h.2.mlp.c_proj.weight', 'gpt.transformer.h.3.ln_1.linear.weight', 'gpt.transformer.h.3.attn.c_attn.weight', 'gpt.transformer.h.3.attn.c_proj.weight', 'gpt.transformer.h.3.ln_2.linear.weight', 'gpt.transformer.h.3.crossattention.c_attn.weight', 'gpt.transformer.h.3.crossattention.q_attn.weight', 'gpt.transformer.h.3.crossattention.c_proj.weight', 'gpt.transformer.h.3.ln_cross_attn.linear.weight', 'gpt.transformer.h.3.mlp.c_fc.weight', 'gpt.transformer.h.3.mlp.c_proj.weight', 'gpt.transformer.h.4.ln_1.linear.weight', 'gpt.transformer.h.4.attn.c_attn.weight', 
'gpt.transformer.h.4.attn.c_proj.weight', 'gpt.transformer.h.4.ln_2.linear.weight', 'gpt.transformer.h.4.crossattention.c_attn.weight', 'gpt.transformer.h.4.crossattention.q_attn.weight', 'gpt.transformer.h.4.crossattention.c_proj.weight', 'gpt.transformer.h.4.ln_cross_attn.linear.weight', 'gpt.transformer.h.4.mlp.c_fc.weight', 'gpt.transformer.h.4.mlp.c_proj.weight', 'gpt.transformer.h.5.ln_1.linear.weight', 'gpt.transformer.h.5.attn.c_attn.weight', 'gpt.transformer.h.5.attn.c_proj.weight', 'gpt.transformer.h.5.ln_2.linear.weight', 'gpt.transformer.h.5.crossattention.c_attn.weight', 'gpt.transformer.h.5.crossattention.q_attn.weight', 'gpt.transformer.h.5.crossattention.c_proj.weight', 'gpt.transformer.h.5.ln_cross_attn.linear.weight', 'gpt.transformer.h.5.mlp.c_fc.weight', 'gpt.transformer.h.5.mlp.c_proj.weight', 'gpt.transformer.h.6.ln_1.linear.weight', 'gpt.transformer.h.6.attn.c_attn.weight', 'gpt.transformer.h.6.attn.c_proj.weight', 'gpt.transformer.h.6.ln_2.linear.weight', 'gpt.transformer.h.6.crossattention.c_attn.weight', 'gpt.transformer.h.6.crossattention.q_attn.weight', 'gpt.transformer.h.6.crossattention.c_proj.weight', 'gpt.transformer.h.6.ln_cross_attn.linear.weight', 'gpt.transformer.h.6.mlp.c_fc.weight', 'gpt.transformer.h.6.mlp.c_proj.weight', 'gpt.transformer.h.7.ln_1.linear.weight', 'gpt.transformer.h.7.attn.c_attn.weight', 'gpt.transformer.h.7.attn.c_proj.weight', 'gpt.transformer.h.7.ln_2.linear.weight', 'gpt.transformer.h.7.crossattention.c_attn.weight', 'gpt.transformer.h.7.crossattention.q_attn.weight', 'gpt.transformer.h.7.crossattention.c_proj.weight', 'gpt.transformer.h.7.ln_cross_attn.linear.weight', 'gpt.transformer.h.7.mlp.c_fc.weight', 'gpt.transformer.h.7.mlp.c_proj.weight', 'gpt.transformer.h.8.ln_1.linear.weight', 'gpt.transformer.h.8.attn.c_attn.weight', 'gpt.transformer.h.8.attn.c_proj.weight', 'gpt.transformer.h.8.ln_2.linear.weight', 'gpt.transformer.h.8.crossattention.c_attn.weight', 
'gpt.transformer.h.8.crossattention.q_attn.weight', 'gpt.transformer.h.8.crossattention.c_proj.weight', 'gpt.transformer.h.8.ln_cross_attn.linear.weight', 'gpt.transformer.h.8.mlp.c_fc.weight', 'gpt.transformer.h.8.mlp.c_proj.weight', 'gpt.transformer.h.9.ln_1.linear.weight', 'gpt.transformer.h.9.attn.c_attn.weight', 'gpt.transformer.h.9.attn.c_proj.weight', 'gpt.transformer.h.9.ln_2.linear.weight', 'gpt.transformer.h.9.crossattention.c_attn.weight', 'gpt.transformer.h.9.crossattention.q_attn.weight', 'gpt.transformer.h.9.crossattention.c_proj.weight', 'gpt.transformer.h.9.ln_cross_attn.linear.weight', 'gpt.transformer.h.9.mlp.c_fc.weight', 'gpt.transformer.h.9.mlp.c_proj.weight', 'gpt.transformer.h.10.ln_1.linear.weight', 'gpt.transformer.h.10.attn.c_attn.weight', 'gpt.transformer.h.10.attn.c_proj.weight', 'gpt.transformer.h.10.ln_2.linear.weight', 'gpt.transformer.h.10.crossattention.c_attn.weight', 'gpt.transformer.h.10.crossattention.q_attn.weight', 'gpt.transformer.h.10.crossattention.c_proj.weight', 'gpt.transformer.h.10.ln_cross_attn.linear.weight', 'gpt.transformer.h.10.mlp.c_fc.weight', 'gpt.transformer.h.10.mlp.c_proj.weight', 'gpt.transformer.h.11.ln_1.linear.weight', 'gpt.transformer.h.11.attn.c_attn.weight', 'gpt.transformer.h.11.attn.c_proj.weight', 'gpt.transformer.h.11.ln_2.linear.weight', 'gpt.transformer.h.11.crossattention.c_attn.weight', 'gpt.transformer.h.11.crossattention.q_attn.weight', 'gpt.transformer.h.11.crossattention.c_proj.weight', 'gpt.transformer.h.11.ln_cross_attn.linear.weight', 'gpt.transformer.h.11.mlp.c_fc.weight', 'gpt.transformer.h.11.mlp.c_proj.weight', 'len_head.model.0.weight', 'len_head.model.2.weight', 'clip_project.model.0.weight', 'clip_project.model.2.weight']
Data size is 566747
Epoch 0

Training epoch 0
Evaling epoch 0
loading annotations into memory...
0:00:00.513341
creating index...
index created!
Loading and preparing results...
DONE (t=0.02s)
creating index...
index created!
tokenization...
RANK and WORLD_SIZE in environ: 1/4
RANK and WORLD_SIZE in environ: 3/4
RANK and WORLD_SIZE in environ: 0/4
RANK and WORLD_SIZE in environ: 2/4

These are the last lines of the error log:
Traceback (most recent call last):
File "/home/v.silvio/diffusion-image-captioning-main/Paper2/DDCap-main/train.py", line 885, in
main(args)
File "/home/v.silvio/diffusion-image-captioning-main/Paper2/DDCap-main/train.py", line 850, in main
result = val(model, epoch, val_dataloader, args)
File "/home/v.silvio/.conda/envs/DDCap/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/v.silvio/diffusion-image-captioning-main/Paper2/DDCap-main/train.py", line 680, in val
result = evaluate_on_coco_caption(result_all,
File "/home/v.silvio/diffusion-image-captioning-main/Paper2/DDCap-main/misc.py", line 463, in evaluate_on_coco_caption
cocoEval.evaluate()
File "/home/v.silvio/diffusion-image-captioning-main/Paper2/DDCap-main/captioneval/coco_caption/pycocoevalcap/eval.py", line 41, in evaluate
self.tokenize()
File "/home/v.silvio/diffusion-image-captioning-main/Paper2/DDCap-main/captioneval/coco_caption/pycocoevalcap/eval.py", line 37, in tokenize
self.gts = tokenizer.tokenize(gts)
File "/home/v.silvio/diffusion-image-captioning-main/Paper2/DDCap-main/captioneval/coco_caption/pycocoevalcap/tokenizer/ptbtokenizer.py", line 51, in tokenize
p_tokenizer = subprocess.Popen(cmd, cwd=path_to_jar_dirname,
File "/home/v.silvio/.conda/envs/DDCap/lib/python3.9/subprocess.py", line 951, in init
self._execute_child(args, executable, preexec_fn, close_fds,
File "/home/v.silvio/.conda/envs/DDCap/lib/python3.9/subprocess.py", line 1821, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'java'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 88609) of binary: /home/v.silvio/.conda/envs/DDCap/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3]
role_ranks=[0, 1, 2, 3]
global_ranks=[0, 1, 2, 3]
role_world_sizes=[4, 4, 4, 4]
global_world_sizes=[4, 4, 4, 4]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_jwt9x5s5/none_kh_3ngxe/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_jwt9x5s5/none_kh_3ngxe/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_jwt9x5s5/none_kh_3ngxe/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_jwt9x5s5/none_kh_3ngxe/attempt_1/3/error.json
2023-01-10 16:48:21.757112: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-10 16:48:21.757169: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-10 16:48:21.757215: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-10 16:48:21.757308: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-10 16:48:21.906546: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-01-10 16:48:21.922507: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-01-10 16:48:21.922507: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-01-10 16:48:21.934842: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-01-10 16:48:23.038368: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/share/sw/anaconda/anaconda3/lib:/opt/share/cuda/cuda-11.3/lib64
2023-01-10 16:48:23.038368: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/share/sw/anaconda/anaconda3/lib:/opt/share/cuda/cuda-11.3/lib64
2023-01-10 16:48:23.038933: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/share/sw/anaconda/anaconda3/lib:/opt/share/cuda/cuda-11.3/lib64
2023-01-10 16:48:23.039277: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/share/sw/anaconda/anaconda3/lib:/opt/share/cuda/cuda-11.3/lib64
2023-01-10 16:48:23.039505: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-01-10 16:48:23.039901: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-01-10 16:48:23.051735: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/share/sw/anaconda/anaconda3/lib:/opt/share/cuda/cuda-11.3/lib64
2023-01-10 16:48:23.051904: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/share/sw/anaconda/anaconda3/lib:/opt/share/cuda/cuda-11.3/lib64
2023-01-10 16:48:23.052289: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/share/sw/anaconda/anaconda3/lib:/opt/share/cuda/cuda-11.3/lib64
2023-01-10 16:48:23.052589: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/share/sw/anaconda/anaconda3/lib:/opt/share/cuda/cuda-11.3/lib64
2023-01-10 16:48:23.052744: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-01-10 16:48:23.053045: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

The code doesn't crash; it just seems to be waiting for something I can't figure out.
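(For context, the failure in the traceback can be reproduced in isolation: `subprocess.Popen` raises `FileNotFoundError` with errno 2 whenever the executable is not on `PATH`, which is exactly what the PTB tokenizer hit when spawning `java`. A minimal sketch, with a made-up command name:)

```python
import subprocess


def can_launch(cmd_argv):
    """Return True if the command can be spawned, False if the binary is missing."""
    try:
        proc = subprocess.Popen(
            cmd_argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        proc.wait()
        return True
    except FileNotFoundError:
        # Same error the PTBTokenizer reported for 'java'
        return False


print(can_launch(["definitely-not-a-real-binary-xyz"]))  # False: not on PATH
```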

@buxiangzhiren (Owner)
The logs already report "FileNotFoundError: [Errno 2] No such file or directory: 'java'". The problem is that you haven't installed Java.


buxiangzhiren commented Jan 10, 2023

The commands are "$ sudo apt-get update" and "$ sudo apt-get install openjdk-8-jdk". Then you can run "$ java -version" to verify it.


vinevix commented Jan 10, 2023

I already have Java:
java -version
openjdk version "1.8.0_322"
OpenJDK Runtime Environment (build 1.8.0_322-b06)
OpenJDK 64-Bit Server VM (build 25.322-b06, mixed mode)

but it keeps saying FileNotFoundError: [Errno 2] No such file or directory: 'java': 'java'

@buxiangzhiren (Owner)
I suppose there is some issue with your environment variables. Maybe you should check them, or add the path of "java" in train.py.

@buxiangzhiren (Owner)
Just like this:

import os

os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'
os.environ['JRE_HOME'] = f'{os.environ["JAVA_HOME"]}/jre'
os.environ['CLASSPATH'] = f'.:{os.environ["JAVA_HOME"]}/lib:{os.environ["JRE_HOME"]}/lib'
os.environ['PATH'] = f'{os.environ["JAVA_HOME"]}/bin:{os.environ["PATH"]}'
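(A quick way to sanity-check that prepending a directory to `PATH` like this really changes what `subprocess` will launch: `shutil.which` resolves bare command names by scanning `PATH` left to right. Illustrative sketch only; the temp directory and fake "java" script are made up for the demo.)

```python
import os
import shutil
import stat
import tempfile

# Create a fake 'java' executable in a temp directory (demo only).
tmp = tempfile.mkdtemp()
fake_java = os.path.join(tmp, "java")
with open(fake_java, "w") as f:
    f.write("#!/bin/sh\necho fake\n")
os.chmod(fake_java, os.stat(fake_java).st_mode | stat.S_IXUSR)

# Prepend its directory to PATH, same pattern as the snippet above.
os.environ["PATH"] = tmp + os.pathsep + os.environ["PATH"]

# PATH is scanned left to right, so lookup now resolves to the temp-dir 'java'.
print(shutil.which("java"))
```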


vinevix commented Jan 10, 2023

I have these entries in the jvm directory:
java
java-1.7.0-openjdk-1.7.0.261-2.6.22.2.el7_8.x86_64
java-1.8.0
java-1.8.0-openjdk
java-1.8.0-openjdk-1.8.0.322.b06-1.el7_9.x86_64
java-openjdk
jre
jre-1.7.0
jre-1.7.0-openjdk
jre-1.7.0-openjdk-1.7.0.261-2.6.22.2.el7_8.x86_64
jre-1.8.0
jre-1.8.0-openjdk
jre-1.8.0-openjdk-1.8.0.322.b06-1.el7_9.x86_64
jre-openjdk

Should I set 'java-1.8.0-openjdk-1.8.0.322.b06-1.el7_9.x86_64' as JAVA_HOME?
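(Rather than guessing among these entries, one option is to derive JAVA_HOME from whichever `java` binary is actually first on `PATH`. This is a hypothetical helper, not part of the repo; `os.path.realpath` follows the `/etc/alternatives`-style symlink chain typical of this CentOS 7 layout.)

```python
import os
import shutil


def guess_java_home():
    """Derive JAVA_HOME from the 'java' binary found on PATH, if any."""
    java = shutil.which("java")
    if java is None:
        return None
    # Follow symlinks (e.g. /usr/bin/java -> /etc/alternatives/java -> ...)
    # to the real binary, then strip the trailing /bin/java to get the root.
    real = os.path.realpath(java)
    return os.path.dirname(os.path.dirname(real))


home = guess_java_home()
if home is not None:
    os.environ["JAVA_HOME"] = home
```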

@buxiangzhiren (Owner)
> Should I set 'java-1.8.0-openjdk-1.8.0.322.b06-1.el7_9.x86_64' as JAVA_HOME?

I don't know why you don't have the directory "java-8-openjdk-amd64" if Java was installed successfully. You could try that one first.
