Training executes the first epoch but then stops by itself, how come? #13

Open
vinevix opened this issue Jan 10, 2023 · 9 comments

Comments

vinevix commented Jan 10, 2023

No description provided.

@buxiangzhiren (Owner)
Can you show more details (printed logs)?

vinevix commented Jan 10, 2023

These are the output logs:
RANK and WORLD_SIZE in environ: 0/4
RANK and WORLD_SIZE in environ: 1/4
RANK and WORLD_SIZE in environ: 3/4
RANK and WORLD_SIZE in environ: 2/4
Train both prefix and GPT
196,220,948 total parameters
No decay params: ['bos_embedding', 'gpt.transformer.h.0.ln_1.linear.bias', 'gpt.transformer.h.0.attn.c_attn.bias', 'gpt.transformer.h.0.attn.c_proj.bias', 'gpt.transformer.h.0.ln_2.linear.bias', 'gpt.transformer.h.0.crossattention.c_attn.bias', 'gpt.transformer.h.0.crossattention.q_attn.bias', 'gpt.transformer.h.0.crossattention.c_proj.bias', 'gpt.transformer.h.0.ln_cross_attn.linear.bias', 'gpt.transformer.h.0.mlp.c_fc.bias', 'gpt.transformer.h.0.mlp.c_proj.bias', 'gpt.transformer.h.1.ln_1.linear.bias', 'gpt.transformer.h.1.attn.c_attn.bias', 'gpt.transformer.h.1.attn.c_proj.bias', 'gpt.transformer.h.1.ln_2.linear.bias', 'gpt.transformer.h.1.crossattention.c_attn.bias', 'gpt.transformer.h.1.crossattention.q_attn.bias', 'gpt.transformer.h.1.crossattention.c_proj.bias', 'gpt.transformer.h.1.ln_cross_attn.linear.bias', 'gpt.transformer.h.1.mlp.c_fc.bias', 'gpt.transformer.h.1.mlp.c_proj.bias', 'gpt.transformer.h.2.ln_1.linear.bias', 'gpt.transformer.h.2.attn.c_attn.bias', 'gpt.transformer.h.2.attn.c_proj.bias', 'gpt.transformer.h.2.ln_2.linear.bias', 'gpt.transformer.h.2.crossattention.c_attn.bias', 'gpt.transformer.h.2.crossattention.q_attn.bias', 'gpt.transformer.h.2.crossattention.c_proj.bias', 'gpt.transformer.h.2.ln_cross_attn.linear.bias', 'gpt.transformer.h.2.mlp.c_fc.bias', 'gpt.transformer.h.2.mlp.c_proj.bias', 'gpt.transformer.h.3.ln_1.linear.bias', 'gpt.transformer.h.3.attn.c_attn.bias', 'gpt.transformer.h.3.attn.c_proj.bias', 'gpt.transformer.h.3.ln_2.linear.bias', 'gpt.transformer.h.3.crossattention.c_attn.bias', 'gpt.transformer.h.3.crossattention.q_attn.bias', 'gpt.transformer.h.3.crossattention.c_proj.bias', 'gpt.transformer.h.3.ln_cross_attn.linear.bias', 'gpt.transformer.h.3.mlp.c_fc.bias', 'gpt.transformer.h.3.mlp.c_proj.bias', 'gpt.transformer.h.4.ln_1.linear.bias', 'gpt.transformer.h.4.attn.c_attn.bias', 'gpt.transformer.h.4.attn.c_proj.bias', 'gpt.transformer.h.4.ln_2.linear.bias', 'gpt.transformer.h.4.crossattention.c_attn.bias', 
'gpt.transformer.h.4.crossattention.q_attn.bias', 'gpt.transformer.h.4.crossattention.c_proj.bias', 'gpt.transformer.h.4.ln_cross_attn.linear.bias', 'gpt.transformer.h.4.mlp.c_fc.bias', 'gpt.transformer.h.4.mlp.c_proj.bias', 'gpt.transformer.h.5.ln_1.linear.bias', 'gpt.transformer.h.5.attn.c_attn.bias', 'gpt.transformer.h.5.attn.c_proj.bias', 'gpt.transformer.h.5.ln_2.linear.bias', 'gpt.transformer.h.5.crossattention.c_attn.bias', 'gpt.transformer.h.5.crossattention.q_attn.bias', 'gpt.transformer.h.5.crossattention.c_proj.bias', 'gpt.transformer.h.5.ln_cross_attn.linear.bias', 'gpt.transformer.h.5.mlp.c_fc.bias', 'gpt.transformer.h.5.mlp.c_proj.bias', 'gpt.transformer.h.6.ln_1.linear.bias', 'gpt.transformer.h.6.attn.c_attn.bias', 'gpt.transformer.h.6.attn.c_proj.bias', 'gpt.transformer.h.6.ln_2.linear.bias', 'gpt.transformer.h.6.crossattention.c_attn.bias', 'gpt.transformer.h.6.crossattention.q_attn.bias', 'gpt.transformer.h.6.crossattention.c_proj.bias', 'gpt.transformer.h.6.ln_cross_attn.linear.bias', 'gpt.transformer.h.6.mlp.c_fc.bias', 'gpt.transformer.h.6.mlp.c_proj.bias', 'gpt.transformer.h.7.ln_1.linear.bias', 'gpt.transformer.h.7.attn.c_attn.bias', 'gpt.transformer.h.7.attn.c_proj.bias', 'gpt.transformer.h.7.ln_2.linear.bias', 'gpt.transformer.h.7.crossattention.c_attn.bias', 'gpt.transformer.h.7.crossattention.q_attn.bias', 'gpt.transformer.h.7.crossattention.c_proj.bias', 'gpt.transformer.h.7.ln_cross_attn.linear.bias', 'gpt.transformer.h.7.mlp.c_fc.bias', 'gpt.transformer.h.7.mlp.c_proj.bias', 'gpt.transformer.h.8.ln_1.linear.bias', 'gpt.transformer.h.8.attn.c_attn.bias', 'gpt.transformer.h.8.attn.c_proj.bias', 'gpt.transformer.h.8.ln_2.linear.bias', 'gpt.transformer.h.8.crossattention.c_attn.bias', 'gpt.transformer.h.8.crossattention.q_attn.bias', 'gpt.transformer.h.8.crossattention.c_proj.bias', 'gpt.transformer.h.8.ln_cross_attn.linear.bias', 'gpt.transformer.h.8.mlp.c_fc.bias', 'gpt.transformer.h.8.mlp.c_proj.bias', 
'gpt.transformer.h.9.ln_1.linear.bias', 'gpt.transformer.h.9.attn.c_attn.bias', 'gpt.transformer.h.9.attn.c_proj.bias', 'gpt.transformer.h.9.ln_2.linear.bias', 'gpt.transformer.h.9.crossattention.c_attn.bias', 'gpt.transformer.h.9.crossattention.q_attn.bias', 'gpt.transformer.h.9.crossattention.c_proj.bias', 'gpt.transformer.h.9.ln_cross_attn.linear.bias', 'gpt.transformer.h.9.mlp.c_fc.bias', 'gpt.transformer.h.9.mlp.c_proj.bias', 'gpt.transformer.h.10.ln_1.linear.bias', 'gpt.transformer.h.10.attn.c_attn.bias', 'gpt.transformer.h.10.attn.c_proj.bias', 'gpt.transformer.h.10.ln_2.linear.bias', 'gpt.transformer.h.10.crossattention.c_attn.bias', 'gpt.transformer.h.10.crossattention.q_attn.bias', 'gpt.transformer.h.10.crossattention.c_proj.bias', 'gpt.transformer.h.10.ln_cross_attn.linear.bias', 'gpt.transformer.h.10.mlp.c_fc.bias', 'gpt.transformer.h.10.mlp.c_proj.bias', 'gpt.transformer.h.11.ln_1.linear.bias', 'gpt.transformer.h.11.attn.c_attn.bias', 'gpt.transformer.h.11.attn.c_proj.bias', 'gpt.transformer.h.11.ln_2.linear.bias', 'gpt.transformer.h.11.crossattention.c_attn.bias', 'gpt.transformer.h.11.crossattention.q_attn.bias', 'gpt.transformer.h.11.crossattention.c_proj.bias', 'gpt.transformer.h.11.ln_cross_attn.linear.bias', 'gpt.transformer.h.11.mlp.c_fc.bias', 'gpt.transformer.h.11.mlp.c_proj.bias', 'gpt.transformer.ln_f.weight', 'gpt.transformer.ln_f.bias', 'len_head.model.0.bias', 'len_head.model.2.bias', 'clip_project.model.0.bias', 'clip_project.model.2.bias']
Has decay params: ['pad_embedding', 'gpt.transformer.wte.weight', 'gpt.transformer.wpe.weight', 'gpt.transformer.h.0.ln_1.linear.weight', 'gpt.transformer.h.0.attn.c_attn.weight', 'gpt.transformer.h.0.attn.c_proj.weight', 'gpt.transformer.h.0.ln_2.linear.weight', 'gpt.transformer.h.0.crossattention.c_attn.weight', 'gpt.transformer.h.0.crossattention.q_attn.weight', 'gpt.transformer.h.0.crossattention.c_proj.weight', 'gpt.transformer.h.0.ln_cross_attn.linear.weight', 'gpt.transformer.h.0.mlp.c_fc.weight', 'gpt.transformer.h.0.mlp.c_proj.weight', 'gpt.transformer.h.1.ln_1.linear.weight', 'gpt.transformer.h.1.attn.c_attn.weight', 'gpt.transformer.h.1.attn.c_proj.weight', 'gpt.transformer.h.1.ln_2.linear.weight', 'gpt.transformer.h.1.crossattention.c_attn.weight', 'gpt.transformer.h.1.crossattention.q_attn.weight', 'gpt.transformer.h.1.crossattention.c_proj.weight', 'gpt.transformer.h.1.ln_cross_attn.linear.weight', 'gpt.transformer.h.1.mlp.c_fc.weight', 'gpt.transformer.h.1.mlp.c_proj.weight', 'gpt.transformer.h.2.ln_1.linear.weight', 'gpt.transformer.h.2.attn.c_attn.weight', 'gpt.transformer.h.2.attn.c_proj.weight', 'gpt.transformer.h.2.ln_2.linear.weight', 'gpt.transformer.h.2.crossattention.c_attn.weight', 'gpt.transformer.h.2.crossattention.q_attn.weight', 'gpt.transformer.h.2.crossattention.c_proj.weight', 'gpt.transformer.h.2.ln_cross_attn.linear.weight', 'gpt.transformer.h.2.mlp.c_fc.weight', 'gpt.transformer.h.2.mlp.c_proj.weight', 'gpt.transformer.h.3.ln_1.linear.weight', 'gpt.transformer.h.3.attn.c_attn.weight', 'gpt.transformer.h.3.attn.c_proj.weight', 'gpt.transformer.h.3.ln_2.linear.weight', 'gpt.transformer.h.3.crossattention.c_attn.weight', 'gpt.transformer.h.3.crossattention.q_attn.weight', 'gpt.transformer.h.3.crossattention.c_proj.weight', 'gpt.transformer.h.3.ln_cross_attn.linear.weight', 'gpt.transformer.h.3.mlp.c_fc.weight', 'gpt.transformer.h.3.mlp.c_proj.weight', 'gpt.transformer.h.4.ln_1.linear.weight', 'gpt.transformer.h.4.attn.c_attn.weight', 
'gpt.transformer.h.4.attn.c_proj.weight', 'gpt.transformer.h.4.ln_2.linear.weight', 'gpt.transformer.h.4.crossattention.c_attn.weight', 'gpt.transformer.h.4.crossattention.q_attn.weight', 'gpt.transformer.h.4.crossattention.c_proj.weight', 'gpt.transformer.h.4.ln_cross_attn.linear.weight', 'gpt.transformer.h.4.mlp.c_fc.weight', 'gpt.transformer.h.4.mlp.c_proj.weight', 'gpt.transformer.h.5.ln_1.linear.weight', 'gpt.transformer.h.5.attn.c_attn.weight', 'gpt.transformer.h.5.attn.c_proj.weight', 'gpt.transformer.h.5.ln_2.linear.weight', 'gpt.transformer.h.5.crossattention.c_attn.weight', 'gpt.transformer.h.5.crossattention.q_attn.weight', 'gpt.transformer.h.5.crossattention.c_proj.weight', 'gpt.transformer.h.5.ln_cross_attn.linear.weight', 'gpt.transformer.h.5.mlp.c_fc.weight', 'gpt.transformer.h.5.mlp.c_proj.weight', 'gpt.transformer.h.6.ln_1.linear.weight', 'gpt.transformer.h.6.attn.c_attn.weight', 'gpt.transformer.h.6.attn.c_proj.weight', 'gpt.transformer.h.6.ln_2.linear.weight', 'gpt.transformer.h.6.crossattention.c_attn.weight', 'gpt.transformer.h.6.crossattention.q_attn.weight', 'gpt.transformer.h.6.crossattention.c_proj.weight', 'gpt.transformer.h.6.ln_cross_attn.linear.weight', 'gpt.transformer.h.6.mlp.c_fc.weight', 'gpt.transformer.h.6.mlp.c_proj.weight', 'gpt.transformer.h.7.ln_1.linear.weight', 'gpt.transformer.h.7.attn.c_attn.weight', 'gpt.transformer.h.7.attn.c_proj.weight', 'gpt.transformer.h.7.ln_2.linear.weight', 'gpt.transformer.h.7.crossattention.c_attn.weight', 'gpt.transformer.h.7.crossattention.q_attn.weight', 'gpt.transformer.h.7.crossattention.c_proj.weight', 'gpt.transformer.h.7.ln_cross_attn.linear.weight', 'gpt.transformer.h.7.mlp.c_fc.weight', 'gpt.transformer.h.7.mlp.c_proj.weight', 'gpt.transformer.h.8.ln_1.linear.weight', 'gpt.transformer.h.8.attn.c_attn.weight', 'gpt.transformer.h.8.attn.c_proj.weight', 'gpt.transformer.h.8.ln_2.linear.weight', 'gpt.transformer.h.8.crossattention.c_attn.weight', 
'gpt.transformer.h.8.crossattention.q_attn.weight', 'gpt.transformer.h.8.crossattention.c_proj.weight', 'gpt.transformer.h.8.ln_cross_attn.linear.weight', 'gpt.transformer.h.8.mlp.c_fc.weight', 'gpt.transformer.h.8.mlp.c_proj.weight', 'gpt.transformer.h.9.ln_1.linear.weight', 'gpt.transformer.h.9.attn.c_attn.weight', 'gpt.transformer.h.9.attn.c_proj.weight', 'gpt.transformer.h.9.ln_2.linear.weight', 'gpt.transformer.h.9.crossattention.c_attn.weight', 'gpt.transformer.h.9.crossattention.q_attn.weight', 'gpt.transformer.h.9.crossattention.c_proj.weight', 'gpt.transformer.h.9.ln_cross_attn.linear.weight', 'gpt.transformer.h.9.mlp.c_fc.weight', 'gpt.transformer.h.9.mlp.c_proj.weight', 'gpt.transformer.h.10.ln_1.linear.weight', 'gpt.transformer.h.10.attn.c_attn.weight', 'gpt.transformer.h.10.attn.c_proj.weight', 'gpt.transformer.h.10.ln_2.linear.weight', 'gpt.transformer.h.10.crossattention.c_attn.weight', 'gpt.transformer.h.10.crossattention.q_attn.weight', 'gpt.transformer.h.10.crossattention.c_proj.weight', 'gpt.transformer.h.10.ln_cross_attn.linear.weight', 'gpt.transformer.h.10.mlp.c_fc.weight', 'gpt.transformer.h.10.mlp.c_proj.weight', 'gpt.transformer.h.11.ln_1.linear.weight', 'gpt.transformer.h.11.attn.c_attn.weight', 'gpt.transformer.h.11.attn.c_proj.weight', 'gpt.transformer.h.11.ln_2.linear.weight', 'gpt.transformer.h.11.crossattention.c_attn.weight', 'gpt.transformer.h.11.crossattention.q_attn.weight', 'gpt.transformer.h.11.crossattention.c_proj.weight', 'gpt.transformer.h.11.ln_cross_attn.linear.weight', 'gpt.transformer.h.11.mlp.c_fc.weight', 'gpt.transformer.h.11.mlp.c_proj.weight', 'len_head.model.0.weight', 'len_head.model.2.weight', 'clip_project.model.0.weight', 'clip_project.model.2.weight']
Data size is 566747
Epoch 0

Training epoch 0
Evaling epoch 0
loading annotations into memory...
0:00:00.513341
creating index...
index created!
Loading and preparing results...
DONE (t=0.02s)
creating index...
index created!
tokenization...
RANK and WORLD_SIZE in environ: 1/4
RANK and WORLD_SIZE in environ: 3/4
RANK and WORLD_SIZE in environ: 0/4
RANK and WORLD_SIZE in environ: 2/4

These are the last lines of the error log:
Traceback (most recent call last):
File "/home/v.silvio/diffusion-image-captioning-main/Paper2/DDCap-main/train.py", line 885, in
main(args)
File "/home/v.silvio/diffusion-image-captioning-main/Paper2/DDCap-main/train.py", line 850, in main
result = val(model, epoch, val_dataloader, args)
File "/home/v.silvio/.conda/envs/DDCap/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/v.silvio/diffusion-image-captioning-main/Paper2/DDCap-main/train.py", line 680, in val
result = evaluate_on_coco_caption(result_all,
File "/home/v.silvio/diffusion-image-captioning-main/Paper2/DDCap-main/misc.py", line 463, in evaluate_on_coco_caption
cocoEval.evaluate()
File "/home/v.silvio/diffusion-image-captioning-main/Paper2/DDCap-main/captioneval/coco_caption/pycocoevalcap/eval.py", line 41, in evaluate
self.tokenize()
File "/home/v.silvio/diffusion-image-captioning-main/Paper2/DDCap-main/captioneval/coco_caption/pycocoevalcap/eval.py", line 37, in tokenize
self.gts = tokenizer.tokenize(gts)
File "/home/v.silvio/diffusion-image-captioning-main/Paper2/DDCap-main/captioneval/coco_caption/pycocoevalcap/tokenizer/ptbtokenizer.py", line 51, in tokenize
p_tokenizer = subprocess.Popen(cmd, cwd=path_to_jar_dirname,
File "/home/v.silvio/.conda/envs/DDCap/lib/python3.9/subprocess.py", line 951, in init
self._execute_child(args, executable, preexec_fn, close_fds,
File "/home/v.silvio/.conda/envs/DDCap/lib/python3.9/subprocess.py", line 1821, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'java'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 88609) of binary: /home/v.silvio/.conda/envs/DDCap/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3]
role_ranks=[0, 1, 2, 3]
global_ranks=[0, 1, 2, 3]
role_world_sizes=[4, 4, 4, 4]
global_world_sizes=[4, 4, 4, 4]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_jwt9x5s5/none_kh_3ngxe/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_jwt9x5s5/none_kh_3ngxe/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_jwt9x5s5/none_kh_3ngxe/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_jwt9x5s5/none_kh_3ngxe/attempt_1/3/error.json
2023-01-10 16:48:21.757112: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-10 16:48:21.757169: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-10 16:48:21.757215: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-10 16:48:21.757308: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-10 16:48:21.906546: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-01-10 16:48:21.922507: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-01-10 16:48:21.922507: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-01-10 16:48:21.934842: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-01-10 16:48:23.038368: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/share/sw/anaconda/anaconda3/lib:/opt/share/cuda/cuda-11.3/lib64
2023-01-10 16:48:23.038368: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/share/sw/anaconda/anaconda3/lib:/opt/share/cuda/cuda-11.3/lib64
2023-01-10 16:48:23.038933: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/share/sw/anaconda/anaconda3/lib:/opt/share/cuda/cuda-11.3/lib64
2023-01-10 16:48:23.039277: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/share/sw/anaconda/anaconda3/lib:/opt/share/cuda/cuda-11.3/lib64
2023-01-10 16:48:23.039505: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-01-10 16:48:23.039901: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-01-10 16:48:23.051735: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/share/sw/anaconda/anaconda3/lib:/opt/share/cuda/cuda-11.3/lib64
2023-01-10 16:48:23.051904: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/share/sw/anaconda/anaconda3/lib:/opt/share/cuda/cuda-11.3/lib64
2023-01-10 16:48:23.052289: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/share/sw/anaconda/anaconda3/lib:/opt/share/cuda/cuda-11.3/lib64
2023-01-10 16:48:23.052589: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/share/sw/anaconda/anaconda3/lib:/opt/share/cuda/cuda-11.3/lib64
2023-01-10 16:48:23.052744: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-01-10 16:48:23.053045: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

The code doesn't crash; it just seems to be waiting for something I can't figure out.
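(For context, the failure in the traceback can be reproduced in isolation: `subprocess.Popen` raises `FileNotFoundError` with errno 2 whenever the executable is not on `PATH`, which is exactly what the PTB tokenizer hit when spawning `java`. A minimal sketch, with a made-up command name:)

```python
import subprocess


def can_launch(cmd_argv):
    """Return True if the command can be spawned, False if the binary is missing."""
    try:
        proc = subprocess.Popen(
            cmd_argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        proc.wait()
        return True
    except FileNotFoundError:
        # Same error the PTBTokenizer reported for 'java'
        return False


print(can_launch(["definitely-not-a-real-binary-xyz"]))  # False: not on PATH
```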

@buxiangzhiren (Owner)
The logs already report "FileNotFoundError: [Errno 2] No such file or directory: 'java'". The problem is that you haven't installed Java.


buxiangzhiren commented Jan 10, 2023

The commands are "$ sudo apt-get update" and "$ sudo apt-get install openjdk-8-jdk". Then you can run "$ java -version" to verify it.


vinevix commented Jan 10, 2023

I already have Java:
java -version
openjdk version "1.8.0_322"
OpenJDK Runtime Environment (build 1.8.0_322-b06)
OpenJDK 64-Bit Server VM (build 25.322-b06, mixed mode)

but it keeps saying FileNotFoundError: [Errno 2] No such file or directory: 'java': 'java'

@buxiangzhiren (Owner)
I suppose there is some issue with your environment variables. Maybe you should check them, or add the path of "java" in train.py.

@buxiangzhiren (Owner)
Just like this:

import os

os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'
os.environ['JRE_HOME'] = f'{os.environ["JAVA_HOME"]}/jre'
os.environ['CLASSPATH'] = f'.:{os.environ["JAVA_HOME"]}/lib:{os.environ["JRE_HOME"]}/lib'
os.environ['PATH'] = f'{os.environ["JAVA_HOME"]}/bin:{os.environ["PATH"]}'
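(A quick way to sanity-check that prepending a directory to `PATH` like this really changes what `subprocess` will launch: `shutil.which` resolves bare command names by scanning `PATH` left to right. Illustrative sketch only; the temp directory and fake "java" script are made up for the demo.)

```python
import os
import shutil
import stat
import tempfile

# Create a fake 'java' executable in a temp directory (demo only).
tmp = tempfile.mkdtemp()
fake_java = os.path.join(tmp, "java")
with open(fake_java, "w") as f:
    f.write("#!/bin/sh\necho fake\n")
os.chmod(fake_java, os.stat(fake_java).st_mode | stat.S_IXUSR)

# Prepend its directory to PATH, same pattern as the snippet above.
os.environ["PATH"] = tmp + os.pathsep + os.environ["PATH"]

# PATH is scanned left to right, so lookup now resolves to the temp-dir 'java'.
print(shutil.which("java"))
```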


vinevix commented Jan 10, 2023

I have these entries in the jvm directory:
java
java-1.7.0-openjdk-1.7.0.261-2.6.22.2.el7_8.x86_64
java-1.8.0
java-1.8.0-openjdk
java-1.8.0-openjdk-1.8.0.322.b06-1.el7_9.x86_64
java-openjdk
jre
jre-1.7.0
jre-1.7.0-openjdk
jre-1.7.0-openjdk-1.7.0.261-2.6.22.2.el7_8.x86_64
jre-1.8.0
jre-1.8.0-openjdk
jre-1.8.0-openjdk-1.8.0.322.b06-1.el7_9.x86_64
jre-openjdk

Should I set 'java-1.8.0-openjdk-1.8.0.322.b06-1.el7_9.x86_64' as JAVA_HOME?
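(Rather than guessing among these entries, one option is to derive JAVA_HOME from whichever `java` binary is actually first on `PATH`. This is a hypothetical helper, not part of the repo; `os.path.realpath` follows the `/etc/alternatives`-style symlink chain typical of this CentOS 7 layout.)

```python
import os
import shutil


def guess_java_home():
    """Derive JAVA_HOME from the 'java' binary found on PATH, if any."""
    java = shutil.which("java")
    if java is None:
        return None
    # Follow symlinks (e.g. /usr/bin/java -> /etc/alternatives/java -> ...)
    # to the real binary, then strip the trailing /bin/java to get the root.
    real = os.path.realpath(java)
    return os.path.dirname(os.path.dirname(real))


home = guess_java_home()
if home is not None:
    os.environ["JAVA_HOME"] = home
```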

@buxiangzhiren (Owner)
> Should I set 'java-1.8.0-openjdk-1.8.0.322.b06-1.el7_9.x86_64' as JAVA_HOME?

I don't know why you don't have the directory "java-8-openjdk-amd64" if Java was installed successfully. You could try that one first.
