(glm130b) zdbp@zdbp-ThinkStation-P920:~/GLM-130B-main$ bash scripts/generate.sh --input-source interactive
[2023-12-20 12:31:35,264] torch.distributed.run: [WARNING]
[2023-12-20 12:31:35,264] torch.distributed.run: [WARNING] *****************************************
[2023-12-20 12:31:35,264] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-12-20 12:31:35,264] torch.distributed.run: [WARNING] *****************************************
[2023-12-20 12:31:38,081] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,121] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,198] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,205] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,225] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,250] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,261] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,294] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
WARNING: No training data specified
WARNING: No training data specified
WARNING: No training data specified
using world size: 8 and model-parallel size: 8
padded vocab (size: 150528) with 0 dummy tokens (new size: 150528)
WARNING: No training data specified
initializing model parallel with size 8
WARNING: No training data specified
Set tokenizer as a icetk-glm-130B tokenizer! Now you can get_tokenizer() everywhere.
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
WARNING: No training data specified
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[2023-12-20 12:31:45,421] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 199262 closing signal SIGTERM
[2023-12-20 12:31:45,421] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 199263 closing signal SIGTERM
[2023-12-20 12:31:45,421] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 199265 closing signal SIGTERM
[2023-12-20 12:31:45,600] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 3 (pid: 199266) of binary: /home/zdbp/anaconda3/envs/glm130b/bin/python
Traceback (most recent call last):
File "/home/zdbp/anaconda3/envs/glm130b/bin/torchrun", line 8, in
sys.exit(main())
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
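
Note (my annotation, not part of the log): `RuntimeError: CUDA error: invalid device ordinal` from `torch.cuda.set_device` means a rank tried to select a GPU index that is not visible on this machine. The run above launches world size 8 with model-parallel size 8, so it needs 8 visible GPUs; ranks 3-7 crash while the remaining processes only receive SIGTERM, which would be consistent with fewer than 8 GPUs being visible. A minimal sanity check (illustrative snippet, not from the repo; the value 8 is taken from the log above):

```python
import torch

expected = 8  # world size / model-parallel size reported in the log above
visible = torch.cuda.device_count()
print(f"PyTorch sees {visible} GPU(s); the launch expects {expected}")
if visible < expected:
    print("Ranks with local_rank >= visible GPU count will fail with 'invalid device ordinal'.")
```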
(glm130b) zdbp@zdbp-ThinkStation-P920:~/PengJian/GLM-130B-main$ pip install torchrun
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
ERROR: Could not find a version that satisfies the requirement torchrun (from versions: none)
ERROR: No matching distribution found for torchrun
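
Aside (my addition, not in the original output): `torchrun` is not a separate PyPI package, so the pip failure above is expected. It is a console script installed together with `torch` itself, backed by the module `torch.distributed.run`. A quick check, assuming the glm130b environment's torch 2.1.1 is importable:

```python
# torchrun is provided by torch itself (module torch.distributed.run);
# there is no separate "torchrun" package on PyPI.
import shutil
import torch.distributed.run  # importable in the torch 2.1.1 environment shown above

print("torch.distributed.run importable:", True)
print("torchrun entry point on PATH:", shutil.which("torchrun"))
```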
(glm130b) zdbp@zdbp-ThinkStation-P920:~/PengJian/GLM-130B-main$ bash scripts/generate.sh --input-source interactive
python: can't open file '/home/zdbp/PengJian/GLM-130B-main/8': [Errno 2] No such file or directory
(glm130b) zdbp@zdbp-ThinkStation-P920:~/PengJian/GLM-130B-main$
(glm130b) zdbp@zdbp-ThinkStation-P920:~/PengJian/GLM-130B-main$ pip install bminf
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting bminf
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/1b/9b/56bbb3f30672e11e64ab0da315459f65d5ae8608e379a41ea6ef442dffb6/bminf-2.0.1-py3-none-any.whl (52 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.3/52.3 kB 690.4 kB/s eta 0:00:00
Requirement already satisfied: torch in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from bminf) (2.1.1+cu121)
Requirement already satisfied: cpm-kernels>=1.0.9 in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from bminf) (1.0.11)
Requirement already satisfied: typing-extensions in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from bminf) (4.9.0)
Requirement already satisfied: filelock in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (3.9.0)
Requirement already satisfied: sympy in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (1.12)
Requirement already satisfied: networkx in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (3.0)
Requirement already satisfied: jinja2 in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (3.1.2)
Requirement already satisfied: fsspec in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (2023.10.0)
Requirement already satisfied: triton==2.1.0 in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (2.1.0)
Requirement already satisfied: MarkupSafe>=2.0 in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from jinja2->torch->bminf) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from sympy->torch->bminf) (1.3.0)
Installing collected packages: bminf
Successfully installed bminf-2.0.1
(glm130b) zdbp@zdbp-ThinkStation-P920:~/PengJian/GLM-130B-main$ bash scripts/generate.sh --input-source interactive
[2023-12-20 12:36:32,086] torch.distributed.run: [WARNING]
[2023-12-20 12:36:32,086] torch.distributed.run: [WARNING] *****************************************
[2023-12-20 12:36:32,086] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-12-20 12:36:32,086] torch.distributed.run: [WARNING] *****************************************
[2023-12-20 12:36:34,707] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:34,756] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:34,961] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:35,021] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:35,036] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:35,073] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:35,147] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:35,153] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
WARNING: No training data specified
using world size: 8 and model-parallel size: 8
padded vocab (size: 150528) with 0 dummy tokens (new size: 150528)
initializing model parallel with size 8
Set tokenizer as a icetk-glm-130B tokenizer! Now you can get_tokenizer() everywhere.
WARNING: No training data specified
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
WARNING: No training data specified
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
WARNING: No training data specified
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[2023-12-20 12:36:42,265] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 209418 closing signal SIGTERM
[2023-12-20 12:36:42,265] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 209419 closing signal SIGTERM
[2023-12-20 12:36:42,266] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 209420 closing signal SIGTERM
[2023-12-20 12:36:42,431] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 3 (pid: 209422) of binary: /home/zdbp/anaconda3/envs/glm130b/bin/python
Traceback (most recent call last):
File "/home/zdbp/anaconda3/envs/glm130b/bin/torchrun", line 8, in
sys.exit(main())
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
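
The second run fails at exactly the same point, presumably because installing bminf by itself does not change how many processes scripts/generate.sh launches per node. If it helps triage, here is a small diagnostic script (my own sketch, not part of the repo) that collects the environment facts relevant to this failure:

```python
import os
import torch

# Environment facts relevant to the 'invalid device ordinal' failures above.
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("visible GPU count:", torch.cuda.device_count())
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))
```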