(glm130b) zdbp@zdbp-ThinkStation-P920:~/GLM-130B-main$ bash scripts/generate.sh --input-source interactive
[2023-12-20 12:31:35,264] torch.distributed.run: [WARNING]
[2023-12-20 12:31:35,264] torch.distributed.run: [WARNING] *****************************************
[2023-12-20 12:31:35,264] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-12-20 12:31:35,264] torch.distributed.run: [WARNING] *****************************************
[2023-12-20 12:31:38,081] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,121] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,198] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,205] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,225] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,250] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,261] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,294] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
WARNING: No training data specified
WARNING: No training data specified
WARNING: No training data specified
using world size: 8 and model-parallel size: 8
padded vocab (size: 150528) with 0 dummy tokens (new size: 150528)
WARNING: No training data specified
initializing model parallel with size 8
WARNING: No training data specified
Set tokenizer as a icetk-glm-130B tokenizer! Now you can get_tokenizer() everywhere.
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
WARNING: No training data specified
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[2023-12-20 12:31:45,421] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 199262 closing signal SIGTERM
[2023-12-20 12:31:45,421] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 199263 closing signal SIGTERM
[2023-12-20 12:31:45,421] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 199265 closing signal SIGTERM
[2023-12-20 12:31:45,600] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 3 (pid: 199266) of binary: /home/zdbp/anaconda3/envs/glm130b/bin/python
Traceback (most recent call last):
File "/home/zdbp/anaconda3/envs/glm130b/bin/torchrun", line 8, in
sys.exit(main())
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
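
Note (my annotation, not part of the log): `RuntimeError: CUDA error: invalid device ordinal` from `torch.cuda.set_device` means a rank tried to select a GPU index that is not visible on this machine. The run above launches world size 8 with model-parallel size 8, so it needs 8 visible GPUs; ranks 3-7 crash while the remaining processes only receive SIGTERM, which would be consistent with fewer than 8 GPUs being visible. A minimal sanity check (illustrative snippet, not from the repo; the value 8 is taken from the log above):

```python
import torch

expected = 8  # world size / model-parallel size reported in the log above
visible = torch.cuda.device_count()
print(f"PyTorch sees {visible} GPU(s); the launch expects {expected}")
if visible < expected:
    print("Ranks with local_rank >= visible GPU count will fail with 'invalid device ordinal'.")
```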
(glm130b) zdbp@zdbp-ThinkStation-P920:~/PengJian/GLM-130B-main$ pip install torchrun
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
ERROR: Could not find a version that satisfies the requirement torchrun (from versions: none)
ERROR: No matching distribution found for torchrun
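
Aside (my addition, not in the original output): `torchrun` is not a separate PyPI package, so the pip failure above is expected. It is a console script installed together with `torch` itself, backed by the module `torch.distributed.run`. A quick check, assuming the glm130b environment's torch 2.1.1 is importable:

```python
# torchrun is provided by torch itself (module torch.distributed.run);
# there is no separate "torchrun" package on PyPI.
import shutil
import torch.distributed.run  # importable in the torch 2.1.1 environment shown above

print("torch.distributed.run importable:", True)
print("torchrun entry point on PATH:", shutil.which("torchrun"))
```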
(glm130b) zdbp@zdbp-ThinkStation-P920:~/PengJian/GLM-130B-main$ bash scripts/generate.sh --input-source interactive
python: can't open file '/home/zdbp/PengJian/GLM-130B-main/8': [Errno 2] No such file or directory
(glm130b) zdbp@zdbp-ThinkStation-P920:~/PengJian/GLM-130B-main$
(glm130b) zdbp@zdbp-ThinkStation-P920:~/PengJian/GLM-130B-main$ pip install bminf
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting bminf
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/1b/9b/56bbb3f30672e11e64ab0da315459f65d5ae8608e379a41ea6ef442dffb6/bminf-2.0.1-py3-none-any.whl (52 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.3/52.3 kB 690.4 kB/s eta 0:00:00
Requirement already satisfied: torch in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from bminf) (2.1.1+cu121)
Requirement already satisfied: cpm-kernels>=1.0.9 in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from bminf) (1.0.11)
Requirement already satisfied: typing-extensions in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from bminf) (4.9.0)
Requirement already satisfied: filelock in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (3.9.0)
Requirement already satisfied: sympy in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (1.12)
Requirement already satisfied: networkx in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (3.0)
Requirement already satisfied: jinja2 in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (3.1.2)
Requirement already satisfied: fsspec in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (2023.10.0)
Requirement already satisfied: triton==2.1.0 in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (2.1.0)
Requirement already satisfied: MarkupSafe>=2.0 in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from jinja2->torch->bminf) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from sympy->torch->bminf) (1.3.0)
Installing collected packages: bminf
Successfully installed bminf-2.0.1
(glm130b) zdbp@zdbp-ThinkStation-P920:~/PengJian/GLM-130B-main$ bash scripts/generate.sh --input-source interactive
[2023-12-20 12:36:32,086] torch.distributed.run: [WARNING]
[2023-12-20 12:36:32,086] torch.distributed.run: [WARNING] *****************************************
[2023-12-20 12:36:32,086] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-12-20 12:36:32,086] torch.distributed.run: [WARNING] *****************************************
[2023-12-20 12:36:34,707] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:34,756] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:34,961] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:35,021] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:35,036] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:35,073] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:35,147] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:35,153] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
WARNING: No training data specified
using world size: 8 and model-parallel size: 8
padded vocab (size: 150528) with 0 dummy tokens (new size: 150528)
initializing model parallel with size 8
Set tokenizer as a icetk-glm-130B tokenizer! Now you can get_tokenizer() everywhere.
WARNING: No training data specified
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
WARNING: No training data specified
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
WARNING: No training data specified
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[2023-12-20 12:36:42,265] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 209418 closing signal SIGTERM
[2023-12-20 12:36:42,265] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 209419 closing signal SIGTERM
[2023-12-20 12:36:42,266] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 209420 closing signal SIGTERM
[2023-12-20 12:36:42,431] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 3 (pid: 209422) of binary: /home/zdbp/anaconda3/envs/glm130b/bin/python
Traceback (most recent call last):
File "/home/zdbp/anaconda3/envs/glm130b/bin/torchrun", line 8, in
sys.exit(main())
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
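
The second run fails at exactly the same point, presumably because installing bminf by itself does not change how many processes scripts/generate.sh launches per node. If it helps triage, here is a small diagnostic script (my own sketch, not part of the repo) that collects the environment facts relevant to this failure:

```python
import os
import torch

# Environment facts relevant to the 'invalid device ordinal' failures above.
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("visible GPU count:", torch.cuda.device_count())
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))
```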