[Bug] AlignBench cannot be evaluated with a VLLM model; the eval stage hangs and then errors out #1298

Open
2 tasks done
IcyFeather233 opened this issue Jul 8, 2024 · 17 comments

@IcyFeather233
Contributor

Prerequisite

Type

I am evaluating with the officially supported tasks/models/datasets.

Environment

opencompass 0.2.6
Ubuntu 20.04
python 3.10.14

Reproduces the problem - code/configuration sample

Config file:

from mmengine.config import read_base

with read_base():
    from .datasets.subjective.alignbench.alignbench_judgeby_critiquellm import alignbench_datasets

from opencompass.models import HuggingFaceCausalLM, HuggingFace, HuggingFaceChatGLM3, OpenAI
from opencompass.models.openai_api import OpenAIAllesAPIN
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.runners import LocalRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.summarizers import AlignmentBenchSummarizer

# -------------Inference Stage ----------------------------------------
# For subjective evaluation, we often set do sample for models
from opencompass.models import VLLM

_meta_template = dict(
    round=[
        dict(role="HUMAN", begin='<|im_start|>user\n', end='<|im_end|>\n'),
        dict(role="BOT", begin="<|im_start|>assistant\n", end='<|im_end|>\n', generate=True),
    ],
    eos_token_id=151645,
)

GPU_NUMS = 4


stop_list = ['<|im_end|>', '</s>', '<|endoftext|>']

models = [
    dict(
        type=VLLM,
        abbr='xxx',
        path='xxx',
        model_kwargs=dict(tensor_parallel_size=GPU_NUMS, disable_custom_all_reduce=True, enforce_eager=True),
        meta_template=_meta_template,
        max_out_len=1024,
        max_seq_len=2048,
        batch_size=GPU_NUMS * 8,
        generation_kwargs=dict(temperature=0.1, top_p=0.9, skip_special_tokens=False, stop=stop_list),
        stop_words=stop_list,
        run_cfg=dict(num_gpus=GPU_NUMS, num_procs=1),
    )
]

datasets = [*alignbench_datasets]

# -------------Evaluation Stage ----------------------------------------

## ------------- JudgeLLM Configuration


api_meta_template = dict(
    round=[
            dict(role='HUMAN', api_role='HUMAN'),
            dict(role='BOT', api_role='BOT', generate=True),
    ],
)

judge_models = [
    dict(
        type=VLLM,
        abbr='CritiqueLLM',
        path='/xxx/models/CritiqueLLM',
        model_kwargs=dict(tensor_parallel_size=GPU_NUMS, disable_custom_all_reduce=True, enforce_eager=True),
        meta_template=_meta_template,
        max_out_len=1024,
        max_seq_len=2048,
        batch_size=GPU_NUMS * 8,
        generation_kwargs=dict(temperature=0.1, top_p=0.9, skip_special_tokens=False, stop=stop_list),
        run_cfg=dict(num_gpus=GPU_NUMS, num_procs=1),
    )
]

## ------------- Evaluation Configuration
eval = dict(
    partitioner=dict(type=SubjectiveNaivePartitioner, models=models, judge_models=judge_models),
    runner=dict(type=LocalRunner, max_num_workers=16, task=dict(type=SubjectiveEvalTask)),
)

summarizer = dict(type=AlignmentBenchSummarizer)

work_dir = 'outputs/alignment_bench/'

Reproduces the problem - command or script

python run.py configs/eval_xxx.py --debug --dump-eval-details

Reproduces the problem - error message

It errored out the first time. For the second run I reused the earlier prediction results with -m eval -r xxx and ran only the eval stage, but it still reports the error below.

07/08 21:37:23 - OpenCompass - INFO - Reusing experiements from 20240708_211011
07/08 21:37:23 - OpenCompass - INFO - Current exp folder: outputs/alignment_bench/20240708_211011
07/08 21:37:23 - OpenCompass - DEBUG - Modules of opencompass's partitioner registry have been automatically imported from opencompass.partitioners
07/08 21:37:23 - OpenCompass - DEBUG - Get class `SubjectiveNaivePartitioner` from "partitioner" registry in "opencompass"
07/08 21:37:23 - OpenCompass - DEBUG - An `SubjectiveNaivePartitioner` instance is built from registry, and its implementation can be found in opencompass.partitioners.sub_naive
07/08 21:37:23 - OpenCompass - DEBUG - Key eval.runner.task.judge_cfg not found in config, ignored.
07/08 21:37:23 - OpenCompass - DEBUG - Key eval.given_pred not found in config, ignored.
07/08 21:37:23 - OpenCompass - DEBUG - Additional config: {'eval': {'runner': {'task': {'dump_details': True}}}}
07/08 21:37:23 - OpenCompass - INFO - Partitioned into 1 tasks.
07/08 21:37:23 - OpenCompass - DEBUG - Task 0: [firefly_qw14b_chat_self_build_rl_dpo_full_b06_240705/alignment_bench]
07/08 21:37:23 - OpenCompass - DEBUG - Modules of opencompass's runner registry have been automatically imported from opencompass.runners
07/08 21:37:23 - OpenCompass - DEBUG - Get class `LocalRunner` from "runner" registry in "opencompass"
07/08 21:37:23 - OpenCompass - DEBUG - An `LocalRunner` instance is built from registry, and its implementation can be found in opencompass.runners.local
07/08 21:37:23 - OpenCompass - DEBUG - Modules of opencompass's task registry have been automatically imported from opencompass.tasks
07/08 21:37:23 - OpenCompass - DEBUG - Get class `SubjectiveEvalTask` from "task" registry in "opencompass"
07/08 21:37:23 - OpenCompass - DEBUG - An `SubjectiveEvalTask` instance is built from registry, and its implementation can be found in opencompass.tasks.subjective_eval
Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp-a34b3233.so.1 library.
	Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
07/08 21:37:51 - OpenCompass - INFO - No postprocessor found.
2024-07-08 21:37:55,725	INFO worker.py:1743 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
INFO 07-08 21:37:59 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='/maindata/data/shared/Security-SFT/cmz/models/CritiqueLLM', speculative_config=None, tokenizer='/maindata/data/shared/Security-SFT/cmz/models/CritiqueLLM', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/maindata/data/shared/Security-SFT/cmz/models/CritiqueLLM)
WARNING 07-08 21:38:00 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
(pid=2330) Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp-a34b3233.so.1 library.
(pid=2330) 	Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
(pid=3478) Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp-a34b3233.so.1 library.
(pid=3478) 	Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
(pid=3565) Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp-a34b3233.so.1 library.
(pid=3565) 	Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
(pid=3652) Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp-a34b3233.so.1 library.
(pid=3652) 	Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
INFO 07-08 21:38:30 utils.py:660] Found nccl from library /usr/lib/x86_64-linux-gnu/libnccl.so.2
(RayWorkerWrapper pid=3478) INFO 07-08 21:38:30 utils.py:660] Found nccl from library /usr/lib/x86_64-linux-gnu/libnccl.so.2
INFO 07-08 21:38:30 selector.py:27] Using FlashAttention-2 backend.
(RayWorkerWrapper pid=3478) INFO 07-08 21:38:36 selector.py:27] Using FlashAttention-2 backend.
(RayWorkerWrapper pid=3652) INFO 07-08 21:38:30 utils.py:660] Found nccl from library /usr/lib/x86_64-linux-gnu/libnccl.so.2 [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
[Stuck here for a very long time, then the following error is thrown]
[E socket.cpp:957] [c10d] The client socket has timed out after 600s while trying to connect to (10.0.11.17, 44169).
ERROR 07-08 21:48:35 worker_base.py:145] Error executing method init_device. This might cause deadlock in distributed execution.
ERROR 07-08 21:48:35 worker_base.py:145] Traceback (most recent call last):
ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
ERROR 07-08 21:48:35 worker_base.py:145]     return executor(*args, **kwargs)
ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker.py", line 111, in init_device
ERROR 07-08 21:48:35 worker_base.py:145]     init_worker_distributed_environment(self.parallel_config, self.rank,
ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker.py", line 288, in init_worker_distributed_environment
ERROR 07-08 21:48:35 worker_base.py:145]     init_distributed_environment(parallel_config.world_size, rank,
ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 70, in init_distributed_environment
ERROR 07-08 21:48:35 worker_base.py:145]     torch.distributed.init_process_group(
ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
ERROR 07-08 21:48:35 worker_base.py:145]     return func(*args, **kwargs)
ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
ERROR 07-08 21:48:35 worker_base.py:145]     func_return = func(*args, **kwargs)
ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1305, in init_process_group
ERROR 07-08 21:48:35 worker_base.py:145]     store, rank, world_size = next(rendezvous_iterator)
ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 199, in _tcp_rendezvous_handler
ERROR 07-08 21:48:35 worker_base.py:145]     store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout, use_libuv)
ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 170, in _create_c10d_store
ERROR 07-08 21:48:35 worker_base.py:145]     tcp_store = TCPStore(hostname, port, world_size, False, timeout)
ERROR 07-08 21:48:35 worker_base.py:145] torch.distributed.DistNetworkError: The client socket has timed out after 600s while trying to connect to (10.0.11.17, 44169).
Traceback (most recent call last):
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] Traceback (most recent call last):
  File "/maindata/data/shared/Security-SFT/cmz/opencompass/opencompass/tasks/subjective_eval.py", line 450, in <module>
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker.py", line 111, in init_device
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145]     init_worker_distributed_environment(self.parallel_config, self.rank,
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker.py", line 288, in init_worker_distributed_environment
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145]     init_distributed_environment(parallel_config.world_size, rank,
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 70, in init_distributed_environment
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145]     torch.distributed.init_process_group(
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145]     return func(*args, **kwargs)
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145]     func_return = func(*args, **kwargs)
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1305, in init_process_group
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145]     store, rank, world_size = next(rendezvous_iterator)
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 199, in _tcp_rendezvous_handler
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145]     store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout, use_libuv)
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 170, in _create_c10d_store
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145]     tcp_store = TCPStore(hostname, port, world_size, False, timeout)
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] torch.distributed.DistNetworkError: The client socket has timed out after 600s while trying to connect to (10.0.11.17, 44169).
(RayWorkerWrapper pid=3652) INFO 07-08 21:38:36 selector.py:27] Using FlashAttention-2 backend. [repeated 2x across cluster]
(RayWorkerWrapper pid=3478) [E socket.cpp:957] [c10d] The client socket has timed out after 600s while trying to connect to (10.0.11.17, 44169).
    inferencer.run()
  File "/maindata/data/shared/Security-SFT/cmz/opencompass/opencompass/tasks/subjective_eval.py", line 94, in run
    self._score(model_cfg, dataset_cfg, eval_cfg, output_column,
  File "/maindata/data/shared/Security-SFT/cmz/opencompass/opencompass/tasks/subjective_eval.py", line 379, in _score
    icl_evaluator = ICL_EVALUATORS.build(eval_cfg['evaluator'])
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/maindata/data/shared/Security-SFT/cmz/opencompass/opencompass/openicl/icl_evaluator/lm_evaluator.py", line 109, in __init__
    model = build_model_from_cfg(model_cfg=judge_cfg)
  File "/maindata/data/shared/Security-SFT/cmz/opencompass/opencompass/utils/build.py", line 25, in build_model_from_cfg
    return MODELS.build(model_cfg)
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/maindata/data/shared/Security-SFT/cmz/opencompass/opencompass/models/vllm.py", line 37, in __init__
    self._load_model(path, model_kwargs)
  File "/maindata/data/shared/Security-SFT/cmz/opencompass/opencompass/models/vllm.py", line 60, in _load_model
    self.model = LLM(path, **model_kwargs)
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 123, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 292, in from_engine_args
    engine = cls(
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 160, in __init__
    self.model_executor = executor_class(
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 43, in _init_executor
    self._init_workers_ray(placement_group)
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 164, in _init_workers_ray
    self._run_workers("init_device")
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 234, in _run_workers
    driver_worker_output = self.driver_worker.execute_method(
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 146, in execute_method
    raise e
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
    return executor(*args, **kwargs)
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker.py", line 111, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker.py", line 288, in init_worker_distributed_environment
    init_distributed_environment(parallel_config.world_size, rank,
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 70, in init_distributed_environment
    torch.distributed.init_process_group(
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
    func_return = func(*args, **kwargs)
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1305, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 199, in _tcp_rendezvous_handler
    store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout, use_libuv)
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 170, in _create_c10d_store
    tcp_store = TCPStore(hostname, port, world_size, False, timeout)
torch.distributed.DistNetworkError: The client socket has timed out after 600s while trying to connect to (10.0.11.17, 44169).
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] Error executing method init_device. This might cause deadlock in distributed execution. [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] Traceback (most recent call last): [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145]     return executor(*args, **kwargs) [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker.py", line 111, in init_device [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145]     init_worker_distributed_environment(self.parallel_config, self.rank, [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker.py", line 288, in init_worker_distributed_environment [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145]     init_distributed_environment(parallel_config.world_size, rank, [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 70, in init_distributed_environment [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145]     torch.distributed.init_process_group( [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper [repeated 4x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145]     return func(*args, **kwargs) [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145]     func_return = func(*args, **kwargs) [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1305, in init_process_group [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145]     store, rank, world_size = next(rendezvous_iterator) [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 199, in _tcp_rendezvous_handler [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145]     store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout, use_libuv) [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145]   File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 170, in _create_c10d_store [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145]     tcp_store = TCPStore(hostname, port, world_size, False, timeout) [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] torch.distributed.DistNetworkError: The client socket has timed out after 600s while trying to connect to (10.0.11.17, 44169). [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) [E socket.cpp:957] [c10d] The client socket has timed out after 600s while trying to connect to (10.0.11.17, 44169). [repeated 2x across cluster]
E0708 21:48:40.958000 140381132564288 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 115) of binary: /maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/bin/python
Traceback (most recent call last):
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/maindata/data/shared/Security-SFT/cmz/opencompass/opencompass/tasks/subjective_eval.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-08_21:48:40
  host      : eflops16
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 115)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
07/08 21:48:41 - OpenCompass - DEBUG - Get class `AlignmentBenchSummarizer` from "partitioner" registry in "opencompass"
07/08 21:48:41 - OpenCompass - DEBUG - An `AlignmentBenchSummarizer` instance is built from registry, and its implementation can be found in opencompass.summarizers.subjective.alignmentbench
outputs/alignment_bench/20240708_211011/results/firefly_qw14b_chat_self_build_rl_dpo_full_b06_240705_judged-by--CritiqueLLM is not exist! please check!

Other information

No response

@IcyFeather233
Contributor Author

IcyFeather233 commented Jul 8, 2024

Additional information

VLLM works perfectly well in the infer stage, but the eval stage hangs; the model's own inference is not the problem.

If I drop VLLM and use HuggingFace in the eval stage instead, everything works fine as well.
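
For reference, the working HF fallback mentioned above can be written as a judge entry like the sketch below; it is based on the config in this report and the standard OpenCompass examples (exact kwargs may differ between versions), with the same placeholder model path:

from opencompass.models import HuggingFaceCausalLM

judge_models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='CritiqueLLM-hf',
        path='/xxx/models/CritiqueLLM',
        tokenizer_path='/xxx/models/CritiqueLLM',
        model_kwargs=dict(device_map='auto', trust_remote_code=True),
        tokenizer_kwargs=dict(padding_side='left', truncation_side='left', trust_remote_code=True),
        meta_template=_meta_template,
        max_out_len=1024,
        max_seq_len=2048,
        batch_size=8,
        run_cfg=dict(num_gpus=GPU_NUMS, num_procs=1),
    )
]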

In addition, CritiqueLLM is already open source. It is the model trained by the AlignBench authors themselves as a GPT-4 replacement for judging: https://github.com/thu-coai/CritiqueLLM . I hope OpenCompass can add this model to AlignBench.

@liushz
Collaborator

liushz commented Jul 10, 2024

Thanks for the suggestion; we will test and support this model as soon as possible. Normally vLLM also works fine in the eval stage. You could try running another model with vLLM in the eval stage; this particular model may simply not play well with OpenCompass's vLLM wrapper. We will verify this on our side as soon as we can.

@IcyFeather233
Contributor Author

@liushz Thanks for the reply. In my testing, many models cannot run with vLLM in the eval stage; I tried Qwen and it does not work either.

@bittersweet1999
Collaborator

That is a bit odd; the logic for bringing up the model in eval should be the same as in the infer stage. What error do you get when running Qwen with vLLM?

@bittersweet1999
Collaborator

Also, CritiqueLLM looks like it is in HF format as well; do you run into any problems using it directly with the existing config?

@IcyFeather233
Contributor Author

That is a bit odd; the logic for bringing up the model in eval should be the same as in the infer stage. What error do you get when running Qwen with vLLM?

The same thing: it also just hangs.

@IcyFeather233
Contributor Author

Also, CritiqueLLM looks like it is in HF format as well; do you run into any problems using it directly with the existing config?

HF works without issue; the drawback is that it is slow, which is why I wanted to try vLLM.

@bittersweet1999
Collaborator

What is your vLLM version?

@IcyFeather233
Contributor Author

What is your vLLM version?

0.4.2

@bittersweet1999
Collaborator

The problem seems to occur at

self.model = LLM(path, **model_kwargs)

but it looks like a quirk inside vLLM itself; see vllm-project/vllm#4974
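
For context, the traceback above shows the eval task being launched through torchrun, which exports MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE/LOCAL_RANK, and vLLM then builds its own process group for tensor parallelism. A workaround reported for similar torchrun-plus-vLLM hangs, untested here and only a sketch, is to clear those variables before the LLM object is constructed:

# Hypothetical workaround, not verified in this thread: drop the torchrun-injected
# rendezvous variables so vLLM sets up its own store instead of waiting on a stale
# one (which would explain the 600s TCP timeout above).
import os

for _var in ("MASTER_ADDR", "MASTER_PORT", "RANK", "LOCAL_RANK", "WORLD_SIZE"):
    os.environ.pop(_var, None)

from vllm import LLM  # import after the environment is cleaned

llm = LLM(
    "/xxx/models/CritiqueLLM",       # same placeholder path as the judge config above
    tensor_parallel_size=4,
    enforce_eager=True,
    disable_custom_all_reduce=True,
)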

@hrdxwandg

hrdxwandg commented Aug 27, 2024

Same here: when running subjective evaluation with vLLM, the eval stage hangs and reports the following:

08/27 15:25:57 - OpenCompass - INFO - No postprocessor found.
INFO 08-27 15:25:58 config.py:729] Defaulting to use mp for distributed inference
INFO 08-27 15:25:58 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='Qwen/Qwen2-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=Qwen/Qwen2-1.5B-Instruct, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 08-27 15:25:59 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-27 15:25:59 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
�[1;36m(VllmWorkerProcess pid=578709)�[0;0m INFO 08-27 15:26:02 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
�[1;36m(VllmWorkerProcess pid=578708)�[0;0m INFO 08-27 15:26:02 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
�[1;36m(VllmWorkerProcess pid=578707)�[0;0m INFO 08-27 15:26:02 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
[E827 15:35:58.511112511 socket.cpp:957] [c10d] The client socket has timed out after 600s while trying to connect to (127.0.0.1, 35639).
Traceback (most recent call last):
xxxxxxxxxxx
torch.distributed.DistNetworkError: The client socket has timed out after 600s while trying to connect to (127.0.0.1, 35639).
ERROR 08-27 15:35:59 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 578709 died, exit code: -15
INFO 08-27 15:35:59 multiproc_worker_utils.py:123] Killing local vLLM worker processes
E0827 15:36:00.383000 140006882236224 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 578492) of binary: /xxx/anaconda3/envs/xxx/bin/python
Traceback (most recent call last):

Is there a workaround? Using HF in the eval stage works, but it is far too slow.

@hrdxwandg

@tonysy

@bittersweet1999
Collaborator

Qwen can use LMDeploy, and LMDeploy does not hang.
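
For anyone trying this route, a judge entry using OpenCompass's LMDeploy (TurboMind) wrapper might look roughly like the sketch below; the class name and the engine_config/gen_config keys follow the OpenCompass LMDeploy examples and may differ between versions, and the model path is only a placeholder:

from opencompass.models import TurboMindModel

judge_models = [
    dict(
        type=TurboMindModel,
        abbr='qwen-judge-lmdeploy',
        path='Qwen/Qwen2-7B-Instruct',   # hypothetical judge model, replace as needed
        engine_config=dict(session_len=2048, max_batch_size=32, tp=GPU_NUMS),
        gen_config=dict(top_p=0.9, temperature=0.1, max_new_tokens=1024),
        meta_template=_meta_template,
        max_out_len=1024,
        max_seq_len=2048,
        batch_size=32,
        run_cfg=dict(num_gpus=GPU_NUMS, num_procs=1),
    )
]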

@hrdxwandg

Qwen can use LMDeploy, and LMDeploy does not hang.

Tried it and it works, thanks.
Still hoping the team can fix the vLLM issue though, since our production pipeline runs on vLLM.

Judging from the docs, the speedup ratio and accuracy still differ a bit between the two backends.
[screenshot from the docs]

@hrdxwandg

Qwen can use LMDeploy, and LMDeploy does not hang.

A correction: my tests on Llama 3.1 and Mistral-0.2 are fine, but on Qwen2 it errors out:

[TM][INFO] NCCL group_id = 0
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
E0827 22:46:43.539000 140671864051520 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -8) local_rank: 0 (pid: 942435) of binary: xxx/anaconda3/envs/llm_eval/bin/python
Traceback (most recent call last):

@tzyodear

+1
I also hit this: evaluating align_bench with a vLLM Qwen2.5 as the judge hangs.
Hitting Ctrl+C shows it is stuck at the with ThreadPoolExecutor() call.

Not sure how to solve this.

@bittersweet1999
Collaborator

+1 I also hit this: evaluating align_bench with a vLLM Qwen2.5 as the judge hangs. Hitting Ctrl+C shows it is stuck at the with ThreadPoolExecutor() call.

Not sure how to solve this.

For now, use LMDeploy or serve the model through vLLM as an API (deploying vLLM as an API is also simple, a single command starts it). The bug with local vLLM models runs deep, and we are still looking for the root cause.
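
As a sketch of the API route (the command and the kwargs follow the vLLM and OpenCompass docs and are not verified on this exact setup): start vLLM's OpenAI-compatible server, then point an API-style judge config at it.

# Serve the judge model with vLLM's OpenAI-compatible server (shell):
#   python -m vllm.entrypoints.openai.api_server \
#       --model /xxx/models/CritiqueLLM --tensor-parallel-size 4 --port 8000
#
# Then use an API-based judge in the eval config; parameter names follow the
# OpenCompass OpenAI wrapper and may need adjusting for your version:
from opencompass.models import OpenAI

judge_models = [
    dict(
        type=OpenAI,
        abbr='CritiqueLLM-api',
        path='/xxx/models/CritiqueLLM',   # must match the name the server reports
        openai_api_base='http://127.0.0.1:8000/v1/chat/completions',
        key='EMPTY',                      # vLLM's server does not require a real key by default
        meta_template=api_meta_template,
        query_per_second=4,
        max_out_len=1024,
        max_seq_len=2048,
        batch_size=8,
    )
]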
