
[Bug]: trying to run vLLM inference behind a FastAPI server, but it gets stuck #3747

Closed
sigridjineth opened this issue Mar 30, 2024 · 14 comments
Labels: bug (Something isn't working)

Comments

@sigridjineth commented Mar 30, 2024

Your current environment

A100 x 8, Ubuntu

🐛 Describe the bug

Hello, I am trying to run vLLM inference behind a FastAPI server, but it gets stuck at Using model weights format ['*.safetensors']. Is anyone else experiencing this?

2024-03-31 02:05:20,110 INFO sqlalchemy.engine.Engine COMMIT
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8009 (Press CTRL+C to quit)
2024-03-31 02:05:21,902 INFO worker.py:1752 -- Started a local Ray instance.
/home/sionic/sigrid/logickor-pipeline/logickor_uv_pipeline/services/generator.py:21: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  SINGLE_TURN_TEMPLATE, DOUBLE_TURN_TEMPLATE = df_config[0], df_config[1]
INFO 03-31 02:05:23 config.py:433] Custom all-reduce kernels are temporarily disabled due to stability issues. We will re-enable them once the issues are resolved.
2024-03-31 02:05:23,253 INFO worker.py:1585 -- Calling ray.init() again after it has already been called.
INFO 03-31 02:05:23 llm_engine.py:87] Initializing an LLM engine with config: model='maywell/Synatra-kiqu-7B', tokenizer='maywell/Synatra-kiqu-7B', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 03-31 02:05:57 weight_utils.py:163] Using model weights format ['*.safetensors']
(RayWorkerVllm pid=872183) INFO 03-31 02:05:57 weight_utils.py:163] Using model weights format ['*.safetensors']

The code I am using is shown below.

@asynccontextmanager
async def lifespan(app: FastAPI):
    background_task = asyncio.create_task(start_background_process())
    await create_db_and_tables()
    yield
    background_task.cancel()
    try:
        await background_task
    except asyncio.CancelledError:
        pass
    await close_db()

async def start_background_process():
    while True:
        async with AsyncSessionLocal() as session:
            try:
                await process_evaluation_requests(session)
            except Exception as e:
                print(f"Error processing request: {e}")
            finally:
                await asyncio.sleep(1)

-----------------------------------------
import asyncio

from sqlmodel import select
from sqlmodel.ext.asyncio.session import AsyncSession

from logickor_uv_pipeline.models.evaluation.request import Evaluation
from logickor_uv_pipeline.services.generator import generate


async def process_evaluation_requests(session: AsyncSession):
    output_path = "./output/generate"
    while True:
        async with session.begin():
            statement = select(Evaluation).where(Evaluation.status == "pending")
            results = await session.execute(statement)
            requests = results.scalars().all()

            for request in requests:
                try:
                    output_file_name = await generate(request, output_path)
                    request.status = "success" if output_file_name else "failed"
                except Exception as e:
                    print(str(e))
                    request.status = "failed"

        await session.commit()
        await asyncio.sleep(10)

-----------------------------------------------------------------------
import os

import pandas as pd
from vllm import LLM, SamplingParams
import ray


async def generate(request, output_path):
    try:
        # Check if Ray is initialized; if not, initialize Ray
        if not ray.is_initialized():
            ray.init()

        os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"
        gpu_counts = len("4,5,6,7".split(","))

        df_config = pd.read_json(
            "./logickor_uv_pipeline/services/LogicKor/templates/template-EEVE.json",
            typ="series",
        )
        SINGLE_TURN_TEMPLATE, DOUBLE_TURN_TEMPLATE = df_config[0], df_config[1]

        llm = LLM(
            model=request.model_name,
            tensor_parallel_size=gpu_counts,
            max_model_len=4096,
            gpu_memory_utilization=0.8,
        )
        sampling_params = SamplingParams(
            temperature=0,
            top_p=1,
            top_k=-1,
            early_stopping=True,
            best_of=4,
            use_beam_search=True,
            skip_special_tokens=False,
            max_tokens=4096,
            stop=["", "</s>", "", "[INST]", "[/INST]"],
        )

        df_questions = pd.read_json(
            "./logickor_uv_pipeline/services/LogicKor/questions.jsonl", lines=True
        )

        def format_single_turn_question(question):
            return SINGLE_TURN_TEMPLATE.format(question=question)

        single_turn_questions = df_questions["question"].map(format_single_turn_question)
        single_turn_outputs = [
            output.outputs[0].text.strip()
            for output in await llm.generate(
                single_turn_questions.tolist(), sampling_params
            )
        ]

        def format_double_turn_question(question, single_turn_output):
            return DOUBLE_TURN_TEMPLATE.format(
                question=question, single_turn_output=single_turn_output
            )

        multi_turn_questions = [
            format_double_turn_question(question, single_turn_outputs[idx])
            for idx, question in enumerate(df_questions["question"])
        ]

        multi_turn_outputs = [
            output.outputs[0].text.strip()
            for output in await llm.generate(multi_turn_questions, sampling_params)
        ]

        df_output = pd.DataFrame(
            {
                "id": df_questions["id"],
                "category": df_questions["category"],
                "question": df_questions["question"],
                "single_turn_output": single_turn_outputs,
                "multi_turn_output": multi_turn_outputs,
                "reference": df_questions["reference"],
            }
        )

        output_file_name = f"{request.model_name.replace('/', '_')}.jsonl"

        df_output.to_json(
            os.path.join(output_path, output_file_name),
            orient="records",
            lines=True,
            force_ascii=False,
        )
        return output_file_name

    except Exception as e:
        # Handle any errors here
        print(f"An error occurred: {e}")
        raise e
    finally:
        # Ensure Ray is shut down to prevent issues with reinitialization
        if ray.is_initialized():
            ray.shutdown()
sigridjineth added the bug label on Mar 30, 2024
@youkaichao (Member) commented:

Hi, please paste your environment using https://github.com/vllm-project/vllm/blob/main/collect_env.py, so that we can help you better.

@sigridjineth (Author) commented Mar 30, 2024

@youkaichao I tried to run it but got the error below. Can you help me out?

(sionic) sionic@iZmj7ir0ircgij46j89st9Z:~/sigrid/vllm$ python ./collect_env.py
Collecting environment information...
Traceback (most recent call last):
  File "/home/sionic/sigrid/vllm/./collect_env.py", line 719, in <module>
    main()
  File "/home/sionic/sigrid/vllm/./collect_env.py", line 698, in main
    output = get_pretty_env_info()
  File "/home/sionic/sigrid/vllm/./collect_env.py", line 693, in get_pretty_env_info
    return pretty_str(get_env_info())
  File "/home/sionic/sigrid/vllm/./collect_env.py", line 499, in get_env_info
    pip_version, pip_list_output = get_pip_packages(run_lambda)
  File "/home/sionic/sigrid/vllm/./collect_env.py", line 469, in get_pip_packages
    out = run_with_pip([sys.executable, '-mpip'])
  File "/home/sionic/sigrid/vllm/./collect_env.py", line 465, in run_with_pip
    return "\n".join(line for line in out.splitlines()
AttributeError: 'NoneType' object has no attribute 'splitlines'

@youkaichao (Member) commented:

This is strange. Your environment might be broken. What happens when you manually execute python -mpip list --format=freeze?

@sigridjineth (Author) commented Mar 31, 2024

@youkaichao I am using uv, which is a Rust-based Python package manager.

Here's the uv pip freeze output:

(logickor-pipeline) sionic@iZmj7ir0ircgij46j89st9Z:~/sigrid/logickor-pipeline$ uv pip freeze
aiosignal==1.3.1
aiosqlite==0.20.0
annotated-types==0.6.0
anyio==4.3.0
attrs==23.2.0
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==3.0.0
cupy-cuda12x==12.1.0
diskcache==5.6.3
distro==1.9.0
dnspython==2.6.1
email-validator==2.1.1
exceptiongroup==1.2.0
fastapi==0.110.0
fastrlock==0.8.2
filelock==3.13.3
frozenlist==1.4.1
fsspec==2024.3.1
greenlet==3.0.3
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.22.2
idna==3.6
interegular==0.3.3
isort==5.13.2
jinja2==3.1.3
joblib==1.3.2
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
lark==1.1.9
llvmlite==0.42.0
markupsafe==2.1.5
mpmath==1.3.0
msgpack==1.0.8
nest-asyncio==1.6.0
networkx==3.2.1
ninja==1.11.1.1
numba==0.59.1
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.4.99
nvidia-nvtx-cu12==12.1.105
openai==1.14.3
outlines==0.0.37
packaging==24.0
pandas==2.2.1
prometheus-client==0.20.0
protobuf==5.26.1
psutil==5.9.8
pydantic==2.6.4
pydantic-core==2.16.3
pydantic-settings==2.2.1
pynvml==11.5.0
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytz==2024.1
pyyaml==6.0.1
ray==2.10.0
referencing==0.34.0
regex==2023.12.25
requests==2.31.0
rpds-py==0.18.0
ruff==0.3.4
safetensors==0.4.2
scipy==1.12.0
sentencepiece==0.2.0
six==1.16.0
sniffio==1.3.1
sqlalchemy==2.0.29
sqlmodel==0.0.16
starlette==0.36.3
sympy==1.12
tokenizers==0.15.2
torch==2.1.2
tqdm==4.66.2
transformers==4.39.2
triton==2.1.0
typing-extensions==4.10.0
tzdata==2024.1
urllib3==2.2.1
uvicorn==0.29.0
uvloop==0.19.0
vllm==0.3.3
watchfiles==0.21.0
websockets==12.0
xformers==0.0.23.post1

@sigridjineth (Author) commented:

@youkaichao this issue also happens in a Docker container.

@youkaichao (Member) commented:

I don't know if uv is supported by vLLM (most likely not). I would recommend using conda instead.

@sigridjineth (Author) commented:

@youkaichao uv uses virtualenv under the hood, so do you mean only conda is supported for the vLLM library?

@youkaichao (Member) commented:

I would say conda is the most tested, and I wouldn't be surprised if virtualenv or uv does not work for vllm.

@sigridjineth (Author) commented:

Okay, is anyone else trying to run vLLM in a Docker setup?

Most of the time I end up dealing with "An error occurred: NCCLBackend is not available. Please install cupy." when initializing the LLM instance in a Docker container.

@youkaichao (Member) commented:

First, I suggest you switch to conda; the problem might be improper package management, with some dependency like cupy corrupted.

Second, which version of vLLM do you use? We recently removed the cupy dependency and also released v0.4.0. You can try the new version.

@sigridjineth (Author) commented:

@youkaichao Okay, I will try the new version.

@robertgshaw2-redhat (Collaborator) commented Apr 1, 2024

@sigridjineth Just curious: why not just run with the vLLM API server instead of rebuilding your own?

The API server code you have written is not the right way to use the LLM class. In your /generate method, you are creating a whole new instance of an LLM, which [loads the model weights from disk, runs the profiling step to see how much memory is available, allocates the full KV cache, etc.]. Since this happens for every request passed to generate, each request will take a long time :)

The way our API server works is that we [load the model weights from disk, run the profiling step to see how much memory is available, allocate the full KV cache] once, and then reuse this state at inference time. If you really do need to build an API server yourself rather than using the interfaces we provide, I would suggest looking at vllm/entrypoints/api_server.py for inspiration on how to do things properly.

But you should have a very good reason for remaking this yourself.
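
For illustration, a minimal sketch of that load-once pattern, loosely following the AsyncLLMEngine usage in vllm/entrypoints/api_server.py. This is not code from the thread: the model name and tensor_parallel_size are taken from the log above, while the endpoint path, sampling settings, and handler signature are placeholder choices.

from fastapi import FastAPI
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid

app = FastAPI()

# Build the engine once at process startup; every request below reuses it,
# so the weights are loaded and the KV cache is allocated only once.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="maywell/Synatra-kiqu-7B", tensor_parallel_size=4)
)


@app.post("/generate")
async def generate(prompt: str) -> dict:
    sampling_params = SamplingParams(temperature=0, max_tokens=512)
    request_id = random_uuid()
    final_output = None
    # AsyncLLMEngine.generate is an async generator that streams partial outputs;
    # keep the last one, which holds the finished completion.
    async for request_output in engine.generate(prompt, sampling_params, request_id):
        final_output = request_output
    return {"text": final_output.outputs[0].text}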

@tsvisab commented May 3, 2024

Hey @sigridjineth, regarding your "stuck init" issue, how are you starting your container?
Are you by any chance running the container using SageMaker or Vertex AI?
In any case, I would guess that you are probably lacking shared memory for inter-GPU communication,
so if you start the Docker container directly, run it with --shm-size="SOME_SIZEgb".
Also, make sure that the container has enough storage for downloading the model shards; with vLLM you can do:

model = LLM(..., download_dir="/dev/shm/cache/some_sub_dir_name_if_you_wish")

And, if it still fails, before you load the model, add:

        ray_tmp_dir = "/dev/shm/tmp/ray"
        os.makedirs(ray_tmp_dir, exist_ok=True)
        ray.init(_temp_dir=ray_tmp_dir, num_gpus=model_config.tensor_parallel_size)

@DarkLight1337 (Member) commented:

We have added documentation for this situation in #5430. Please take a look.
