
[Bug]: trying to run vLLM inference behind a FastAPI server, but it gets stuck #3747

Closed
sigridjineth opened this issue Mar 30, 2024 · 14 comments
Labels: bug (Something isn't working)

Comments

@sigridjineth commented Mar 30, 2024

Your current environment

A100 x 8, Ubuntu

🐛 Describe the bug

Hello, I am trying to run vLLM inference behind a FastAPI server, but it gets stuck at Using model weights format ['*.safetensors']. Is anyone else experiencing this?

2024-03-31 02:05:20,110 INFO sqlalchemy.engine.Engine COMMIT
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8009 (Press CTRL+C to quit)
2024-03-31 02:05:21,902 INFO worker.py:1752 -- Started a local Ray instance.
/home/sionic/sigrid/logickor-pipeline/logickor_uv_pipeline/services/generator.py:21: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  SINGLE_TURN_TEMPLATE, DOUBLE_TURN_TEMPLATE = df_config[0], df_config[1]
INFO 03-31 02:05:23 config.py:433] Custom all-reduce kernels are temporarily disabled due to stability issues. We will re-enable them once the issues are resolved.
2024-03-31 02:05:23,253 INFO worker.py:1585 -- Calling ray.init() again after it has already been called.
INFO 03-31 02:05:23 llm_engine.py:87] Initializing an LLM engine with config: model='maywell/Synatra-kiqu-7B', tokenizer='maywell/Synatra-kiqu-7B', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 03-31 02:05:57 weight_utils.py:163] Using model weights format ['*.safetensors']
(RayWorkerVllm pid=872183) INFO 03-31 02:05:57 weight_utils.py:163] Using model weights format ['*.safetensors']

The code I am using is shown below.

@asynccontextmanager
async def lifespan(app: FastAPI):
    background_task = asyncio.create_task(start_background_process())
    await create_db_and_tables()
    yield
    background_task.cancel()
    try:
        await background_task
    except asyncio.CancelledError:
        pass
    await close_db()

async def start_background_process():
    while True:
        async with AsyncSessionLocal() as session:
            try:
                await process_evaluation_requests(session)
            except Exception as e:
                print(f"Error processing request: {e}")
            finally:
                await asyncio.sleep(1)

-----------------------------------------
import asyncio

from sqlmodel import select
from sqlmodel.ext.asyncio.session import AsyncSession

from logickor_uv_pipeline.models.evaluation.request import Evaluation
from logickor_uv_pipeline.services.generator import generate


async def process_evaluation_requests(session: AsyncSession):
    output_path = "./output/generate"
    while True:
        async with session.begin():
            statement = select(Evaluation).where(Evaluation.status == "pending")
            results = await session.execute(statement)
            requests = results.scalars().all()

            for request in requests:
                try:
                    output_file_name = await generate(request, output_path)
                    request.status = "success" if output_file_name else "failed"
                except Exception as e:
                    print(str(e))
                    request.status = "failed"

        await session.commit()
        await asyncio.sleep(10)

-----------------------------------------------------------------------
import os

import pandas as pd
from vllm import LLM, SamplingParams
import ray


async def generate(request, output_path):
    try:
        # Check if Ray is initialized; if not, initialize Ray
        if not ray.is_initialized():
            ray.init()

        os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"
        gpu_counts = len("4,5,6,7".split(","))

        df_config = pd.read_json(
            "./logickor_uv_pipeline/services/LogicKor/templates/template-EEVE.json",
            typ="series",
        )
        SINGLE_TURN_TEMPLATE, DOUBLE_TURN_TEMPLATE = df_config[0], df_config[1]

        llm = LLM(
            model=request.model_name,
            tensor_parallel_size=gpu_counts,
            max_model_len=4096,
            gpu_memory_utilization=0.8,
        )
        sampling_params = SamplingParams(
            temperature=0,
            top_p=1,
            top_k=-1,
            early_stopping=True,
            best_of=4,
            use_beam_search=True,
            skip_special_tokens=False,
            max_tokens=4096,
            stop=["", "</s>", "", "[INST]", "[/INST]"],
        )

        df_questions = pd.read_json(
            "./logickor_uv_pipeline/services/LogicKor/questions.jsonl", lines=True
        )

        def format_single_turn_question(question):
            return SINGLE_TURN_TEMPLATE.format(question=question)

        single_turn_questions = df_questions["question"].map(format_single_turn_question)
        single_turn_outputs = [
            output.outputs[0].text.strip()
            for output in await llm.generate(
                single_turn_questions.tolist(), sampling_params
            )
        ]

        def format_double_turn_question(question, single_turn_output):
            return DOUBLE_TURN_TEMPLATE.format(
                question=question, single_turn_output=single_turn_output
            )

        multi_turn_questions = [
            format_double_turn_question(question, single_turn_outputs[idx])
            for idx, question in enumerate(df_questions["question"])
        ]

        multi_turn_outputs = [
            output.outputs[0].text.strip()
            for output in await llm.generate(multi_turn_questions, sampling_params)
        ]

        df_output = pd.DataFrame(
            {
                "id": df_questions["id"],
                "category": df_questions["category"],
                "question": df_questions["question"],
                "single_turn_output": single_turn_outputs,
                "multi_turn_output": multi_turn_outputs,
                "reference": df_questions["reference"],
            }
        )

        output_file_name = f"{request.model_name.replace('/', '_')}.jsonl"

        df_output.to_json(
            os.path.join(output_path, output_file_name),
            orient="records",
            lines=True,
            force_ascii=False,
        )
        return output_file_name

    except Exception as e:
        # Handle any errors here
        print(f"An error occurred: {e}")
        raise e
    finally:
        # Ensure Ray is shut down to prevent issues with reinitialization
        if ray.is_initialized():
            ray.shutdown()
sigridjineth added the bug label on Mar 30, 2024
@youkaichao (Member) commented:

Hi, please paste your environment using https://github.com/vllm-project/vllm/blob/main/collect_env.py, so that we can help you better.

@sigridjineth (Author) commented Mar 30, 2024

@youkaichao I tried to run it but got the error below. Can you help me out?

(sionic) sionic@iZmj7ir0ircgij46j89st9Z:~/sigrid/vllm$ python ./collect_env.py
Collecting environment information...
Traceback (most recent call last):
  File "/home/sionic/sigrid/vllm/./collect_env.py", line 719, in <module>
    main()
  File "/home/sionic/sigrid/vllm/./collect_env.py", line 698, in main
    output = get_pretty_env_info()
  File "/home/sionic/sigrid/vllm/./collect_env.py", line 693, in get_pretty_env_info
    return pretty_str(get_env_info())
  File "/home/sionic/sigrid/vllm/./collect_env.py", line 499, in get_env_info
    pip_version, pip_list_output = get_pip_packages(run_lambda)
  File "/home/sionic/sigrid/vllm/./collect_env.py", line 469, in get_pip_packages
    out = run_with_pip([sys.executable, '-mpip'])
  File "/home/sionic/sigrid/vllm/./collect_env.py", line 465, in run_with_pip
    return "\n".join(line for line in out.splitlines()
AttributeError: 'NoneType' object has no attribute 'splitlines'

@youkaichao (Member) commented:

This is strange. Your environment might be broken. What happens when you manually execute python -mpip list --format=freeze?

@sigridjineth (Author) commented Mar 31, 2024

@youkaichao I am using uv, which is a Rust-based Python package manager.

Here's the uv pip freeze output:

(logickor-pipeline) sionic@iZmj7ir0ircgij46j89st9Z:~/sigrid/logickor-pipeline$ uv pip freeze
aiosignal==1.3.1
aiosqlite==0.20.0
annotated-types==0.6.0
anyio==4.3.0
attrs==23.2.0
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==3.0.0
cupy-cuda12x==12.1.0
diskcache==5.6.3
distro==1.9.0
dnspython==2.6.1
email-validator==2.1.1
exceptiongroup==1.2.0
fastapi==0.110.0
fastrlock==0.8.2
filelock==3.13.3
frozenlist==1.4.1
fsspec==2024.3.1
greenlet==3.0.3
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.22.2
idna==3.6
interegular==0.3.3
isort==5.13.2
jinja2==3.1.3
joblib==1.3.2
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
lark==1.1.9
llvmlite==0.42.0
markupsafe==2.1.5
mpmath==1.3.0
msgpack==1.0.8
nest-asyncio==1.6.0
networkx==3.2.1
ninja==1.11.1.1
numba==0.59.1
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.4.99
nvidia-nvtx-cu12==12.1.105
openai==1.14.3
outlines==0.0.37
packaging==24.0
pandas==2.2.1
prometheus-client==0.20.0
protobuf==5.26.1
psutil==5.9.8
pydantic==2.6.4
pydantic-core==2.16.3
pydantic-settings==2.2.1
pynvml==11.5.0
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytz==2024.1
pyyaml==6.0.1
ray==2.10.0
referencing==0.34.0
regex==2023.12.25
requests==2.31.0
rpds-py==0.18.0
ruff==0.3.4
safetensors==0.4.2
scipy==1.12.0
sentencepiece==0.2.0
six==1.16.0
sniffio==1.3.1
sqlalchemy==2.0.29
sqlmodel==0.0.16
starlette==0.36.3
sympy==1.12
tokenizers==0.15.2
torch==2.1.2
tqdm==4.66.2
transformers==4.39.2
triton==2.1.0
typing-extensions==4.10.0
tzdata==2024.1
urllib3==2.2.1
uvicorn==0.29.0
uvloop==0.19.0
vllm==0.3.3
watchfiles==0.21.0
websockets==12.0
xformers==0.0.23.post1

@sigridjineth (Author) commented:

@youkaichao this issue also happens in a Docker container.

@youkaichao (Member) commented:

I don't know if uv is supported by vLLM (most likely not). I would recommend using conda instead.

@sigridjineth (Author) commented:

@youkaichao uv uses virtualenv under the hood, so do you mean only conda is supported for the vLLM library?

@youkaichao (Member) commented:

I would say conda is the most tested, and I wouldn't be surprised if virtualenv or uv does not work for vllm.

@sigridjineth (Author) commented:

Okay, is anyone else trying to run vLLM in a Docker setup?

Most of the time I end up dealing with "An error occurred: NCCLBackend is not available. Please install cupy." when initializing the LLM instance in a Docker container.

@youkaichao (Member) commented:

First, I suggest you switch to conda; the problem might be improper package management, with some dependency like cupy corrupted.

Second, which version of vLLM do you use? We recently removed the cupy dependency and also released v0.4.0. You can try the new version.

@sigridjineth (Author) commented:

@youkaichao Okay, I will try the new version.

@robertgshaw2-redhat (Collaborator) commented Apr 1, 2024

@sigridjineth Just curious: why not just run with the vLLM API server instead of rebuilding your own?

The API server code you have written is not the right way to use the LLM class. In your /generate method, you are creating a whole new instance of an LLM, which [loads the model weights from disk, runs the profiling step to see how much memory is available, allocates the full KV cache, etc.]. Since this happens for every request passed to generate, each request will take a long time :)

The way our API server works is that we [load the model weights from disk, run the profiling step to see how much memory is available, allocate the full KV cache] once, and then reuse this state at inference time. If you really do need to build an API server yourself rather than using the interfaces we provide, I would suggest looking at vllm/entrypoints/api_server.py for inspiration on how to do things properly.

But you should have a very good reason for remaking this yourself.
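
For illustration, a minimal sketch of that load-once pattern, loosely following the AsyncLLMEngine usage in vllm/entrypoints/api_server.py. This is not code from the thread: the model name and tensor_parallel_size are taken from the log above, while the endpoint path, sampling settings, and handler signature are placeholder choices.

from fastapi import FastAPI
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid

app = FastAPI()

# Build the engine once at process startup; every request below reuses it,
# so the weights are loaded and the KV cache is allocated only once.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="maywell/Synatra-kiqu-7B", tensor_parallel_size=4)
)


@app.post("/generate")
async def generate(prompt: str) -> dict:
    sampling_params = SamplingParams(temperature=0, max_tokens=512)
    request_id = random_uuid()
    final_output = None
    # AsyncLLMEngine.generate is an async generator that streams partial outputs;
    # keep the last one, which holds the finished completion.
    async for request_output in engine.generate(prompt, sampling_params, request_id):
        final_output = request_output
    return {"text": final_output.outputs[0].text}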

@tsvisab commented May 3, 2024

Hey @sigridjineth, regarding your "stuck init" issue, how are you starting your container?
Are you by any chance running the container using SageMaker or Vertex AI?
In any case, I would guess that you are probably lacking shared memory for inter-GPU communication,
so if you start the Docker container directly, run it with --shm-size="SOME_SIZEgb".
Also, make sure that the container has enough storage for downloading the model shards; with vLLM you can do:

model = LLM(..., download_dir="/dev/shm/cache/some_sub_dir_name_if_you_wish")

And, if it still fails, before you load the model, add:

        ray_tmp_dir = "/dev/shm/tmp/ray"
        os.makedirs(ray_tmp_dir, exist_ok=True)
        ray.init(_temp_dir=ray_tmp_dir, num_gpus=model_config.tensor_parallel_size)

@DarkLight1337 (Member) commented:

We have added documentation for this situation in #5430. Please take a look.
