Description
Describe the issue as clearly as possible:
A vLLM request crashes with a 500 Internal Server Error if the `cfg` request parameter is specified.
Steps/code to reproduce the bug:
Follow the vLLM tutorial:
https://outlines-dev.github.io/outlines/reference/vllm/
Since `outlines[serve]` installs old versions, install vLLM 0.2.6 and the latest outlines, plus pydantic 2.0, using pip. (See the actual list of package versions below.)
Host this model (24GB VRAM): deepseek-ai/deepseek-coder-6.7b-instruct
Command:
```shell
python -O -u -m outlines.serve.serve \
  --model=deepseek-ai/deepseek-coder-6.7b-instruct \
  --host=127.0.0.1 \
  --port=8000 \
  --max-model-len=16384 \
  --max-num-seqs=16 \
  --swap-space=8 \
  --gpu-memory-utilization=0.95
```

Request URL: http://127.0.0.1:8000/generate
Send a request with this body:

```json
{
"prompt": "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite down the first 10 prime numbers as a comma separated list.\n\n### Response:\n",
"n": 1,
"best_of": 1,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"repetition_penalty": 1.0,
"temperature": 0.0,
"top_p": 1.0,
"top_k": -1,
"min_p": 0.0,
"use_beam_search": false,
"length_penalty": 1.0,
"early_stopping": false,
"stop": [],
"stop_token_ids": [],
"include_stop_str_in_output": false,
"ignore_eos": false,
"max_tokens": 50,
"logprobs": null,
"prompt_logprobs": null,
"skip_special_tokens": true,
"spaces_between_special_tokens": true,
"cfg": "\\\n?start: DIGIT+ ( \",\" DIGIT+ )* _WS?\n%import common.DIGIT\n%import common.WS -> _WS\n"
}
```

The above request crashes with a 500 Internal Server Error; see the server-side exception below.
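For reference, here is a minimal sketch of the same request sent from Python with `requests` (which is in the version list below); the prompt is abbreviated here and the remaining fields are left at their defaults:

```python
# Minimal client for reproducing the request above; assumes the server started
# with the command above is listening on 127.0.0.1:8000 and that the `requests`
# package is installed. The full JSON body above can be substituted verbatim.
import requests

body = {
    "prompt": (
        "### Instruction:\n"
        "Write down the first 10 prime numbers as a comma separated list.\n\n"
        "### Response:\n"
    ),
    "max_tokens": 50,
    "temperature": 0.0,
    # The grammar that triggers the 500 Internal Server Error:
    "cfg": (
        '?start: DIGIT+ ( "," DIGIT+ )* _WS?\n'
        "%import common.DIGIT\n"
        "%import common.WS -> _WS\n"
    ),
}

response = requests.post("http://127.0.0.1:8000/generate", json=body)
print(response.status_code)  # 500, with the traceback below on the server side
print(response.text)
```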
The same grammar has been tested and parses correctly with Lark on its own, without vLLM. The grammar allows optional whitespace at the end because this model seems to require it in order to stop.
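A minimal sketch of that standalone Lark check (lark 1.1.9 is in the version list below):

```python
# Standalone check that the grammar itself is valid and parses the expected
# output with Lark alone, without vLLM or outlines involved.
from lark import Lark

grammar = r"""
?start: DIGIT+ ( "," DIGIT+ )* _WS?
%import common.DIGIT
%import common.WS -> _WS
"""

parser = Lark(grammar)
tree = parser.parse("2,3,5,7,11,13,17,19,23,29\n")
print(tree.pretty())
```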
Expected result:

```shell
2,3,5,7,11,13,17,19,23,29
```

Also, please rename the `cfg` request parameter to `grammar` or `lark_grammar`, thanks!
Error message:

```
INFO: 192.168.1.70:56873 - "POST /generate HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/viktor/env/outlines/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/viktor/env/outlines/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/routing.py", line 762, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/routing.py", line 782, in app
await route.handle(scope, receive, send)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/fastapi/routing.py", line 299, in app
raise e
File "/home/viktor/env/outlines/lib/python3.10/site-packages/fastapi/routing.py", line 294, in app
raw_response = await run_endpoint_function(
File "/home/viktor/env/outlines/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/outlines/serve/serve.py", line 75, in generate
sampling_params = SamplingParams(
TypeError: SamplingParams.__init__() got an unexpected keyword argument 'cfg'
```

Outlines/Python version information:

```
Package Version
------------------------- ------------
accelerate 0.26.1
aiofiles 23.2.1
aiohttp 3.9.1
aioprometheus 23.12.0
aiosignal 1.3.1
altair 5.2.0
annotated-types 0.6.0
anyio 4.2.0
appdirs 1.4.4
asttokens 2.4.1
async-timeout 4.0.3
attrs 23.2.0
beartype 0.16.4
certifi 2023.11.17
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
contourpy 1.2.0
cycler 0.12.1
docker-pycreds 0.4.0
exceptiongroup 1.2.0
fastapi 0.109.0
ffmpy 0.3.1
filelock 3.13.1
fonttools 4.47.2
frozenlist 1.4.1
fschat 0.2.3
fsspec 2023.12.2
gitdb 4.0.11
GitPython 3.1.41
gradio 3.23.0
h11 0.14.0
httpcore 1.0.2
httptools 0.6.1
httpx 0.26.0
huggingface-hub 0.20.2
icontract 2.6.6
idna 3.6
interegular 0.3.3
Jinja2 3.1.3
joblib 1.3.2
jsonschema 4.20.0
jsonschema-specifications 2023.12.1
kiwisolver 1.4.5
lark 1.1.9
linkify-it-py 2.0.2
llvmlite 0.41.1
markdown-it-py 2.2.0
markdown2 2.4.12
MarkupSafe 2.1.3
matplotlib 3.8.2
mdit-py-plugins 0.3.3
mdurl 0.1.2
mpmath 1.3.0
msgpack 1.0.7
multidict 6.0.4
nest-asyncio 1.5.8
networkx 3.2.1
ninja 1.11.1.1
numba 0.58.1
numpy 1.26.3
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu12 12.1.105
orjson 3.9.10
outlines 0.0.23
packaging 23.2
pandas 2.1.4
perscache 0.6.1
pillow 10.2.0
pip 22.0.2
prompt-toolkit 3.0.43
protobuf 4.25.2
psutil 5.9.7
pyarrow 14.0.2
pydantic 2.5.3
pydantic_core 2.14.6
pydub 0.25.1
Pygments 2.17.2
pyparsing 3.1.1
python-dateutil 2.8.2
python-dotenv 1.0.0
python-multipart 0.0.6
pytz 2023.3.post1
PyYAML 6.0.1
quantile-python 1.1
ray 2.9.0
referencing 0.32.1
regex 2023.12.25
requests 2.31.0
rich 13.7.0
rpds-py 0.17.1
safetensors 0.4.1
scipy 1.11.4
semantic-version 2.10.0
sentencepiece 0.1.99
sentry-sdk 1.39.2
setproctitle 1.3.3
setuptools 59.6.0
shortuuid 1.0.11
six 1.16.0
smmap 5.0.1
sniffio 1.3.0
starlette 0.35.1
svgwrite 1.4.3
sympy 1.12
tokenizers 0.15.0
toolz 0.12.0
torch 2.1.2
tqdm 4.66.1
transformers 4.36.2
triton 2.1.0
typing_extensions 4.9.0
tzdata 2023.4
uc-micro-py 1.0.2
urllib3 2.1.0
uvicorn 0.25.0
uvloop 0.19.0
vllm 0.2.6
wandb 0.16.2
watchfiles 0.21.0
wavedrom 2.0.3.post3
wcwidth 0.2.13
websockets 12.0
xformers 0.0.23.post1
yarl 1.9.4
```

Context for the issue:
Due to this bug it is not possible to use a grammar with vLLM at all.
I can only use vLLM because it has the best throughput and correctness. GBNF grammar support seems to have problems in llama.cpp, and its throughput is ~10x worse.
Grammar support would help avoid repeated queries caused by non-conforming LLM output, which is an essential efficiency improvement in most real-world tasks.
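Judging from the traceback, `serve.py` appears to forward the unrecognized `cfg` key directly into vLLM's `SamplingParams`, which only accepts the standard sampling arguments. A library-free sketch of the pattern a fix would presumably follow is below: pop the grammar key out of the request body before the remaining fields become sampling parameters, and turn it into a grammar-guided logits processor instead. All names here are illustrative; this is not outlines or vLLM code.

```python
# Library-free illustration of the likely root cause and of the fix pattern:
# the "cfg" key has to be removed from the request body before the remaining
# fields are forwarded as keyword arguments to vLLM's SamplingParams.
# Every name below is illustrative; this is not outlines or vLLM code.
from typing import Any, Dict, Optional, Tuple


def split_cfg(request: Dict[str, Any]) -> Tuple[Optional[str], Dict[str, Any]]:
    """Return the grammar (if any) and the remaining sampling arguments."""
    body = dict(request)
    # Today this key leaks through and crashes SamplingParams.__init__().
    grammar = body.pop("cfg", None)
    return grammar, body


grammar, sampling_kwargs = split_cfg(
    {"max_tokens": 50, "temperature": 0.0, "cfg": '?start: DIGIT+ ( "," DIGIT+ )* _WS?'}
)
# `grammar` would be compiled into a grammar-guided logits processor;
# `sampling_kwargs` is now safe to forward to SamplingParams.
print(grammar)
print(sampling_kwargs)
```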