Description
Describe the issue as clearly as possible:
A vLLM request crashes with a 500 Internal Server Error if the `cfg` request parameter is specified.
Steps/code to reproduce the bug:
Follow the vLLM tutorial:
https://outlines-dev.github.io/outlines/reference/vllm/
Since `outlines[serve]` installs old versions, install vLLM 0.2.6 and the latest outlines, plus pydantic 2.0, using pip. (See the actual list of package versions below.)
Host this model (24GB VRAM): deepseek-ai/deepseek-coder-6.7b-instruct
Command:
```shell
python -O -u -m outlines.serve.serve \
  --model=deepseek-ai/deepseek-coder-6.7b-instruct \
  --host=127.0.0.1 \
  --port=8000 \
  --max-model-len=16384 \
  --max-num-seqs=16 \
  --swap-space=8 \
  --gpu-memory-utilization=0.95
```

Request URL: http://127.0.0.1:8000/generate
Send a request with this body:

```json
{
"prompt": "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite down the first 10 prime numbers as a comma separated list.\n\n### Response:\n",
"n": 1,
"best_of": 1,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"repetition_penalty": 1.0,
"temperature": 0.0,
"top_p": 1.0,
"top_k": -1,
"min_p": 0.0,
"use_beam_search": false,
"length_penalty": 1.0,
"early_stopping": false,
"stop": [],
"stop_token_ids": [],
"include_stop_str_in_output": false,
"ignore_eos": false,
"max_tokens": 50,
"logprobs": null,
"prompt_logprobs": null,
"skip_special_tokens": true,
"spaces_between_special_tokens": true,
"cfg": "\\\n?start: DIGIT+ ( \",\" DIGIT+ )* _WS?\n%import common.DIGIT\n%import common.WS -> _WS\n"
}
```

The above request crashes with a 500 Internal Server Error; see the server-side exception below.
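For reference, here is a minimal sketch of the same request sent from Python with `requests` (which is in the version list below); the prompt is abbreviated here and the remaining fields are left at their defaults:

```python
# Minimal client for reproducing the request above; assumes the server started
# with the command above is listening on 127.0.0.1:8000 and that the `requests`
# package is installed. The full JSON body above can be substituted verbatim.
import requests

body = {
    "prompt": (
        "### Instruction:\n"
        "Write down the first 10 prime numbers as a comma separated list.\n\n"
        "### Response:\n"
    ),
    "max_tokens": 50,
    "temperature": 0.0,
    # The grammar that triggers the 500 Internal Server Error:
    "cfg": (
        '?start: DIGIT+ ( "," DIGIT+ )* _WS?\n'
        "%import common.DIGIT\n"
        "%import common.WS -> _WS\n"
    ),
}

response = requests.post("http://127.0.0.1:8000/generate", json=body)
print(response.status_code)  # 500, with the traceback below on the server side
print(response.text)
```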
The same grammar has been tested and parses correctly with Lark on its own, without vLLM. The grammar allows optional whitespace at the end because this model seems to require it in order to stop.
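A minimal sketch of that standalone Lark check (lark 1.1.9 is in the version list below):

```python
# Standalone check that the grammar itself is valid and parses the expected
# output with Lark alone, without vLLM or outlines involved.
from lark import Lark

grammar = r"""
?start: DIGIT+ ( "," DIGIT+ )* _WS?
%import common.DIGIT
%import common.WS -> _WS
"""

parser = Lark(grammar)
tree = parser.parse("2,3,5,7,11,13,17,19,23,29\n")
print(tree.pretty())
```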
Expected result:

```shell
2,3,5,7,11,13,17,19,23,29
```

Also, please rename the `cfg` request parameter to `grammar` or `lark_grammar`, thanks!
Error message:

```
INFO: 192.168.1.70:56873 - "POST /generate HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/viktor/env/outlines/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/viktor/env/outlines/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/routing.py", line 762, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/routing.py", line 782, in app
await route.handle(scope, receive, send)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/fastapi/routing.py", line 299, in app
raise e
File "/home/viktor/env/outlines/lib/python3.10/site-packages/fastapi/routing.py", line 294, in app
raw_response = await run_endpoint_function(
File "/home/viktor/env/outlines/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/home/viktor/env/outlines/lib/python3.10/site-packages/outlines/serve/serve.py", line 75, in generate
sampling_params = SamplingParams(
TypeError: SamplingParams.__init__() got an unexpected keyword argument 'cfg'
```

Outlines/Python version information:

```
Package Version
------------------------- ------------
accelerate 0.26.1
aiofiles 23.2.1
aiohttp 3.9.1
aioprometheus 23.12.0
aiosignal 1.3.1
altair 5.2.0
annotated-types 0.6.0
anyio 4.2.0
appdirs 1.4.4
asttokens 2.4.1
async-timeout 4.0.3
attrs 23.2.0
beartype 0.16.4
certifi 2023.11.17
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
contourpy 1.2.0
cycler 0.12.1
docker-pycreds 0.4.0
exceptiongroup 1.2.0
fastapi 0.109.0
ffmpy 0.3.1
filelock 3.13.1
fonttools 4.47.2
frozenlist 1.4.1
fschat 0.2.3
fsspec 2023.12.2
gitdb 4.0.11
GitPython 3.1.41
gradio 3.23.0
h11 0.14.0
httpcore 1.0.2
httptools 0.6.1
httpx 0.26.0
huggingface-hub 0.20.2
icontract 2.6.6
idna 3.6
interegular 0.3.3
Jinja2 3.1.3
joblib 1.3.2
jsonschema 4.20.0
jsonschema-specifications 2023.12.1
kiwisolver 1.4.5
lark 1.1.9
linkify-it-py 2.0.2
llvmlite 0.41.1
markdown-it-py 2.2.0
markdown2 2.4.12
MarkupSafe 2.1.3
matplotlib 3.8.2
mdit-py-plugins 0.3.3
mdurl 0.1.2
mpmath 1.3.0
msgpack 1.0.7
multidict 6.0.4
nest-asyncio 1.5.8
networkx 3.2.1
ninja 1.11.1.1
numba 0.58.1
numpy 1.26.3
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu12 12.1.105
orjson 3.9.10
outlines 0.0.23
packaging 23.2
pandas 2.1.4
perscache 0.6.1
pillow 10.2.0
pip 22.0.2
prompt-toolkit 3.0.43
protobuf 4.25.2
psutil 5.9.7
pyarrow 14.0.2
pydantic 2.5.3
pydantic_core 2.14.6
pydub 0.25.1
Pygments 2.17.2
pyparsing 3.1.1
python-dateutil 2.8.2
python-dotenv 1.0.0
python-multipart 0.0.6
pytz 2023.3.post1
PyYAML 6.0.1
quantile-python 1.1
ray 2.9.0
referencing 0.32.1
regex 2023.12.25
requests 2.31.0
rich 13.7.0
rpds-py 0.17.1
safetensors 0.4.1
scipy 1.11.4
semantic-version 2.10.0
sentencepiece 0.1.99
sentry-sdk 1.39.2
setproctitle 1.3.3
setuptools 59.6.0
shortuuid 1.0.11
six 1.16.0
smmap 5.0.1
sniffio 1.3.0
starlette 0.35.1
svgwrite 1.4.3
sympy 1.12
tokenizers 0.15.0
toolz 0.12.0
torch 2.1.2
tqdm 4.66.1
transformers 4.36.2
triton 2.1.0
typing_extensions 4.9.0
tzdata 2023.4
uc-micro-py 1.0.2
urllib3 2.1.0
uvicorn 0.25.0
uvloop 0.19.0
vllm 0.2.6
wandb 0.16.2
watchfiles 0.21.0
wavedrom 2.0.3.post3
wcwidth 0.2.13
websockets 12.0
xformers 0.0.23.post1
yarl 1.9.4
```

Context for the issue:
Due to this bug it is not possible to use a grammar with vLLM at all.
I can only use vLLM because it has the best throughput and correctness. GBNF grammar support seems to have problems in llama.cpp, and its throughput is ~10x worse.
Grammar support would help avoid repeated queries caused by non-conforming LLM output, which is an essential efficiency improvement in most real-world tasks.
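Judging from the traceback, `serve.py` appears to forward the unrecognized `cfg` key directly into vLLM's `SamplingParams`, which only accepts the standard sampling arguments. A library-free sketch of the pattern a fix would presumably follow is below: pop the grammar key out of the request body before the remaining fields become sampling parameters, and turn it into a grammar-guided logits processor instead. All names here are illustrative; this is not outlines or vLLM code.

```python
# Library-free illustration of the likely root cause and of the fix pattern:
# the "cfg" key has to be removed from the request body before the remaining
# fields are forwarded as keyword arguments to vLLM's SamplingParams.
# Every name below is illustrative; this is not outlines or vLLM code.
from typing import Any, Dict, Optional, Tuple


def split_cfg(request: Dict[str, Any]) -> Tuple[Optional[str], Dict[str, Any]]:
    """Return the grammar (if any) and the remaining sampling arguments."""
    body = dict(request)
    # Today this key leaks through and crashes SamplingParams.__init__().
    grammar = body.pop("cfg", None)
    return grammar, body


grammar, sampling_kwargs = split_cfg(
    {"max_tokens": 50, "temperature": 0.0, "cfg": '?start: DIGIT+ ( "," DIGIT+ )* _WS?'}
)
# `grammar` would be compiled into a grammar-guided logits processor;
# `sampling_kwargs` is now safe to forward to SamplingParams.
print(grammar)
print(sampling_kwargs)
```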