[Bug]: guided generation can't always finish generating the requested structure #8350

Closed
stas00 opened this issue Sep 11, 2024 · 15 comments
Labels: bug Something isn't working

Comments

@stas00
Contributor

stas00 commented Sep 11, 2024

It appears that guided generation returns the requested structure (e.g. JSON) only if the model has an unlimited number of tokens to generate; otherwise it very often fails to close the structure. For example, with a simple { "key": "value" } JSON schema and a limited number of new tokens, it will often return { "key": "value instead.

I understand why this is happening: the guiding can only constrain generation when there is a restricted subset of legal next tokens, but when any token is legal (e.g. inside a free-form string) the model can exhaust the full max_tokens, and by the time the structure is discovered to be unfinished it's too late to wrap it up.

Here is how to reproduce this problem:

from vllm import LLM, SamplingParams

model_name_or_path = "TinyLlama/TinyLlama-1.1B-Chat-v0.6"

schema = '{ "type": "object", "properties": { "age": { "type": "integer"}, "description": { "type": "string"} }, "required": ["age", "description"] }'

model = LLM(
    model=model_name_or_path,
    tokenizer=model_name_or_path,
    tokenizer_mode="auto",
    tensor_parallel_size=1,
    trust_remote_code=True,
    enforce_eager=True,
    dtype="bfloat16",
    gpu_memory_utilization=0.8,
    guided_decoding_backend="outlines",
)

prompt = "Give an example of a person's profile that fits this JSON schema: {schema}"

sampling_params = SamplingParams(
    temperature=1.0,
    seed=None,
    max_tokens=25,
)
kwargs = dict(
    sampling_params=sampling_params,
    guided_options_request=dict(
        guided_json=schema,
    ),
)

outputs = model.generate([prompt]*10, **kwargs)
for output in outputs:
    response = output.outputs[0].text
    print(response)

gives:

{"age":25, "description":"often misquoted as drinking soup and
{"age":22,"description": "My candidate profile presented here wears sunglasses,
{"age": 70, "description": "Nicky is in his 70's
{"age":14,"description":"Grandparent"}
{"age": 21,"description": "Could you tailor this schema to fit my specific industry by
{"age":22,"description":"I am a toast of the city. I am right by your side, always
{"age": 30,"description": "Fast and furious fan! Willing to work long hours to ful
{"age":5,"description":"Test user description..."}
{"age":30,"description":"I tend towards neosystem-oriented design with a
{ "age": 20, "description": "An enthusiastic and hard-working individual

It's quite obvious what the problem is here.

With lm-format-enforcer backend things are even worse.

I then went to the source and rewrote the same code using the outlines JSON generator, and the problem is the same:

import outlines

model_name_or_path = "TinyLlama/TinyLlama-1.1B-Chat-v0.6"

schema = '{ "type": "object", "properties": { "age": { "type": "integer"}, "description": { "type": "string"} }, "required": ["age", "description"] }'

model = outlines.models.transformers(model_name_or_path, device='cuda:0')
generator = outlines.generate.json(model, schema)

prompt = "Give an example of a person's profile that fits this JSON schema: {schema}"

for i in range(10):
    response = generator(prompt, max_tokens=25)
    print(response)

gives

$ python outlines-direct-truncation.py
Compiling FSM index for all state transitions: 100%|████████████████████████████████████████████████████| 42/42 [00:00<00:00, 162.11it/s]
{'age': 50, 'description': 'Love to doggy-couting for fun!'}
Traceback (most recent call last):
  File "/mnt/nvme0/code/contextual/core-faster-ttft/dawn/exp/infer/faster-ttft/outlines-direct-truncation.py", line 16, in <module>
    response = generator(prompt, max_tokens=25)
  File "/home/stas/anaconda3/envs/py310-pt22/lib/python3.10/site-packages/outlines/generate/api.py", line 230, in __call__
    formatted = [self.format_sequence(sequence) for sequence in stripped]
  File "/home/stas/anaconda3/envs/py310-pt22/lib/python3.10/site-packages/outlines/generate/api.py", line 230, in <listcomp>
    formatted = [self.format_sequence(sequence) for sequence in stripped]
  File "/home/stas/anaconda3/envs/py310-pt22/lib/python3.10/site-packages/outlines/generate/json.py", line 60, in <lambda>
    generator.format_sequence = lambda x: pyjson.loads(x)
  File "/home/stas/anaconda3/envs/py310-pt22/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/home/stas/anaconda3/envs/py310-pt22/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/stas/anaconda3/envs/py310-pt22/lib/python3.10/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 26 (char 25)

As you can see, it succeeds on the first request and fails to generate valid JSON on the second request.

Making max_tokens=50 gives more completed JSON outputs, but some still fail.

While this example is contrived to be easy to understand, the problem can occur at any max_tokens value if the greedy any-character regex stage catches the model at the max_tokens boundary.

The workaround I have been using so far is to make the schema more strict and define maxLength and maxItems, but obviously this makes the generation far from ideal.

I also added a retry mechanism, which is very inefficient at the moment since it retries the whole request. An efficient retry would chop off some of the ending and feed the now-longer prompt, which includes most of the generated output, back to the model, getting it to generate only a different ending that would hopefully finish the structure.
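For illustration, here is a minimal sketch of such a retry - retry_ending and chop_chars are made up for this example, and guided decoding is deliberately not re-applied on the retry call since the FSM would restart from the schema's initial state:

import json

def retry_ending(model, prompt, failed_text, sampling_params, chop_chars=20):
    # keep most of the failed generation, drop only the problematic tail
    kept = failed_text[:-chop_chars]
    # feed prompt + kept prefix back and ask for a short new ending;
    # note: guided decoding is *not* re-applied here, since the FSM would
    # restart from the beginning of the schema on the new request
    out = model.generate([prompt + kept], sampling_params=sampling_params)
    candidate = kept + out[0].outputs[0].text
    return json.loads(candidate)  # raises again if the new ending still doesn't close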

Since the problem is closing the structure, my feeling is that the correct solution is to identify the closing pattern, generate with max_tokens minus the length of the closing structure, and then somehow append the closure by changing the regex to fast-forward to it.
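A rough sketch of that idea, assuming the model/prompt/schema from the repro above plus a Hugging Face tokenizer for the same model, and assuming the closing suffix for this particular schema is known to be "} - note that appending the closure can still fail if the truncation lands outside the final string (e.g. mid-number):

import json
from vllm import SamplingParams

closing_suffix = '"}'                                    # assumed known for this schema
closing_len = len(tokenizer.encode(closing_suffix)) - 1  # minus the prepended BOS token

# reserve room for the closure out of the 25-token budget
reserved_params = SamplingParams(temperature=1.0, max_tokens=25 - closing_len)

outputs = model.generate(
    [prompt],
    sampling_params=reserved_params,
    guided_options_request=dict(guided_json=schema),
)
text = outputs[0].outputs[0].text
try:
    result = json.loads(text)                   # the model may have closed the structure itself
except json.JSONDecodeError:
    result = json.loads(text + closing_suffix)  # "fast-forward" to the closure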

I realize the problem comes from the back-ends but perhaps we have more interested people here to discuss the correct algorithm to resolve this and then we could share the solution with the back-ends. At the very least it'd be good to document that currently guided generation in vllm is not guaranteed to work.

Using latest vllm==0.6.0 and outlines==0.0.46 here.

@stas00 stas00 added the bug Something isn't working label Sep 11, 2024
@stas00 stas00 changed the title [Bug]: guided generation can't always finish the structure [Bug]: guided generation can't always finish generating the requested structure Sep 11, 2024
@stas00
Contributor Author

stas00 commented Sep 11, 2024

If I manipulate the schema to swap the order of age and description, so that the object ends with an integer value, which is a very short string:

-schema = '{ "type": "object", "properties": { "age": { "type": "integer"}, "description": { "type": "string"} }, "required": ["age", "description"] }'
+schema = '{ "type": "object", "properties": { "description": { "type": "string"}, "age": { "type": "integer"} }, "required": ["age", "description"] }'

it then works much better:

{"description":"This is a schema for an example alias using JSONSchemaURI.", "age":20}
{"description": "This user has Signin threshold set to 2.","age": 53}
{"description": "This example demonstrates how to parse a person profile.", "age": 76 }
{"description":"Person for {schema}", "age":18}
{"description":"A case-insensitive match for a regex pattern defined by the client.","age" :
{"description": "a user profile", "age": 30}
{"description" : "personal data about a person (must fit schema for this type) a plain
{"description":"john smith","age":35}
{"description": "A list of pride animals. See also: https://en.wikipedia.org/wiki/Pride

6/10 success compared to 2/10 with the original schema. So even the order of items in the schema can make a difference.

And as I mentioned in the OP, if I make the schema very strict by adding "maxLength": 15 to the free-text field:

schema = '{ "type": "object", "properties": { "description": { "type": "string", "maxLength": 15 }, "age": { "type": "integer"} }, "required": ["age", "description"] }'

I get 100% success:

{"description":"This is a long," ,"age":75}
{"description": "This user has a" , "age": 18}
{"description": "This example is" , "age": 18}
{"description":"Person for {sku", "age":20}
{"description":"A case-insensit","age":19}
{"description": "a user profile", "age": 30}
{"description" : "personal data", "age":18}
{"description":"john smith","age":35}
{"description": "A list of pride", "age": 18}
{ "description": "Member of a Chi" ,"age":25}

which works fine for a contrived example or a test, but it won't work well in the general case.

@stas00
Contributor Author

stas00 commented Sep 11, 2024

Another thought: the model has no clue how many tokens it can use to build the output other than from its training experience. If it was taught to produce JSON in 512 tokens and max_new_tokens is 512, it might have a predictable "canvas" size, but since the prompt is of varying length, how could the model know when to wrap up the requested structure?

So perhaps structure-friendly models need some signal during training so that the model knows how much space it has for each particular prompt. Or perhaps it could be instructed, as in:

prompt = "blah blah"
prompt += ". You have a maximum of {min(context_length-len(prompt)), max_new_tokens)} tokens to complete the task."

?

@njhill
Member

njhill commented Sep 11, 2024

@stas00 one idea that might help - you could additionally pass in a custom LogitsProcessor that increasingly boosts the scores of tokens that end in a terminating json character (i.e. } and ] and possibly ") as well as the EOS token, once you are getting close to max_tokens generated (e.g. within n tokens of it).

The ExponentialDecayLengthPenalty logits processor in transformers could be used for this, but you may need to adapt it slightly because vLLM's LogitsProcessor interface is slightly different from transformers', I think.

You would probably want to generate this list of tokens up-front based on the vocab (there may be other tokens whose string representation ends in } apart from "}" itself...)
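Something like this for the up-front scan - a rough sketch only, since the exact token-to-text handling varies per tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v0.6")

terminating_chars = ('}', ']', '"')
# collect every token whose decoded text ends in a terminating JSON character
closing_token_ids = [
    tok_id
    for tok_id in range(len(tokenizer))
    if tokenizer.decode([tok_id]).endswith(terminating_chars)
]
closing_token_ids.append(tokenizer.eos_token_id)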

@stas00
Contributor Author

stas00 commented Sep 11, 2024

That's a great idea, Nick!

So I tried a simple logits processor that promotes a select few tokens to the top towards the end of the token budget, and it works!

The POC logits processor:

end_chars = list('"}')
end_token_ids = [tokenizer.encode(x)[1] for x in end_chars]
def json_wrap_up(input_ids, logits):
    cur_len = len(input_ids)
    if cur_len > 10:
        logits[end_token_ids] += 10

    return logits

end-to-end code:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name_or_path = "TinyLlama/TinyLlama-1.1B-Chat-v0.6"

schema = '{ "type": "object", "properties": { "age": { "type": "integer"}, "description": { "type": "string"} }, "required": ["age", "description"] }'
#schema = '{ "type": "object", "properties": { "description": { "type": "string", "maxLength": 15 }, "age": { "type": "integer"} }, "required": ["age", "description"] }'

model = LLM(
    model=model_name_or_path,
    tokenizer=model_name_or_path,
    tokenizer_mode="auto",
    tensor_parallel_size=1,
    trust_remote_code=True,
    enforce_eager=True,
    dtype="bfloat16",
    gpu_memory_utilization=0.8,
    guided_decoding_backend="outlines",
)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
end_chars = list('"}')
end_token_ids = [tokenizer.encode(x)[1] for x in end_chars]

prompt = "Give an example of a person's profile that fits this JSON schema: {schema}"

def json_wrap_up(input_ids, logits):
    cur_len = len(input_ids)
    if cur_len > 10:
        logits[end_token_ids] += 10

    return logits

sampling_params = SamplingParams(
    temperature=1.0,
    seed=None,
    max_tokens=25,
    logits_processors=[json_wrap_up]
)
kwargs = dict(
    sampling_params=sampling_params,
    guided_options_request=dict(
        guided_json=schema,
    ),
)

outputs = model.generate([prompt]*10, **kwargs)
for output in outputs:
    response = output.outputs[0].text
    print(response)

running it:

{"age":25, "description":"o " }
{"age":22,"description": "My " }
{"age": 70, "description": "Nicky " }
{"age":14,"description":"Grand " }
{"age": 21,"description": "Could " }
{"age":22,"description":"I am a " }
{"age": 30,"description": "Fast " }
{"age":5,"description":"Test user description... } " }
{"age":30,"description":" } " }
{ "age": 20, "description": " " }

All valid JSON endings! Albeit the algorithm needs some polishing, of course: avoiding stray } characters inside the string values, and not hardcoding the boundary numbers. And as you said, it probably needs to include whole multi-character tokens like "} as single tokens. And of course, auto-deriving the end-of-schema tokens.

@stas00
Contributor Author

stas00 commented Sep 11, 2024

ok, this seems to be quite automated:

# user input
max_tokens = 25
end_chars = list('"}')

end_token_ids = [tokenizer.encode(x)[1] for x in end_chars]
start_ending = max_tokens - len(end_chars) - 1
def json_wrap_up(input_ids, logits):
    if len(input_ids) >= start_ending:
        logits[end_token_ids] += 100
    return logits

sampling_params = SamplingParams(
    temperature=1.0,
    seed=None,
    max_tokens=max_tokens,
    logits_processors=[json_wrap_up]
)

full code:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name_or_path = "TinyLlama/TinyLlama-1.1B-Chat-v0.6"

schema = '{ "type": "object", "properties": { "age": { "type": "integer"}, "description": { "type": "string"} }, "required": ["age", "description"] }'

model = LLM(
    model=model_name_or_path,
    tokenizer=model_name_or_path,
    tokenizer_mode="auto",
    tensor_parallel_size=1,
    trust_remote_code=True,
    enforce_eager=True,
    dtype="bfloat16",
    gpu_memory_utilization=0.8,
    guided_decoding_backend="outlines",
)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
end_chars = list('"}')
end_token_ids = [tokenizer.encode(x)[1] for x in end_chars]

prompt = "Give an example of a person's profile that fits this JSON schema: {schema}"

max_tokens = 25
start_ending = max_tokens - len(end_chars) - 1 
def json_wrap_up(input_ids, logits):
    if len(input_ids) >= start_ending:
        logits[end_token_ids] += 100
    return logits

sampling_params = SamplingParams(
    temperature=1.0,
    seed=None,
    max_tokens=max_tokens,
    logits_processors=[json_wrap_up]
)
kwargs = dict(
    sampling_params=sampling_params,
    guided_options_request=dict(
        guided_json=schema,
    ),
)

outputs = model.generate([prompt]*10, **kwargs)
for output in outputs:
    response = output.outputs[0].text
    print(response)

output:

{"age":25, "description":"often misquoted as drink " }
{"age":22,"description": "My candidate profile presented here wears sung " }
{"age": 70, "description": "Nicky is in his 7 " }
{"age":14,"description":"Grandparent"}
{"age": 21,"description": "Could you tailor this schema to fit my " }
{"age":22,"description":"I am a toast of the city. I am right by your " }
{"age": 30,"description": "Fast and furious fan! Willing to work long " }
{"age":5,"description":"Test user description..."}
{"age":30,"description":"I tend towards neosystem-oriented " }
{ "age": 20, "description": "An enthusiastic and hard " }

Do you think I still need to support multi-character tokens like "}, rather than forcing it to generate " and } as separate tokens?

Anything else I'm missing to generalize this solution?

I think the input from the user will be the ending of the schema - in this example "}. I'm not sure how to automate the extraction of it dynamically.
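The most naive automation I can think of would be something like this - a sketch only, the helper name is made up and it ignores nesting, arrays and optional properties:

import json

def closing_chars_from_schema(schema_str):
    # naive: for a flat object schema the closure is '}', optionally preceded by
    # the quote that closes the last property's string value
    schema = json.loads(schema_str)
    props = list(schema.get("properties", {}).items())
    if props and props[-1][1].get("type") == "string":
        return ['"', '}']
    return ['}']

# for the schema in this issue this returns ['"', '}'] (description is last and a string);
# with the swapped property order it returns just ['}']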

@stas00
Contributor Author

stas00 commented Sep 11, 2024

Hmm, running bigger batches it's still failing at times, so it's not foolproof, e.g.:

{"age":22,"description": "This person has a strong alove they share with their partner } } }

So I suppose I have to be even more precise and promote the ending tokens not all at once, but one at a time.

@njhill
Member

njhill commented Sep 11, 2024

@stas00 that's great that it "worked"!

My hunch is that you could improve the quality of the outputs and have it work better for more general JSON cases by doing some of the other things I mentioned. What if it happens to be inside a JSON array or more deeply nested objects/arrays rather than a string?

Rather than "forcing" it to produce these end chars when there are only a couple of slots left, you could have a longer "ramp down" (say 10-20 tokens, but that is a very arbitrary guess) where you increasingly boost the score of ending-type tokens over that range.

I also think that generating a larger list of such tokens would help too... rather than encoding those chars, scan the entire vocab for tokens which end with them. And it probably wouldn't harm to include EOS in that list (though maybe that won't make much difference in this case).

But generalizing this approach to arbitrary non-JSON schemas/regexes may need a bit more thought :)

{"age":22,"description": "This person has a strong alove they share with their partner } } }

I'm a bit surprised that this was generated since it's not valid json. But using a more complete list of valid token ids again might help.

@stas00
Contributor Author

stas00 commented Sep 11, 2024

OK, so I switched to making my own guided generation for the last few tokens, where I prescribe the one exact character to choose:

end_chars = list('"}') + [tokenizer.eos_token]
end_token_ids = [tokenizer.encode(x)[1] for x in end_chars]

max_tokens = 25
start_ending = max_tokens - len(end_chars) - 1
def json_wrap_up(input_ids, logits):
    promote_token_idx = len(input_ids) - start_ending
    if promote_token_idx >= 0:
        logits[end_token_ids[promote_token_idx]] += 100
    return logits

Seems to work well now. I added JSON validation at the end since I was missing things when inspecting the output manually:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
import json

model_name_or_path = "TinyLlama/TinyLlama-1.1B-Chat-v0.6"

schema = '{ "type": "object", "properties": { "age": { "type": "integer"}, "description": { "type": "string"} }, "required": ["age", "description"] }'

model = LLM(
    model=model_name_or_path,
    tokenizer=model_name_or_path,
    tokenizer_mode="auto",
    tensor_parallel_size=1,
    trust_remote_code=True,
    enforce_eager=True,
    dtype="bfloat16",
    gpu_memory_utilization=0.8,
    guided_decoding_backend="outlines",
)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
end_chars = list('"}') + [tokenizer.eos_token]
end_token_ids = [tokenizer.encode(x)[1] for x in end_chars]

prompt = "Give an example of a person's profile that fits this JSON schema: {schema}"

max_tokens = 25
start_ending = max_tokens - len(end_chars) - 1
def json_wrap_up(input_ids, logits):
    promote_token_idx = len(input_ids) - start_ending
    if promote_token_idx >= 0:
        logits[end_token_ids[promote_token_idx]] += 100
    return logits

sampling_params = SamplingParams(
    temperature=1.0,
    seed=None,
    max_tokens=max_tokens,
    logits_processors=[json_wrap_up]
)
kwargs = dict(
    sampling_params=sampling_params,
    guided_options_request=dict(
        guided_json=schema,
    ),
)

outputs = model.generate([prompt]*100, **kwargs)
for output in outputs:
    response = output.outputs[0].text
    print(response)
    # validate json
    json.loads(response)

output:

[...]
{"age":20,"description":"test"}
{"age":1, "description":"A note about ```UX User Testing " }
{"age":12,"description":"Test description for age field that can hold 12 characters."}
{"age":3, "description":"Deep sea diver, past collections " }
{"age":18,"description":"30 years old tom" }
{"age":18,"description":"inspiring and wise spokesp " }
{"age" : 25, "description" : "I am 2 " }
{"age": 35,"description": "I'm a cheerleader at the " }
{"age":65,"description":"I play the drums, compose music, and run a local " }
{"age": 36,"description":"Love all things fashion, literature and geeky things " }
{"age":2, "description":"This is a sample profile for testing purposes " }
{"age":45,"description":"Good at programming"}
{"age": 21,"description": "A software engineer who doesn't make a big deal " }
{"age":50, "description":"A very active, outgoing person! They love to " }
{"age":95,"description":"snooze for the bold and impressed. My life " }

@stas00
Contributor Author

stas00 commented Sep 11, 2024

{"age":22,"description": "This person has a strong alove they share with their partner } } }

I'm a bit surprised that this was generated since it's not valid json. But using a more complete list of valid token ids again might help.

It's valid JSON in the sense that it's inside a string "... that hasn't been finished, so any character goes; in other words, we are back to the same problem we started with. As long as } ends up with a higher probability than " in the original logits, my promotion of all end tokens fails to coerce finishing the JSON structure. That's why one comment up I rewrote it to promote only one token at a time, and that seems to have solved this edge case.

But if that's the way to go, then I don't know how to apply your suggestion of having multiple tokens - e.g. those matching /.*}$/ - because then the same problem discussed in this comment will happen again and it'll fail to close the structure.

So my solution isn't general to other JSON schemas and will require the user to input the exact ending characters - ideally we would want to derive these automatically from the schema.

@stas00
Contributor Author

stas00 commented Sep 11, 2024

The main problem with this solution is that the generation is still chopped off with respect to the contents of the strings - it'd be nice to be able to signal to the model to wrap the sentences up. You can see what I mean:

{"age":20,"description":"test"}
{"age":1, "description":"A note about ```UX User Testing " }
{"age":12,"description":"Test description for age field that can hold 12 characters."}
{"age":3, "description":"Deep sea diver, past collections " }
{"age":18,"description":"30 years old tom" }
{"age":18,"description":"inspiring and wise spokesp " }
{"age" : 25, "description" : "I am 2 " }
{"age": 35,"description": "I'm a cheerleader at the " }
{"age":65,"description":"I play the drums, compose music, and run a local " }
{"age": 36,"description":"Love all things fashion, literature and geeky things " }
{"age":2, "description":"This is a sample profile for testing purposes " }
{"age":45,"description":"Good at programming"}
{"age": 21,"description": "A software engineer who doesn't make a big deal " }
{"age":50, "description":"A very active, outgoing person! They love to " }
{"age":95,"description":"snooze for the bold and impressed. My life " }

@njhill
Member

njhill commented Sep 11, 2024

It's valid JSON in the sense that it's inside a string "... that hasn't been finished, so any character goes; in other words, we are back to the same problem we started with. As long as } ends up with a higher probability than " in the original logits, my promotion of all end tokens fails to coerce finishing the JSON structure. That's why one comment up I rewrote it to promote only one token at a time, and that seems to have solved this edge case.

Ah, my bad - I was thinking that braces need escaping, but that's not the case.

The main problem with this solution is that the generation is still chopped off with respect to the contents of the strings - it'd be nice to be able to signal to the model to wrap the sentences up. You can see what I mean:

This is where I thought that giving it a longer period to wrap up, and increasing the boosting factor over that time, may help it finish more gracefully - e.g. boosted enough that it would choose to end the string when a sentence comes to a natural end rather than starting a new sentence. And having this be an increasing multiplicative factor rather than just adding 100 as you're doing now.

I think this may work for general JSON schemas, because by boosting these kinds of tokens you're encouraging it to close the current string/list/object, and that will happen repeatedly.

I'm sure that having this integrated into the guiding logic itself would be best, but I'm guessing that would be quite a bit more involved.
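Roughly what I have in mind - a sketch only, where RAMP and BASE_BOOST are arbitrary, max_tokens and closing_token_ids are assumed to be defined as in the earlier snippets, and the boost is added in log space so that it acts as a multiplicative factor on the probabilities:

import math

RAMP = 15           # arbitrary: how many tokens before the budget runs out to start boosting
BASE_BOOST = 1.5    # arbitrary: per-step multiplicative growth of the boost

def ramped_wrap_up(input_ids, logits):
    tokens_left = max_tokens - len(input_ids)
    if tokens_left < RAMP:
        # the boost grows as the remaining budget shrinks
        boost = BASE_BOOST ** (RAMP - tokens_left)
        logits[closing_token_ids] += math.log(boost)
    return logits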

@stas00
Contributor Author

stas00 commented Sep 11, 2024

I agree. Following your suggestion I wrote a hack that enforces valid JSON at the cost of an abrupt ending. Granted, the situation here is extreme because I'm using an extremely short seqlen in my POC, and in the general case it probably won't be an issue 99% of the time. Additionally, making the schema more strict would naturally help the model do the right thing without coercion.

As you're saying, all this work should be done either by vllm or, even better, by the backends - clearly this calls for a smart algorithm informed by many use cases.

Should I take this next to, say, the outlines issue tracker and see whether they consider it a problem they want to solve? After all, they promise JSON, but clearly it's a brittle promise, as I have shown in the OP; here it is again:

import outlines

model_name_or_path = "TinyLlama/TinyLlama-1.1B-Chat-v0.6"

schema = '{ "type": "object", "properties": { "age": { "type": "integer"}, "description": { "type": "string"} }, "required": ["age", "description"] }'

model = outlines.models.transformers(model_name_or_path, device='cuda:0')
generator = outlines.generate.json(model, schema)

prompt = "Give an example of a person's profile that fits this JSON schema: {schema}"

for i in range(10):
    response = generator(prompt, max_tokens=25)
    print(response)

It fails to produce valid JSON on many of the tries.

But the problem is that vllm's outlines integration doesn't use the latter's generate.json API, but instead uses the regex API, so I don't think outlines's solution will make any difference to vllm's guided generation outcome. Please correct me if I'm wrong.

@stas00
Contributor Author

stas00 commented Sep 12, 2024

As Mihai Balint mentioned on Twitter, json_repair could be another approach, since it copes with the missing structure closure:

In [1]: import json_repair

In [2]: json_repair.loads('{"a":1,"b":"str"')
Out[2]: {'a': 1, 'b': 'str'}

In [3]: json_repair.loads('{"a":1,"b":"str')
Out[3]: {'a': 1, 'b': 'str'}

But it's slower than json.loads, so it could be used as a fallback: try json.loads first, and on failure try json_repair.loads; if that also fails, retry the request or use the logits processor?
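A minimal sketch of that fallback, assuming json_repair is installed (pip install json-repair):

import json
import json_repair

def parse_guided_output(text):
    # fast path: the structure was closed properly
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # slow path: repair the truncated/unclosed structure
    return json_repair.loads(text)

# e.g. parse_guided_output('{"age":30,"description":"I tend towards neo')
# should give {'age': 30, 'description': 'I tend towards neo'}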

@eByteTheDust

I was using vllm v0.5.0.post1 and guided generation was working great. I upgraded to vllm v0.6.2 and the only response I get is { ". Sometimes, when adding truncate_prompt_tokens=30, I get the initial characters of my template: { "r

I tried changing max_tokens and many other parameters; nothing works. I think around v0.5.2 Outlines was updated to 0.0.46, while v0.5.0.post1 used outlines 0.0.38. I don't know whether that update broke it or some other change in vllm did.

@stas00
Contributor Author

stas00 commented Oct 28, 2024

So we ended up using json_repair to solve this issue.

@stas00 stas00 closed this as completed Oct 28, 2024