[Misc]: Throughput/Latency for guided_json with ~100% GPU cache utilization #3567
Comments
Is the JSON schema complex at all, and is it the same each time? I'm interested in fixing the performance here. |
Hi Simon, The JSON schema is the same at all times, and it is as follows:
Thanks for looking into this 🫶 |
@simon-mo any update on this? 😊 |
Facing a similar issue here: I have a JSON schema with 14 fields, and the request gets stuck forever. |
My schema only has 2 fields and I also see significantly higher latency than when running without guided_json. Would love to have this fixed, as model performance severely decreases without it. |
If testing lm-format-enforcer, I highly recommend adding the latest version of it to the image, as there have been performance improvements to the JsonSchemaParser. The next version of vLLM will include them, but until then, do |
what speeds are you getting @noamgat vs the outlines backend? |
I didn't test on A100/H100s, but on my dev setup (RTX 3090, Mistral 7B), for simple schemas I was getting less than a 2x reduction in tokens/s. |
+1, it seems not to be GPU related; I tested with A100 and V100 GPUs and both have a similar issue. Using line_profiler, I found this get_guided_decoding_logits_processor call takes 93% of the time |
Just tested; the speedup is not obvious. Probably the main bottleneck is still the get_guided_decoding_logits_processor call
|
@noamgat here's a profiling when I use lm-format-enforcer 0.10.1:
/lib/python3.10/site-packages/lmformatenforcer/integrations/transformers.py
Function: _build_regular_tokens_list at line 58
Line # Hits Time Per Hit % Time Line Contents
==============================================================
58 @Profile
59 def _build_regular_tokens_list(tokenizer: PreTrainedTokenizerBase) -> List[Tuple[int, str, bool]]:
60 1 912794903.0 9e+08 9.5 token_0 = tokenizer.encode("0")[-1]
61 1 8025.0 8025.0 0.0 regular_tokens = []
62 128257 28050361.0 218.7 0.3 for token_idx in range(len(tokenizer)):
63 128256 78294452.0 610.5 0.8 if token_idx in tokenizer.all_special_ids:
64 2 450.0 225.0 0.0 continue
65 # We prepend token 0 and skip the first letter of the result to get a space if the token is a start word.
66 128254 5319568501.0 41476.8 55.3 decoded_after_0 = tokenizer.decode([token_0, token_idx])[1:]
67 128254 3162992335.0 24661.9 32.9 decoded_regular = tokenizer.decode([token_idx])
68 128254 56427009.0 440.0 0.6 is_word_start_token = len(decoded_after_0) > len(decoded_regular)
69 128254 61975079.0 483.2 0.6 regular_tokens.append((token_idx, decoded_after_0, is_word_start_token))
70 1 240.0 240.0 0.0 return regular_tokens
The two decode calls in the for loop seem to take most of the time. Happy to run further tests if needed. |
Building the regular token list only happens at the first request that uses LMFE. Does it happen every time? If so, maybe there is a problem with the lru caching not working. |
Just clarifying - if possible, start the tokens/s measuring and/or profiling from the second request onwards. While the warm-up time is also something that can be optimized, the post-warmup performance matters much more for real-world use cases. This is true for all guided decoding backends. |
@nullpointer0xffff @jens-create I just confirmed the caching of LMFE tokenizer init (very very slow) via @lru_cache is working so
|
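For reference, the kind of caching being referred to looks roughly like this; a minimal sketch assuming a hypothetical build_tokenizer_data helper, not LMFE's or vLLM's actual code:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def build_tokenizer_data(tokenizer_name: str):
    # Expensive one-time work (e.g. scanning the whole vocabulary) runs only on the
    # first call per tokenizer; later guided requests reuse the cached result.
    ...
```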
Maybe we can modify the __call__ method to separate the mask computation from the logits adjustment, so the mask is computed once and reused; a rough sketch is below. Let me know if this makes sense @simon-mo |
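A minimal sketch of that separation, assuming a hypothetical fsm object exposing allowed_token_ids(state); this is not vLLM's actual processor, just the shape of the idea:

```python
import math
from typing import Dict, List

import torch


class CachedMaskLogitsProcessor:
    """Sketch: compute the additive mask once per FSM state, reuse it in __call__."""

    def __init__(self, fsm):
        self.fsm = fsm                      # assumed to expose allowed_token_ids(state)
        self.state = 0                      # state advancement omitted in this sketch
        self._mask_cache: Dict[int, torch.Tensor] = {}

    def _mask_for_state(self, state: int, vocab_size: int, device) -> torch.Tensor:
        mask = self._mask_cache.get(state)
        if mask is None:
            # Expensive part: list -> tensor conversion and mask construction.
            mask = torch.full((vocab_size,), -math.inf, device=device)
            allowed: List[int] = self.fsm.allowed_token_ids(state)
            mask[torch.tensor(allowed, dtype=torch.long, device=device)] = 0.0
            self._mask_cache[state] = mask
        return mask

    def __call__(self, input_ids: List[int], scores: torch.Tensor) -> torch.Tensor:
        # Cheap part: reuse the precomputed mask to adjust the logits.
        mask = self._mask_for_state(self.state, scores.shape[-1], scores.device)
        return scores + mask
```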
Just sharing my experience with this issue - it seems to align with the OP's experience. Summary: CPU-constrained guidance means that batching can't scale correctly. vLLM 0.4.2.
Single request: Outlines ~70 tps - CPU 100%. Batched requests: Outlines ~70 tps - CPU 100%.
Example guidance:
regex: ~~~response\n# Content\\n([.\\W\\w]+)\\n{2}~{3}
json: {"type":"object","properties":{"test":{"type":"string"}},"required":["test"]} |
Here's line timings for model_executor/guided_decoding/outlines_logits_processors.py
|
Based on that timing breakdown, can you try to replace |
I've been doing some further perf analysis and breaking things out a bit to try and understand the bottleneck. It doesn't seem to be related to the indexer but rather to moving the allowed_tokens array around. CPU first, then move to GPU:
Straight to GPU:
|
|
Beyond this, I'm not sure I see a way forward without changes to outlines and lm-format-enforcer to provide the information in a more efficient structure than a List. Does anyone see any memoization opportunities here to at least reduce the iteration counts? |
One thing I think we could do to make it faster is to use the fact that allowed_tokens is either almost all of the tokens or almost none of them. Currently the mask is created at -math.inf, but when allowed_tokens covers more than half of scores.shape[0] we could instead create the mask at 0 and fill_ only the disallowed positions with -math.inf - a rough sketch is below. |
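A rough sketch of that proposal (illustrative only, not the actual code in outlines_logits_processors.py):

```python
import math

import torch


def build_mask(allowed_token_ids, scores: torch.Tensor) -> torch.Tensor:
    """Build the additive mask from whichever side of the vocabulary is smaller."""
    vocab_size = scores.shape[-1]
    allowed = torch.tensor(allowed_token_ids, dtype=torch.long, device=scores.device)
    if allowed.numel() > vocab_size // 2:
        # Almost everything is allowed: start at 0 and ban only the complement.
        mask = torch.zeros(vocab_size, device=scores.device)
        banned = torch.ones(vocab_size, dtype=torch.bool, device=scores.device)
        banned[allowed] = False
        mask[banned] = -math.inf
    else:
        # Few tokens allowed: start at -inf and zero out the allowed ones.
        mask = torch.full((vocab_size,), -math.inf, device=scores.device)
        mask[allowed] = 0.0
    return mask

# Usage in a logits processor: scores = scores + build_mask(allowed_ids, scores)
```

Note that both branches still convert the Python list into a tensor, which is exactly the cost discussed in the next comment.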
I went down that same line of thinking - I don't think the timings above support it, however. It's getting the Python List into a Tensor that seems to be 80%+ of the cost per iteration. So, short of data structure changes upstream, my current thinking is that we're left with iteration optimisations - can we avoid going back to fsm.allowed_token_ids in certain situations? Not sure on that yet - still learning how this all fits together. |
Are the PRs for this issue currently stalled due to competing priorities? |
For each state in the automaton, outlines stores a tensor with a list of legal token IDs. However, we don't store these tensors on the GPU, so it shouldn't result in CUDA OOM. |
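Roughly, the structure being described is a per-state mapping kept on the CPU; the values below are illustrative only:

```python
# One entry of legal token IDs per FSM state (CPU-side; applied to logits at decode time).
states_to_token_ids = {
    0: [42, 87, 1523],   # tokens legal from the start state
    1: [13],             # e.g. only a closing-brace token is legal here
}
```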
I believe this issue has been fixed upstream in outlines in their v0.1.0 release. We just need to update Line 21 in 343f8e0
|
@Jason-CKY , unfortunately, that is not likely the root of the issue. The current problem is present whether lm-enforcer or outlines is used for constrained output in vLLM ( #3567 (comment) ). This indicates that the problem is in how vLLM is handling the logits processing. Most likely, it is due to the fact that they are using threads instead of processes. |
I see. Is there any work on this issue right now? I see from this thread that there is a draft PR #6900 that should help when using outlines, but it seems that there isn't any progress after the initial PR and test failure https://buildkite.com/vllm/fastcheck/builds/1264#0190ff17-7643-4664-8108-9d2abc4bf589/192-1046. Just wondering if there is anybody working on it, and if not I'll be happy to help out on this issue with some guidance :) |
@Jason-CKY No one appears to be working on it, so your help would be welcomed. Personally, if I had the time, I would start by using processes instead of threads here
My suspicion is that this is the bottleneck. It may be easier said than done, though, depending on which objects need to be serialized and whether they can be serialized. Further, I am not sure if any objects need to be shared, in which case you would need to introduce a lock. A rough sketch of the idea is below. |
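As a rough illustration of the threads-vs-processes idea (build_processor and schema are hypothetical names; both the construction arguments and the returned object would need to be picklable):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

_executor = ProcessPoolExecutor(max_workers=4)


def build_processor(schema: str):
    # Placeholder for the CPU-heavy FSM/index construction done per request.
    ...


async def get_processor_async(schema: str):
    loop = asyncio.get_running_loop()
    # run_in_executor accepts thread or process pools; with a process pool the
    # work runs in a separate interpreter and is not serialized by the GIL.
    return await loop.run_in_executor(_executor, build_processor, schema)
```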
Hi, I've done a latency test for JSON guided generation, and the fix from @lapp0 dottxt-ai/outlines#1013 seems to bring the latency/throughput figures on par with un-guided generation. There don't seem to be any other bottlenecks, at least from my preliminary investigation, that might be slowing down guided generation and would warrant further optimizations (threads vs processes). The test script as well as the results can be found here. For information, these results were run with an RTX A500 GPU, on Llama 3.2-3B-Instruct unquantized. The latency numbers are averaged over 10 runs in the table below.
The results are run using |
Confused, is that patch in vllm currently? |
No, that fix was in the outlines library; I patched outlines with the fix and compared the results. The fix was introduced in outlines v0.1.0, as I mentioned in a previous reply. |
The patch mentioned was merged 4 months ago into outlines, thus my confusion. If we're using the current version of outlines library then what's left to patch? |
The current version of vllm is not using the latest version of outlines. The merge 4 months ago from outlines was not properly released until 2 weeks ago (0.1.0), and the latest released version of vllm is pinned to version 0.0.46 of outlines |
My understanding is there is still work on the vllm side to consume the upstream changes. Notably:
These changes were ready in: #6900 - I'd need to refresh that PR against latest main. @Jason-CKY The test harness you've linked to above, whilst the batch sizes are changing for each test scenario, how many concurrent requests/threads are being tested in each batch size? From the quick glance I gave it, it wasn't obvious what the request sizes were. The current performance issue doesn't really manifest until you have multiple concurrent inference requests running within the batch. On our A100 test scenarios, we start to see the request duration increase at 3 concurrent requests on a 48 core machine. |
I was changing the batch size by calling the |
Yes, concurrency is the issue, not |
Indeed, concurrency is the problem. Here is a script to reproduce the bug if you want to understand @Jason-CKY : |
Thanks for the example! I have written another script that awaits multiple calls at the same time. Running the test with 30 concurrent calls with and without the patch, I got the following results. I ran each permutation over 10 runs and list the response times and average times:
I referenced https://github.com/dottxt-ai/outlines/pull/1013/files#diff-202c3676a40bf3fd70a140e8e4fa2959cb88548cf134a7f809ad50e0f6b4176d to implement the outlines patch. Specifically, I replaced the contents of
Again, I'm running all of this on
There is a big improvement in the latencies, but it is still significantly slower than without guidance. The high latency on the first run is due to the caching of FSM states for guidance. |
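For completeness, a harness along those lines might look like this (assumptions: a local vLLM OpenAI-compatible server, guided_json passed via extra_body, and a placeholder model name):

```python
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
SCHEMA = {"type": "object", "properties": {"test": {"type": "string"}}, "required": ["test"]}


async def one_request(i: int, guided: bool) -> float:
    start = time.perf_counter()
    await client.chat.completions.create(
        model="meta-llama/Llama-3.2-3B-Instruct",  # adjust to whatever model is served
        messages=[{"role": "user", "content": f"Return a JSON object, request {i}"}],
        max_tokens=128,
        extra_body={"guided_json": SCHEMA} if guided else {},
    )
    return time.perf_counter() - start


async def main(concurrency: int = 30, guided: bool = True) -> None:
    # Fire all requests at once so the server sees a concurrent batch, not a queue.
    latencies = await asyncio.gather(*(one_request(i, guided) for i in range(concurrency)))
    print(f"guided={guided} mean latency: {sum(latencies) / len(latencies):.2f}s")


if __name__ == "__main__":
    asyncio.run(main())
```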
I would love to have even this level of improvement in the base vllm. |
If we are going to bump the library, it might be worth it to consider using outlines-core instead (https://github.com/dottxt-ai/outlines-core) since it is written in Rust and more efficient. |
Facing the same issue |
Same here |
As always, a thumbs up is preferred over "me too" comments :) @robcaulk do you know if outlines-core is stable and ready for production use? Just trying to work out, if I find a free minute this weekend, where I should focus my attention :) |
@lynkz-matt-psaltis it's a good question. The creators of outlines publicly pinged vLLM when they announced this release (https://www.linkedin.com/posts/activity-7254738693855363072-PHIb). I suppose we can ping them here to ask: @lapp0 @rlouf, is outlines-core recommended now for production use? |
My best guess is that |
Is there a simple way to try this out with a custom built vllm image? |
I'm interested in contributing to improving the throughput/latency of guided decoding. Any pointers to specific areas where investigation would be most valuable, or what are the open challenges now? |
Any update on this? |
Wondering if there has been any progress on this? I'm still observing a roughly 50% increase in latency when using structured outputs with |
We now have integrated xgrammar in 0.6.5 as the default backend in supported cases. We also updated outlines to the latest version using its Rust core. Please give your benchmarks another try! |
Anything you want to discuss about vllm.
Hi,
I am running some benchmarks on the vllm.entrypoints.openai.api_server, measuring latency and throughput with different numbers of concurrent requests. Specs:
I am sending 1000 requests with random prompts of token length 512. These are the results I get (see attached image):
Guided_json
Non-guided_json
At 10 concurrent requests (GPU utilization << 100%):
Non-guided_json: ~20 ms median token time
guided_json: ~160 ms median token time
Currently the application I am building heavily relies on guided_json; however, to put it in an online setting, I would like to ask: 1) are the numbers I am seeing sensible, and 2) what can be done to improve performance in the guided_json paradigm?
I am debating whether I should try and prompt my way to structured outputs and thus avoiding constrained decoding.
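For reference, a minimal sketch of the two request variants being compared (assumes a local vllm.entrypoints.openai.api_server; model name, prompt, and schema are placeholders):

```python
import requests

SCHEMA = {"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}

payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.2",  # whatever model the server hosts
    "prompt": "Answer in JSON: what is the capital of France?",
    "max_tokens": 64,
}
plain = requests.post("http://localhost:8000/v1/completions", json=payload).json()

payload["guided_json"] = SCHEMA  # vLLM-specific extension that enables constrained decoding
guided = requests.post("http://localhost:8000/v1/completions", json=payload).json()

print(plain["choices"][0]["text"])
print(guided["choices"][0]["text"])
```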