[ray.data.llm] Add hint of how to optimize throughput #52634
Conversation
Signed-off-by: Linkun Chen <github@lkchen.net>
```python
# Core stage -- the vLLM engine.
if config.batch_size * config.max_concurrent_batches < DEFAULT_VLLM_BATCH_SIZE:
```
I don't get why DEFAULT_VLLM_BATCH_SIZE is set to 256, and why does this warning make sense?
256 comes from vLLM; I've refactored to always read it from vLLM instead of hardcoding.
This warning has two parts:
- The product of `batch_size` and `max_concurrent_batches` indicates the total number of concurrent prompts; if this product is too small, vLLM is under-utilized.
- I want users to increase `max_concurrent_batches` instead of `batch_size`, since the latter causes long-tail blocking.

Which part doesn't make sense to you? Could you clarify?
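A minimal sketch of the check being discussed (illustrative only, not the exact PR code; `maybe_warn_underutilized` is a hypothetical name, and the 256 default stands in for vLLM's `max_num_seqs`):

```python
import logging

logger = logging.getLogger(__name__)

def maybe_warn_underutilized(config, vllm_max_num_seqs: int = 256) -> None:
    # Total prompts that can be in flight at once from Ray Data's side.
    total_concurrent_prompts = config.batch_size * config.max_concurrent_batches
    if total_concurrent_prompts < vllm_max_num_seqs:
        logger.warning(
            "batch_size * max_concurrent_batches = %d is below the engine's "
            "max_num_seqs (%d); vLLM may be under-utilized. Prefer raising "
            "max_concurrent_batches over batch_size to avoid long-tail blocking.",
            total_concurrent_prompts,
            vllm_max_num_seqs,
        )
```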
OK, the explanation is clear now. The 256 really comes from vLLM's engine_kwargs; it's not hardcoded inside vLLM either. Basically you are saying Ray Data will adjust itself to the max_seq value set on the vLLM engine replica by adjusting max_concurrent_batches instead of the batch size. Can we get some reliable benchmark data points attached to this PR for different combos of batch_size and max_concurrent_batches to show the basis of this choice?
What I mean is that we should run a benchmark sweeping batch_size and max_concurrent_batches under similar max_seqs.
Basically:

```
for max_seq in [128, 256, 512]:
    for (bsize, max_concurrent_batches) in [(1, max_seq), (2, max_seq/2), ..., (max_seq, 1)]:
        # Measure: E2E runtime on a fixed dataset of, say, 10k rows
```

For baseline comparison, also measure E2E time with bsize=10k, max_concurrent_batches=1 at the same max_seq levels.
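A rough sketch of how that sweep could be driven, with the actual benchmark harness injected as a callable (`run_fixed_dataset` is hypothetical: it runs the fixed 10k-row dataset with the given batch_size, max_concurrent_batches, and max_num_seqs, and returns end-to-end seconds):

```python
from typing import Callable, Dict, Tuple

def sweep_pairs(max_seq: int):
    """Yield (batch_size, max_concurrent_batches) pairs whose product equals max_seq."""
    bsize = 1
    while bsize <= max_seq:
        yield bsize, max_seq // bsize
        bsize *= 2

def run_sweep(
    run_fixed_dataset: Callable[[int, int, int], float]
) -> Dict[Tuple[int, int, int], float]:
    """run_fixed_dataset(batch_size, max_concurrent_batches, max_num_seqs) -> E2E seconds."""
    results = {}
    for max_seq in (128, 256, 512):
        for bsize, mcb in sweep_pairs(max_seq):
            results[(max_seq, bsize, mcb)] = run_fixed_dataset(bsize, mcb, max_seq)
        # Baseline: submit the whole 10k-row dataset as one batch, no concurrency.
        results[(max_seq, 10_000, 1)] = run_fixed_dataset(10_000, 1, max_seq)
    return results
```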
```python
max_tasks_in_flight_per_actor=max(
    DEFAULT_MAX_TASKS_IN_FLIGHT, config.max_concurrent_batches
),
```
@raulchen if this is deprecated, what is the proper way to control max_tasks_in_flight_per_actor?
That's what the comment says: this deprecated field is the only way to control max_tasks_in_flight_per_actor.
As users of Ray Data, yes, but Ray Data should either not deprecate this or provide a more stable solution. I want to understand whether this is what is recommended for the issue above. cc @alexeykudinkin @gvspraveen @richardliaw
Unfortunately, we haven't exposed a new API for this yet.
I created a ticket here: #52667
For now, let's use the current approach.
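For reference, a minimal sketch of how the diff above could be wiring this up internally, assuming `DEFAULT_MAX_TASKS_IN_FLIGHT` is a module-level constant and that the actor pool size comes from `config.concurrency` (both assumptions, not confirmed by this thread):

```python
from ray.data import ActorPoolStrategy

DEFAULT_MAX_TASKS_IN_FLIGHT = 4  # assumed value, for illustration only

def build_compute_strategy(config) -> ActorPoolStrategy:
    # Queue at least `max_concurrent_batches` batches per actor; otherwise the
    # extra concurrent batches would just sit in the Ray Data scheduler and the
    # vLLM engine would never see them.
    return ActorPoolStrategy(
        size=config.concurrency,
        max_tasks_in_flight_per_actor=max(
            DEFAULT_MAX_TASKS_IN_FLIGHT, config.max_concurrent_batches
        ),
    )
```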
Signed-off-by: lkchen <github@lkchen.net>
Signed-off-by: Linkun Chen <github@lkchen.net>
Signed-off-by: Linkun Chen <github@lkchen.net>
@kouroshHakha yes, I started running max_seq=256 an hour ago
OK, discussed offline. With this PR we are basically enabling configuration of max_concurrency on UDF actor pools. By modifying bsize and max_concurrency we can shave the overhead (relative to async vLLM on a single replica) from roughly 20% down to roughly 10%. The remaining overhead must be Ray serialization, etc., which is an insignificant cost given the value of horizontal scaling. Both @lk-chen and I agree that we should put a pin on this and just be aware that on a single replica there could be an overhead of 7-10% compared to async vLLM.
Signed-off-by: Linkun Chen <github@lkchen.net> Signed-off-by: lkchen <github@lkchen.net> Signed-off-by: jhsu <jhsu@anyscale.com>
Why are these changes needed?
LLM tasks are usually long-running, and their durations vary a lot. This can easily cause a long-tail problem if the batch size is too large.
For example, within a batch, most prompts may have finished while one prompt keeps decoding, blocking the whole batch from finishing. If the long tail happens in all running batches, Ray Data cannot schedule more batches. The vLLM engine is then not saturated (it is only decoding one prompt from each batch, while vLLM could potentially handle 256 sequences concurrently), causing low throughput.
This PR adds a hint suggesting a small `batch_size` to avoid the long tail and a large `max_concurrent_batches` to saturate the engine.
Benchmarking on a 10k ShareGPT dataset, on an L40S GPU (vLLM 0.8.4, VLLM_USE_V1=0):
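For illustration, a hedged sketch of the configuration shape this hint points toward (the model name is a placeholder, the specific numbers are examples, and the model field is named `model` or `model_source` depending on the Ray version):

```python
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    engine_kwargs={"max_num_seqs": 256},
    batch_size=32,              # small batches keep the long tail short
    max_concurrent_batches=8,   # 32 * 8 = 256 prompts in flight, matching max_num_seqs
)
processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(max_tokens=256),
    ),
    postprocess=lambda row: dict(answer=row["generated_text"]),
)
```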
Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.