[Bugfix][TPU][V1] Fix recompilation #15553
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add the ready label to the PR. 🚀
nice job nicolo!
vllm/v1/worker/tpu_model_runner.py
Outdated
MIN_NUM_SEQS is 8; should this be rounded up to the nearest number divisible by MIN_NUM_SEQS?
Not necessarily; see capture_model: max_num_reqs still gets compiled anyway, due to how padding with an upper limit is implemented.
BTW, I'm OK with forcing everything to be nicely divisible by MIN_NUM_SEQS; it's just that I remember max_num_seqs used to be padded and then that was changed for a reason.
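For illustration, here is a minimal sketch of the kind of padding being discussed: a MIN_NUM_SEQS floor, power-of-two steps, and a clamp at max_num_reqs. The helper name and exact logic are assumptions for this sketch, not necessarily what tpu_model_runner.py does.

```python
# Hypothetical sketch of request-count padding (not the exact vLLM implementation):
# round the running number of requests up to the next power of two, never below
# MIN_NUM_SEQS, and clamp at max_num_reqs so the upper limit itself is always
# one of the pre-compiled sizes.
MIN_NUM_SEQS = 8


def padded_num_reqs(num_reqs: int, max_num_reqs: int) -> int:
    padded = max(MIN_NUM_SEQS, 1 << (num_reqs - 1).bit_length())
    return min(padded, max_num_reqs)


# With this scheme, max_num_reqs gets compiled even when it is not a power of two:
#   max_num_reqs = 64 -> padded sizes {8, 16, 32, 64}
#   max_num_reqs = 48 -> padded sizes {8, 16, 32, 48}
```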
It seems tests/tpu/test_compilation.py failed with this error: Processed prompts: 100%|████████████████████████████████████████████████████| 1/1 [00:45<00:00, 46.00s/it, est. speed input: 109.60 toks/s, output: 0.11 toks/s] WARNING 03-26 16:58:08 [parallel_state.py:1093] torch._C._host_emptyCache() only available in Pytorch >=2.5. Source: https://buildkite.com/vllm/fastcheck/builds/18374#0195d341-8f23-4eb8-bd34-998e6055b2ff
@NickLucche, would we be able to set --enforce-eager=False on all tests except test_compilation.py, please, to capture whether recompilation is fully resolved?
It's False by default.
Am I reading this wrong? Where should I be checking this otherwise?
vllm/v1/worker/tpu_model_runner.py
Outdated
@NickLucche, thank you for fixing this! Quick question: does this change actually fix the recompilation issue?
I have the same confusion. IMO, the code fixes the recompilation issue when max_num_reqs is not a power of 2, but in our tests it is already a power of 2.
.buildkite/run-tpu-v1-test.sh
Outdated
Does it make sense to also set enforce_eager=False in test_basic.py?
I think so. We should use the default value of enforce_eager (which is False) in most cases.
Indeed, it is currently set to True in test_basic.py.
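For reference, a small sketch of what relying on the default means in a test; the model name and arguments below are placeholders, not the actual test_basic.py code.

```python
# Illustrative only: most TPU tests should rely on the default enforce_eager=False
# so they exercise the compiled/XLA path (and therefore the recompilation behavior).
# The model name and max_num_seqs below are placeholder values.
from vllm import LLM

llm = LLM(model="Qwen/Qwen2-1.5B-Instruct", max_num_seqs=16)  # compiled path (default)

# Only a test that specifically needs eager execution should override the default:
# llm = LLM(model="Qwen/Qwen2-1.5B-Instruct", max_num_seqs=16, enforce_eager=True)
```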
This pull request has merge conflicts that must be resolved before it can be merged.
@NickLucche Thanks for fixing this. Left a few comments.
vllm/v1/worker/tpu_model_runner.py
Outdated
Why don't we use torch.compile for this method?
It's usually easier to know the boundary of the TPU computation and avoid recompilation if we wrap it inside torch.compile.
Once main is stable we'll turn that on; I'd like to do it in a separate PR. Last time around, compilation got slower, so I just wanted to be cautious.
I did an experiment based on your branch, clearing the XLA cache each time before execution.
Without torch.compile: sampler pre-compilation time is 35.79 [secs].
With torch.compile: sampler pre-compilation time is 35.51 [secs].
The compilation time difference is negligible. Meanwhile, torch.compile can speed up execution because its guard check is usually faster than torch/xla's graph trace.
Happy to add it back then! I'd just like to merge this PR first, if you don't mind; it still fixes the pre-compilation when max_num_reqs is not a power of 2.
Could you add it here in this PR? It's just a one-line code change.
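For context, a hedged sketch of roughly what such a change could look like; the sampler function, its signature, and the choice of the openxla backend are illustrative assumptions, not the actual code in tpu_model_runner.py.

```python
# Hypothetical sketch: wrapping the sampling step in torch.compile, as suggested above.
# `sample` and its arguments are placeholders; the real sampler in tpu_model_runner.py
# has a different signature and logic.
import torch


def sample(logits: torch.Tensor, temperature: torch.Tensor) -> torch.Tensor:
    # Simplified temperature scaling followed by argmax, for illustration only.
    scaled = logits / torch.clamp(temperature, min=1e-5)
    return torch.argmax(scaled, dim=-1)


# Compiling the function gives torch_xla a clear graph boundary (with an implicit
# mark_step at the end of the compiled region), which helps avoid recompilation.
compiled_sample = torch.compile(sample, backend="openxla", fullgraph=True, dynamic=False)
```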
vllm/v1/worker/tpu_model_runner.py
Outdated
We'd better add xm.mark_step() before this line, unless we use torch.compile.
Isn't it redundant? A sync to the CPU will still cause the graph to be flushed and executed.
TL;DR: xm.mark_step computes all pending tensors, while B.cpu() only computes the exact tensor B.
E.g., say we have the code:
A = ...
B = op(A)
The graph outputs generated by xm.mark_step() and B.cpu() are different.
For xm.mark_step(), we get both A and B as outputs.
For B.cpu(), only B is the output.
Then, if we call xm.mark_step() later, since A's result was not returned by the previous computation, we have to compute A again.
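To make the example above concrete, here is a minimal runnable sketch; the shapes and ops are arbitrary, only the mark_step-vs-.cpu() behavior is the point.

```python
# Illustrates the point above: B.cpu() materializes only B, while xm.mark_step()
# materializes every live lazy tensor (A and B), so A is not re-traced later.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

A = torch.ones(4, 4, device=device) * 2  # pending lazy op
B = A.sum()                              # B = op(A), also pending

# Option 1: sync only B to the host. The executed graph returns just B; A remains
# a pending computation that a later mark_step may have to compute again.
b_value = B.cpu()

# Option 2: cut the graph explicitly. Both A and B are materialized, so a later
# mark_step does not need to recompute A.
xm.mark_step()
```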
Thanks for explaining, this is really helpful.
I don't think that's the case here, as we only need the sampled output tokens from the sampler step, but I could also add it for completeness.
Although we don't intend to use the other tensors, a later xm.mark_step will still try to get their results, so we end up with redundant computation.
BTW, we get an implicit xm.mark_step when using torch.compile, which is one reason I recommend using torch.compile when possible.
And when we use torch.compile, we don't need so many xm.mark_step calls.
torch.compile is also much easier for PyTorch developers to understand.
So I've added the test you mentioned in the previous PR, and mentioned it in this thread too: #15309.
Another test worth running, which I mentioned on Slack but not here, is a benchmark with the recompilation check on. Feel free to ping me if you spot any case where the check fails!
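As a reference for what such a check could look like, here is a sketch based on torch_xla's debug metrics; the metric name "CompileTime", its return format, and the benchmark scaffolding are assumptions, not vLLM's actual check.

```python
# Hypothetical sketch of a recompilation check around a benchmark phase.
# Assumes torch_xla exposes a "CompileTime" metric via its debug metrics module;
# the tensor ops below stand in for a real serving/benchmark workload.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met


def num_compilations() -> int:
    data = met.metric_data("CompileTime")  # assumed to be (count, total, samples) or None
    return 0 if data is None else data[0]


device = xm.xla_device()

# Warm-up / pre-compilation: compiling graphs here is expected.
x = torch.randn(8, 128, device=device)
(x @ x.T).sum().cpu()
baseline = num_compilations()

# Steady-state benchmark: the same shapes should hit the compilation cache.
x = torch.randn(8, 128, device=device)
(x @ x.T).sum().cpu()

assert num_compilations() == baseline, "unexpected XLA recompilation after warm-up"
```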
Can you push a dummy commit to retry the doc build?
Thanks for the ping @DarkLight1337.
Thanks for looking into this @NickLucche! @yaochengji recommended this yesterday and it's a good idea: could we please set enforce-eager=False for all tests (like test_basic.py, where it is currently still set to True) except for test_compilation.py? This ensures all tests check for recompilation, so nothing slips through inadvertently in the future.
Sure, but it's a bit out of scope for this PR; I'd suggest we open another one.
Thanks again for your fix. Could you address my comments in this PR?
It appears the recompilation was caused by a misconfiguration when pre-compiling num_reqs.
Tests have been added to CI to confirm this upstream.