input_pos_maxp1 as a Python integer #2016
Merged · +18 −18
Hi there 👋
While I was profiling the code to see whether I could improve the speed of speculative decoding, I noticed a weird thing: a `cudaStreamSynchronize` call in `CausalSelfAttention.forward` from `model.py`.
Note: All numbers are provided for Qwen2.5-7B-Instruct on an Nvidia L4.
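One way to surface such syncs (not necessarily the exact profiling setup used here) is `torch.profiler`: a `cudaStreamSynchronize` entry shows up in the trace whenever a 0-dim CUDA tensor is used as a slice bound. The shapes below are illustrative, not the actual litgpt kv-cache:

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(1, 32, 4096, 128, device="cuda")  # made-up shape
end = torch.tensor(17, device="cuda")  # 0-dim CUDA tensor used as a slice bound

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = x[:, :, :end]  # forces an implicit .item() -> cudaStreamSynchronize

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```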
This is caused by an implicit call to `.item()` when slicing with a tensor: see `litgpt/litgpt/model.py`, lines 418 to 421 in 3d66f32.
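A minimal sketch of the pattern (shapes and names are illustrative, not the actual code from `model.py`):

```python
import torch

kv_cache = torch.randn(1, 8, 4096, 64, device="cuda")  # made-up shape

# Slicing with a 0-dim CUDA tensor: Tensor.__index__ calls .item() under the
# hood, which copies the scalar to the host and synchronizes the stream.
maxp1_tensor = torch.tensor(17, device="cuda")
keys = kv_cache[:, :, :maxp1_tensor]  # implicit .item() -> GPU/CPU sync

# Slicing with a plain Python int: no device-to-host copy, no sync.
maxp1_int = 17
keys = kv_cache[:, :, :maxp1_int]
```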
It's worth admitting that when @mseeger added `input_pos_maxp1`, he initially used it as a Python integer, but I insisted on changing it to a tensor so it would work properly with the rest of the code (for example, the function that moves arguments to a device in the sequential generation code). Now it's time to roll it back 😊.
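For illustration, a hypothetical version of such a device-moving helper (the real one in litgpt may differ) simply passes non-tensor arguments through, so keeping `input_pos_maxp1` as a plain int is not a problem:

```python
import torch

def move_to_device(arg, device):
    # Hypothetical helper: move tensors, pass Python scalars through as-is.
    return arg.to(device) if isinstance(arg, torch.Tensor) else arg
```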
In a quick benchmark, run multiple times by generating 500 new tokens, the speed improved by ~1 token/sec (16.19 tok/sec in this PR vs. 15.10 tok/sec on the main branch).