input_pos_maxp1 as a Python integer #2016
Merged · +18 −18
Hi there 👋
While I was profiling the code to see whether I could improve the speed of speculative decoding, I noticed a weird thing: a `cudaStreamSynchronize` call in `CausalSelfAttention.forward` from `model.py`.
Note: All numbers are provided for Qwen2.5-7B-Instruct on an Nvidia L4.
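One way to surface such syncs (not necessarily the exact profiling setup used here) is `torch.profiler`: a `cudaStreamSynchronize` entry shows up in the trace whenever a 0-dim CUDA tensor is used as a slice bound. The shapes below are illustrative, not the actual litgpt kv-cache:

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(1, 32, 4096, 128, device="cuda")  # made-up shape
end = torch.tensor(17, device="cuda")  # 0-dim CUDA tensor used as a slice bound

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = x[:, :, :end]  # forces an implicit .item() -> cudaStreamSynchronize

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```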
This is caused by an implicit call to `.item()` when slicing with a tensor: see `litgpt/litgpt/model.py`, lines 418 to 421 in 3d66f32.
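A minimal sketch of the pattern (shapes and names are illustrative, not the actual code from `model.py`):

```python
import torch

kv_cache = torch.randn(1, 8, 4096, 64, device="cuda")  # made-up shape

# Slicing with a 0-dim CUDA tensor: Tensor.__index__ calls .item() under the
# hood, which copies the scalar to the host and synchronizes the stream.
maxp1_tensor = torch.tensor(17, device="cuda")
keys = kv_cache[:, :, :maxp1_tensor]  # implicit .item() -> GPU/CPU sync

# Slicing with a plain Python int: no device-to-host copy, no sync.
maxp1_int = 17
keys = kv_cache[:, :, :maxp1_int]
```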
It's worth admitting that when @mseeger added `input_pos_maxp1`, he initially used it as a Python integer, but I insisted on changing it to a tensor so it would work properly with the rest of the code (for example, the function that moves arguments to a device in the sequential generation code). Now it's time to roll it back 😊.
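For illustration, a hypothetical version of such a device-moving helper (the real one in litgpt may differ) simply passes non-tensor arguments through, so keeping `input_pos_maxp1` as a plain int is not a problem:

```python
import torch

def move_to_device(arg, device):
    # Hypothetical helper: move tensors, pass Python scalars through as-is.
    return arg.to(device) if isinstance(arg, torch.Tensor) else arg
```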
In a quick benchmark, run multiple times by generating 500 new tokens, the speed improved by ~1 token/sec (16.19 tok/sec in this PR vs. 15.10 tok/sec on the main branch).