Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New automatic layers #1012

Merged
Merged
Changes from 8 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 11 additions & 14 deletions koboldcpp.py
Original file line number Diff line number Diff line change
Expand Up @@ -605,21 +605,18 @@ def autoset_gpu_layers(filepath,ctxsize,gpumem): #shitty algo to determine how m
csmul = 1.2
elif cs and cs > 2048:
csmul = 1.1
if mem < fsize*1.6*csmul:
ggufmeta = read_gguf_metadata(filepath)
Copy link
Owner

@LostRuins LostRuins Jul 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm the reason why I did this separately is to avoid performing read operations on files we know clearly fit into vram. For example, selecting tiny llama when you have 24gb vram - no need to even attempt to read header.

We can probably increase the threshold for this? To make it scale so that even at max permitted context we don't get cases where it refuses to read

Copy link
Author

@henk717 henk717 Jul 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not doing it separately reduces the complexity of the code since the previous cap is no longer a thing. Reading of the file happens so fast it seemed irrelevant to me to try and do rough guesses. It can be restored but then the max code has to be duplicated again.

My reason for it being if there is ever a new edge case where something is very odd this way we are sure to catch it.

if not ggufmeta or ggufmeta[0]==0: #fail to read or no layers
sizeperlayer = fsize*csmul*0.052
layerlimit = int(min(200,mem/sizeperlayer))
else:
layers = ggufmeta[0]
headcount = ggufmeta[1]
headkvlen = (ggufmeta[2] if ggufmeta[2] > 0 else 128)
ratio = mem/(fsize*csmul*1.5)
if headcount > 0:
ratio = max(ratio,mem/(fsize*1.34 + (layers*headcount*headkvlen*cs*4.25)))
layerlimit = int(ratio*layers)
ggufmeta = read_gguf_metadata(filepath)
if not ggufmeta or ggufmeta[0]==0: #fail to read or no layers
sizeperlayer = fsize*csmul*0.052
layerlimit = int(min(200,mem/sizeperlayer))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is good though, should probably do it this way. just that the logic to trigger the read can be a very relaxed condition

else:
layerlimit = 200 # assume full offload
layers = ggufmeta[0]
headcount = ggufmeta[1]
headkvlen = (ggufmeta[2] if ggufmeta[2] > 0 else 128)
ratio = mem/(fsize*csmul*1.5)
if headcount > 0:
ratio = max(ratio, mem - (1.5*1024*1024*1024) - (layers*4*headkvlen*cs*4*1.25))/(fsize + (layers*headcount*headkvlen*cs*4))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ratio = max(ratio, mem - (1.5*1024*1024*1024) - (layers*4*headkvlen*cs*4*1.25))/(fsize + (layers*headcount*headkvlen*cs*4))

I think the brackets are wrong. Right now you are calculating based off the compute buffer divided by the fsize, and mem is now just a constant which means your "ratio" is going to be waaay larger than 1, probably several hundred thousand. So it will be wrong. You probably misplaced the brackets

mem - (1.5*1024*1024*1024) - (layers*4*headkvlen*cs*4*1.25))/(fsize + (layers*headcount*headkvlen*cs*4) should be approximately 1 or similarly small number, as it's a ratio. Maybe like 0.1 to 3? Right now it will be a few hundred thousand.

Copy link
Author

@henk717 henk717 Jul 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had instances of that before in things I didn't commit but in the current builds I am getting correct outputs so that leads me to assume its correct.
Update: Does seem wrong indeed but getting the correct results. It may not matter in practise but ill test if the change behaves the same since it would look cleaner.

Copy link
Owner

@LostRuins LostRuins Jul 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing what you probably meant to do was

fixed_offset = (1.5*1024*1024*1024)
compute_buffer = (layers*4*headkvlen*cs*4)
ratio = max(ratio, (mem - fixed_offset - (compute_buffer*1.25))/fsize)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try regroup the numbers into additional variables for clarity as above

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regrouped, but as discussed on Discord its not the same formula as you assumed so the code will have differences from the above.

layerlimit = min(int(ratio*layers), (layers + 3))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like this, but im wondering it +3 may not be enough. I guess avoiding 200 is to make it seem more accurate?

My concern was that the layer count didnt match the actual "offloadable" layers due to kv cache and whatever else a model may use. Maybe +5 might be better?

Copy link
Author

@henk717 henk717 Jul 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+5 works if desired, I chose +3 because I can't remember any of the backends ever doing +5. The current value would be +1 and the historic value +3.
Ideally we want to have this the maximum the backends could possibly need so that its a proper visual indicator for the user. If you know an instance of for example a historic CUDA doing more than +3 we can bump it.

return layerlimit
except Exception as ex:
return 0
Expand Down