
New automatic layers #1012

Merged
28 changes: 14 additions & 14 deletions koboldcpp.py
@@ -605,21 +605,21 @@ def autoset_gpu_layers(filepath,ctxsize,gpumem): #shitty algo to determine how m
csmul = 1.2
elif cs and cs > 2048:
csmul = 1.1
if mem < fsize*1.6*csmul:
ggufmeta = read_gguf_metadata(filepath)
@LostRuins (Owner), Jul 21, 2024:

Hmm, the reason I did this separately is to avoid performing read operations on files we know clearly fit into VRAM. For example, selecting TinyLlama when you have 24 GB of VRAM: there is no need to even attempt to read the header.

We can probably increase the threshold for this, and make it scale so that even at the maximum permitted context we don't get cases where it refuses to read.
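The context-scaled trigger suggested above could look something like the following sketch; `MAX_CONTEXT`, the `2.5` multiplier, and the function name are illustrative assumptions, not values from the PR.

```python
# Hypothetical sketch of a context-scaled trigger for reading GGUF
# metadata; MAX_CONTEXT and the 2.5 factor are assumptions, not PR values.
MAX_CONTEXT = 131072  # assumed maximum permitted context size

def should_read_header(mem: int, fsize: int, cs: int) -> bool:
    """Return True when the file may not fit in VRAM and the header
    should be read; tiny models on large cards skip the read entirely."""
    csmul = 1.0 + (cs / MAX_CONTEXT)  # grows toward 2.0 at max context
    return mem < fsize * 2.5 * csmul

GIB = 1024 ** 3
# A ~1 GiB model on a 24 GiB card clearly fits: skip the read.
print(should_read_header(24 * GIB, 1 * GIB, 4096))  # False
# A ~4 GiB model on an 8 GiB card is borderline: read the header.
print(should_read_header(8 * GIB, 4 * GIB, 4096))   # True
```

Scaling the threshold with `cs` means a large requested context widens the "might not fit" band, so the metadata read is only skipped when the model is far smaller than available VRAM.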

@henk717 (Author), Jul 21, 2024:


Not doing it separately reduces the complexity of the code, since the previous cap is no longer a thing. Reading the file happens so fast that it seemed irrelevant to me to try rough guesses. It can be restored, but then the max code has to be duplicated again.

My reasoning is that if there is ever a new edge case where something is very odd, this way we are sure to catch it.

if not ggufmeta or ggufmeta[0]==0: #fail to read or no layers
sizeperlayer = fsize*csmul*0.052
layerlimit = int(min(200,mem/sizeperlayer))
else:
layers = ggufmeta[0]
headcount = ggufmeta[1]
headkvlen = (ggufmeta[2] if ggufmeta[2] > 0 else 128)
ratio = mem/(fsize*csmul*1.5)
if headcount > 0:
ratio = max(ratio,mem/(fsize*1.34 + (layers*headcount*headkvlen*cs*4.25)))
layerlimit = int(ratio*layers)
ggufmeta = read_gguf_metadata(filepath)
if not ggufmeta or ggufmeta[0]==0: #fail to read or no layers
sizeperlayer = fsize*csmul*0.052
layerlimit = int(min(200,mem/sizeperlayer))
@LostRuins (Owner):


This is good though; we should probably do it this way. It's just that the logic to trigger the read can be a very relaxed condition.

else:
layerlimit = 200 # assume full offload
layers = ggufmeta[0]
headcount = ggufmeta[1]
headkvlen = (ggufmeta[2] if ggufmeta[2] > 0 else 128)
ratio = mem/(fsize*csmul*1.5)
computemem = layers*4*headkvlen*cs*4*1.25 # For now the first 4 is the hardcoded result for a blasbatchsize of 512. Ideally we automatically calculate blasbatchsize / 4 but I couldn't easily grab the value yet - Henk
contextmem = layers*headcount*headkvlen*cs*4
reservedmem = 1.5*1024*1024*1024 # Users often don't have their GPU's VRAM worth of memory, we assume 500MB to avoid driver swapping + 500MB for the OS + 500MB for background apps / browser - Henk
if headcount > 0:
ratio = max(ratio, (mem - reservedmem - computemem) / (fsize + contextmem))
layerlimit = min(int(ratio*layers), (layers + 3))
@LostRuins (Owner):


I like this, but I'm wondering if +3 may not be enough. I guess avoiding 200 is to make it seem more accurate?

My concern was that the layer count didn't match the actual "offloadable" layers due to the KV cache and whatever else a model may use. Maybe +5 might be better?

@henk717 (Author), Jul 21, 2024:


+5 works if desired; I chose +3 because I can't remember any of the backends ever doing +5. The current value would be +1 and the historic value +3.
Ideally we want this to be the maximum the backends could possibly need, so that it's a proper visual indicator for the user. If you know of an instance of, for example, a historic CUDA build doing more than +3, we can bump it.
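For reference, the math in the new branch of the diff can be exercised in isolation. In the sketch below, the sample model and GPU (a ~4.1 GiB file, 32 layers, 32 KV heads, head dim 128, 24 GiB of VRAM, 4k context) are illustrative assumptions, not values from the PR.

```python
# Standalone sketch of the PR's layer-limit estimate; the sample model
# and GPU numbers below are illustrative assumptions.
def estimate_layer_limit(mem, fsize, layers, headcount, headkvlen, cs, csmul):
    reservedmem = 1.5 * 1024 ** 3                          # driver + OS + background apps
    computemem = layers * 4 * headkvlen * cs * 4 * 1.25    # blasbatchsize 512 assumed
    contextmem = layers * headcount * headkvlen * cs * 4   # KV cache bytes
    ratio = mem / (fsize * csmul * 1.5)
    if headcount > 0:
        ratio = max(ratio, (mem - reservedmem - computemem) / (fsize + contextmem))
    # Cap at layers + 3 so the suggestion never exceeds what a backend could use.
    return min(int(ratio * layers), layers + 3)

GIB = 1024 ** 3
# Assumed 7B-class model: ~4.1 GiB file, 32 layers, 32 KV heads, head dim 128.
print(estimate_layer_limit(24 * GIB, int(4.1 * GIB), 32, 32, 128, 4096, 1.1))
# 35, i.e. full offload (32 layers) plus the +3 headroom discussed above
```

With 24 GiB of VRAM the memory ratio comes out well above 1, so the cap of `layers + 3` is what actually determines the result; the +3 versus +5 question only changes the displayed headroom in this full-offload case.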

return layerlimit
except Exception as ex:
return 0