New automatic layers #1012
Changes from all commits: be74ae4, ea86b08, 2bf4c09, 2beb2dc, 29b83ea, cc6bb23, 6499c9d, 22b3975, d5dd00c
```diff
@@ -605,21 +605,21 @@ def autoset_gpu_layers(filepath,ctxsize,gpumem): #shitty algo to determine how many layers to use
                 csmul = 1.2
             elif cs and cs > 2048:
                 csmul = 1.1
-            if mem < fsize*1.6*csmul:
-                ggufmeta = read_gguf_metadata(filepath)
-                if not ggufmeta or ggufmeta[0]==0: #fail to read or no layers
-                    sizeperlayer = fsize*csmul*0.052
-                    layerlimit = int(min(200,mem/sizeperlayer))
-                else:
-                    layers = ggufmeta[0]
-                    headcount = ggufmeta[1]
-                    headkvlen = (ggufmeta[2] if ggufmeta[2] > 0 else 128)
-                    ratio = mem/(fsize*csmul*1.5)
-                    if headcount > 0:
-                        ratio = max(ratio,mem/(fsize*1.34 + (layers*headcount*headkvlen*cs*4.25)))
-                    layerlimit = int(ratio*layers)
-            else:
-                layerlimit = 200 # assume full offload
+            ggufmeta = read_gguf_metadata(filepath)
+            if not ggufmeta or ggufmeta[0]==0: #fail to read or no layers
+                sizeperlayer = fsize*csmul*0.052
+                layerlimit = int(min(200,mem/sizeperlayer))
+            else:
+                layers = ggufmeta[0]
+                headcount = ggufmeta[1]
+                headkvlen = (ggufmeta[2] if ggufmeta[2] > 0 else 128)
+                ratio = mem/(fsize*csmul*1.5)
+                computemem = layers*4*headkvlen*cs*4*1.25 # For now the first 4 is the hardcoded result for a blasbatchsize of 512. Ideally we automatically calculate blasbatchsize / 4 but I couldn't easily grab the value yet - Henk
+                contextmem = layers*headcount*headkvlen*cs*4
+                reservedmem = 1.5*1024*1024*1024 # Users often don't have their GPU's VRAM worth of memory, we assume 500MB to avoid driver swapping + 500MB for the OS + 500MB for background apps / browser - Henk
+                if headcount > 0:
+                    ratio = max(ratio, (mem - reservedmem - computemem) / (fsize + contextmem))
+                layerlimit = min(int(ratio*layers), (layers + 3))
         return layerlimit
     except Exception as ex:
         return 0
```

Review comment (on the now-unconditional `read_gguf_metadata` call):

This is good though, and we should probably do it this way. It's just that the logic to trigger the read can be a very relaxed condition.

Review comment (on `layerlimit = min(int(ratio*layers), (layers + 3))`):

I like this, but I'm wondering if +3 may not be enough. I guess avoiding 200 is to make it seem more accurate? My concern was that the layer count didn't match the actual "offloadable" layers due to the KV cache and whatever else a model may use. Maybe +5 might be better?

Reply: +5 works if desired. I chose +3 because I can't remember any of the backends ever doing +5. The current value would be +1 and the historic value +3.
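The new estimate in the diff above can be exercised on its own. Below is a minimal standalone sketch of the heuristic, where the function name and all input values (file size, VRAM, layer and head counts) are hypothetical examples, not values from the PR:

```python
# Standalone sketch of the PR's new layer-estimation heuristic.
# In koboldcpp, fsize comes from os.path.getsize() and the layer/head
# values from read_gguf_metadata(); everything below is made up.

def estimate_gpu_layers(fsize, cs, mem, layers, headcount, headkvlen, csmul):
    ratio = mem / (fsize * csmul * 1.5)
    # The first 4 approximates the compute buffer for a blasbatchsize of 512
    computemem = layers * 4 * headkvlen * cs * 4 * 1.25
    contextmem = layers * headcount * headkvlen * cs * 4  # rough KV-cache size
    reservedmem = 1.5 * 1024 * 1024 * 1024  # driver / OS / background-app headroom
    if headcount > 0:
        ratio = max(ratio, (mem - reservedmem - computemem) / (fsize + contextmem))
    return min(int(ratio * layers), layers + 3)

# Hypothetical 7B-class model: ~4.1 GB file, 32 layers, 32 KV heads,
# head dim 128, at 4096 context on an 8 GB GPU.
print(estimate_gpu_layers(fsize=4.1e9, cs=4096, mem=8 * 1024**3,
                          layers=32, headcount=32, headkvlen=128, csmul=1.2))
# -> 35, i.e. capped at layers + 3, which the caller treats as full offload
```

Note how the `layers + 3` cap replaces the old hardcoded 200: the returned number can now exceed the real layer count only by a small headroom, which is exactly what the review thread above is debating.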
Review comment (on dropping the `if mem < fsize*1.6*csmul` precheck):

Hmm, the reason why I did this separately was to avoid performing read operations on files we know clearly fit into VRAM. For example, when selecting TinyLlama with 24 GB of VRAM, there is no need to even attempt to read the header. We can probably increase the threshold for this, and make it scale, so that even at the maximum permitted context we never get cases where it refuses to read.
Reply:

Not doing it separately reduces the complexity of the code, since the previous cap is no longer a thing. Reading the file happens so fast that it seemed irrelevant to me to try to make rough guesses. It can be restored, but then the max code has to be duplicated again. My reason for keeping it this way is that if there is ever a new edge case where something is very odd, we are sure to catch it.
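If the precheck were restored with the more relaxed, scaling threshold suggested above, one possible shape is sketched below; the helper name, the 2.5 margin, and the 1.4 worst-case context multiplier are all illustrative assumptions, not values from this PR:

```python
# Hypothetical relaxed precheck (illustrative only): attempt the GGUF header
# read unless VRAM clearly dwarfs the file even at the largest context
# multiplier, in which case full offload can be assumed without any file I/O.
MAX_CSMUL = 1.4          # assumed worst-case context-size multiplier
SKIP_READ_MARGIN = 2.5   # assumed generous margin, not a value from the PR

def needs_metadata_read(fsize, mem):
    return mem < fsize * SKIP_READ_MARGIN * MAX_CSMUL
```

When `needs_metadata_read` returns False, the caller could return a full-offload sentinel directly (the old code used 200), which is the duplication the reply above wants to avoid.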