New automatic layers #1012
Conversation
This is the current version of the fsize algo based on Pyro's algorithm with added padding.
Add debugs and bump padding
Pyro didn't agree with my version, so here is a test with his version
This one cleans up some debug prints, restores the max behavior in case the old algorithm suits someone better, and changes the 200 layers to be the actual max for all backends so users have a better feel for the models.
The new version has been much more accurate; on low-VRAM systems I only notice a 1-layer difference. I'm getting rid of it so users can test whether it's still within safe margins, as I expect. On a 6GB system it results in 18 layers instead of 17 being chosen for Tiefighter.
I'm not confident that most people keep their VRAM usage under 1GB with background tasks. For now, since we are aiming to have it work on as many systems as possible, I restored the 500MB of extra space, since the fsize inflation is gone.
I'd also like to request that this not be a squash merge; I think it's important that the different attempts at the formula are in the history.
When using the auto predict we don't want to go over the maximum number of layers. Users should have a realistic feel for how large the model is. For example, when I was using the new auto guesser to communicate whether a larger model would fit on someone's system at a higher context, it originally made me think that the model had 60 layers; in reality it had fewer. This commit takes the layers of the model and adds 3 extra, since that is the highest number of additional layers a backend adds for the context handling (for most it's 1).
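A minimal sketch of that cap, assuming a hypothetical helper name (not the actual function in koboldcpp.py):

def cap_layer_guess(estimated_layers, model_layers, extra=3):
    # 'extra' covers the additional layers a backend may add for context handling
    # (usually 1 on current backends, historically up to 3)
    return min(estimated_layers, model_layers + extra)

So a raw guess of 60 for a model that really has 40 layers would be reported as 43 at most.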
Turns out at extreme contexts on new models such as Nemo, the old code incorrectly assumes we can offload everything. It's also redundant to check for max layers the old way since I capped our new guesses. The old code is now removed to simplify things, and it changed the Nemo guess from 43 layers to 15 layers. Still looking into the 15 part; it still seems too high, but that can be the old algorithm taking over.
Turns out the old code had a bug where even at 128K context it assumed it could fully fit Nemo; somehow that check wasn't being met, so it didn't do the calculations and thought it was safe to max out the layers.
It's not a bug; the old ctx scale measurement just didn't handle anything past 8k. That can be adjusted.
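One hedged way the scale could be adjusted to keep growing past 8k (the shape of this multiplier is purely an assumption for illustration):

import math

def context_scale_multiplier(cs):
    # roughly +10% per doubling of context beyond 2048; matches the existing 1.1/1.2 steps
    if not cs or cs <= 2048:
        return 1.0
    return 1.0 + 0.1 * math.log2(cs / 2048)

At 4096 this gives 1.1 and at 8192 it gives 1.2, like the current code, but it keeps increasing at 32k or 128k instead of plateauing.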
@@ -605,21 +605,18 @@ def autoset_gpu_layers(filepath,ctxsize,gpumem): #shitty algo to determine how m
    csmul = 1.2
elif cs and cs > 2048:
    csmul = 1.1
if mem < fsize*1.6*csmul:
    ggufmeta = read_gguf_metadata(filepath)
Hmm, the reason why I did this separately is to avoid performing read operations on files we know clearly fit into VRAM. For example, selecting TinyLlama when you have 24GB of VRAM: no need to even attempt to read the header.
We can probably increase the threshold for this? To make it scale so that even at the maximum permitted context we don't get cases where it refuses to read.
Not doing it separately reduces the complexity of the code, since the previous cap is no longer a thing. Reading the file happens so fast that it seemed irrelevant to me to try rough guesses. It can be restored, but then the max code has to be duplicated again.
My reasoning being: if there is ever a new edge case where something is very odd, this way we are sure to catch it.
ggufmeta = read_gguf_metadata(filepath)
if not ggufmeta or ggufmeta[0]==0: #fail to read or no layers
    sizeperlayer = fsize*csmul*0.052
    layerlimit = int(min(200,mem/sizeperlayer))
This is good though, we should probably do it this way. Just that the logic to trigger the read can be a very relaxed condition.
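A minimal sketch of such a relaxed trigger, where the slack factor is an assumed value rather than the one used in the code:

def should_read_gguf_header(mem, fsize, csmul, slack=2.5):
    # read the header whenever a generously padded size estimate might not fit in VRAM;
    # only clearly oversized-VRAM cases (e.g. a tiny model on a 24GB card) skip the read
    return mem < fsize * slack * csmul

Anything above that slack just offloads everything without touching the file, so the common "tiny model, huge card" case stays cheap.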
ratio = mem/(fsize*csmul*1.5)
if headcount > 0:
    ratio = max(ratio, mem - (1.5*1024*1024*1024) - (layers*4*headkvlen*cs*4*1.25))/(fsize + (layers*headcount*headkvlen*cs*4))
layerlimit = min(int(ratio*layers), (layers + 3))
I like this, but I'm wondering if +3 may not be enough. I guess avoiding 200 is to make it seem more accurate?
My concern was that the layer count didn't match the actual "offloadable" layers due to the KV cache and whatever else a model may use. Maybe +5 might be better?
+5 works if desired; I chose +3 because I can't remember any of the backends ever doing +5. The current value would be +1 and the historic value +3.
Ideally we want this to be the maximum the backends could possibly need, so that it's a proper visual indicator for the user. If you know an instance of, for example, a historic CUDA build doing more than +3, we can bump it.
koboldcpp.py (outdated)
headkvlen = (ggufmeta[2] if ggufmeta[2] > 0 else 128)
ratio = mem/(fsize*csmul*1.5)
if headcount > 0:
    ratio = max(ratio, mem - (1.5*1024*1024*1024) - (layers*4*headkvlen*cs*4*1.25))/(fsize + (layers*headcount*headkvlen*cs*4))
ratio = max(ratio, mem - (1.5*1024*1024*1024) - (layers*4*headkvlen*cs*4*1.25))/(fsize + (layers*headcount*headkvlen*cs*4))
I think the brackets are wrong. Right now you are calculating based off the compute buffer divided by fsize, and mem is now just a constant, which means your "ratio" is going to be way larger than 1, probably several hundred thousand. So it will be wrong. You probably misplaced the brackets.
(mem - (1.5*1024*1024*1024) - (layers*4*headkvlen*cs*4*1.25))/(fsize + (layers*headcount*headkvlen*cs*4))
should be approximately 1 or a similarly small number, as it's a ratio. Maybe like 0.1 to 3? Right now it will be a few hundred thousand.
I had instances of that before in things I didn't commit, but in the current builds I am getting correct outputs, so that leads me to assume it's correct.
Update: it does seem wrong indeed, but I'm getting the correct results. It may not matter in practice, but I'll test whether the change behaves the same, since it would look cleaner.
I'm guessing what you probably meant to do was
fixed_offset = (1.5*1024*1024*1024)
compute_buffer = (layers*4*headkvlen*cs*4)
ratio = max(ratio, (mem - fixed_offset - (compute_buffer*1.25))/fsize)
Let's try regrouping the numbers into additional variables for clarity, as above.
Regrouped, but as discussed on Discord it's not the same formula as you assumed, so the code will have differences from the above.
As requested, the different calculations in the algorithm now have their own sections and names, so it's easier to understand which parts are being used. This also fixes the typo that resulted from the code being harder to read; the typo made no difference during execution, and the algorithm is confirmed to still work the same.
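For illustration only, a hypothetical regrouping in that spirit; the actual merged formula in koboldcpp.py differs from this sketch and the variable names here are assumptions:

def estimate_layerlimit(mem, fsize, csmul, cs, layers, headcount, headkvlen):
    reservedmem = 1.5*1024*1024*1024            # VRAM kept free for other applications
    computebuf = layers*4*headkvlen*cs*4*1.25   # rough compute buffer estimate
    kvcache = layers*headcount*headkvlen*cs*4   # rough KV cache size
    ratio = mem/(fsize*csmul*1.5)
    if headcount > 0:
        ratio = max(ratio, (mem - reservedmem - computebuf)/(fsize + kvcache))
    return min(int(ratio*layers), layers + 3)   # never report more than the real layer count plus margin

Giving each VRAM contribution its own name makes a misplaced bracket like the one discussed above much easier to spot.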
Okay, in the interest of speeding things up for the next release, I'll merge this as-is first and then test and modify as needed later.
This is the work between me and Pyro; the PR contains the current iteration of the new algorithm, which gets very close to someone's maximum layer count but is still slightly conservative. We can possibly get rid of the fsize 1.1 multiplier, as a lower value is likely to work, especially at larger file sizes.
I also polished the safe 200-layer value. For now I didn't overhaul that entire logic because of the legacy models, but when 200 layers would be used, it will first see if it can get a more accurate number from the GGUF and, if so, do layers + 3. This gives the user a more realistic feel for how big the model is. I know that in more modern models this is 1 extra layer instead of 3, but I wanted to keep edge cases in mind since I know it was originally a +3.
Update: currently without the fsize 1.1 multiplier, as it appears it's no longer needed. On a 6GB GPU it gets 1 more layer of Tiefighter, which puts it at 18, within the safe margin.
Update 2: since I expect a sizable number of people to have background tasks using over 1GB, I restored the 500MB of extra space, since the fsize bloat no longer takes care of this.