New automatic layers #1012

Merged

Conversation

@henk717 commented Jul 19, 2024

This is joint work between me and Pyro. The PR contains the current iteration of the new algorithm, which gets very close to someone's maximum layer count while remaining slightly conservative. We can possibly get rid of the fsize 1.1 multiplier, as a lower value is likely to work, especially at larger file sizes.

I also polished the safe 200-layer value. For now I didn't overhaul this entire logic because of the legacy models, but if 200 layers would otherwise be used it will first try to get a more accurate number from the GGUF and, if it can, use layers + 3. This gives the user a more realistic feel for how big the model is. I know that on more modern models the extra is 1 layer instead of 3, but I wanted to keep edge cases in mind since I know it was originally a +3.

Update: Currently without the fsize 1.1 multiplier, as it appears it's no longer needed. On a 6GB GPU it gets one more layer of Tiefighter, which puts it at 18, still within the safe margin.

Update 2: Since I expect a sizable number of people to have background tasks using over 1GB, I restored the 500MB of extra headroom, since the fsize inflation no longer covers this.
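
To make the description above easier to follow, here is a rough sketch of the idea in plain Python. The function name and the exact arithmetic are illustrative only and not the code in koboldcpp.py; the 500MB reserve and the layers + 3 cap are the values discussed above.

def guess_gpu_layers(vram_bytes, file_bytes, model_layers=None):
    # Hypothetical sketch, not the actual koboldcpp.py implementation.
    reserve = 500 * 1024 * 1024            # keep ~500MB free for background tasks
    usable = max(0, vram_bytes - reserve)
    if model_layers is None:
        return 200                         # legacy fallback when the GGUF can't be read
    fit_ratio = usable / file_bytes        # rough fraction of the file that fits in VRAM
    guess = int(fit_ratio * model_layers)
    return min(guess, model_layers + 3)    # +3 covers the extra layers a backend may add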

This is the current version of the fsize algo based on Pyro's algorithm with added padding.
Add debugs and bump padding
Pyro didn't agree with my version, so here is a test with his version
This one cleans up some debug prints, restores the max behavior in case the old alg suits someone better and changes the 200 layers to be the actual max for all backends so users have a better feel for the models.
The new version has been much more accurate; on low-VRAM systems I only notice a 1-layer difference. Removing it so users can test whether it's still within safe margins, as I expect. On a 6GB system it results in 18 layers instead of 17 being chosen for Tiefighter.
I'm not confident most people keep their VRAM usage under 1GB with background tasks. Since we are aiming to have this work on as many systems as possible, I'm restoring the 500MB of extra space now that the fsize inflation is gone.
@henk717 (Author) commented Jul 20, 2024

I'd also like to request that this not be a squash merge; I think it's important that the different attempts at the formula remain in the history.

When using auto-predict we don't want to go over the maximum number of layers. Users should have a realistic feel for how large the model is.

For example, when I was using the new auto guesser to communicate whether a larger model would fit on someone's system at a higher context, it originally made me think the model had 60 layers. In reality it had fewer.

This commit takes the model's layer count and adds 3 extra, since that is the highest number of additional layers a backend adds for context handling (for most it's 1).
Turns out that at extreme contexts on new models such as Nemo, the old code incorrectly assumed we could offload everything. It's also redundant to check for max layers the old way since I capped our new guesses.

The old code is now removed to simplify things, and it changed the Nemo guess from 43 layers to 15 layers. Still looking into the 15 part; it still seems too high, but it could be the old algorithm taking over.
@henk717 (Author) commented Jul 20, 2024

Turns out the old code had a bug where even at 128K context it assumed it could fully fit Nemo; somehow that check wasn't being met, so it skipped the calculations and assumed it was safe to max the layers.
I removed that now, so every model is forced through the calculation to ensure the max is actually the max. The functions are so fast that this is negligible performance-wise, and it's a lot cleaner.

@LostRuins (Owner)

It's not a bug; the old ctx scale measurement just didn't handle anything past 8K. That can be adjusted.
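
As an illustration of what that adjustment could look like (only the cs > 2048 / 1.1 branch is visible in the diff below; the multipliers for larger contexts here are made-up placeholders, not values agreed in this PR):

# Hypothetical extension of the context-scale multiplier beyond 8K.
csmul = 1.0
if cs and cs > 8192:
    csmul = 1.4
elif cs and cs > 4096:
    csmul = 1.2
elif cs and cs > 2048:
    csmul = 1.1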

@@ -605,21 +605,18 @@ def autoset_gpu_layers(filepath,ctxsize,gpumem): #shitty algo to determine how m
csmul = 1.2
elif cs and cs > 2048:
csmul = 1.1
if mem < fsize*1.6*csmul:
ggufmeta = read_gguf_metadata(filepath)
@LostRuins (Owner) commented Jul 21, 2024

Hmm, the reason I did this separately is to avoid performing read operations on files we know clearly fit into VRAM. For example, selecting TinyLlama when you have 24GB of VRAM - no need to even attempt to read the header.

We can probably increase the threshold for this? Make it scale so that even at the maximum permitted context we don't get cases where it refuses to read.
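
A sketch of what a more relaxed trigger could look like, reusing the mem, fsize, csmul and read_gguf_metadata names from the diff above (the 2.5 multiplier is a placeholder, not a value agreed in this thread):

# Only skip reading the GGUF header when the file obviously fits with room
# to spare, even at the largest permitted context; otherwise always read it.
if mem < fsize * 2.5 * csmul:
    ggufmeta = read_gguf_metadata(filepath)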

@henk717 (Author) commented Jul 21, 2024

Not doing it separately reduces the complexity of the code, since the previous cap is no longer a thing. Reading the file happens so fast that it seemed irrelevant to me to try rough guesses. It can be restored, but then the max code has to be duplicated again.

My reason for keeping it is that if there is ever a new edge case where something is very odd, this way we are sure to catch it.

ggufmeta = read_gguf_metadata(filepath)
if not ggufmeta or ggufmeta[0]==0: #fail to read or no layers
sizeperlayer = fsize*csmul*0.052
layerlimit = int(min(200,mem/sizeperlayer))
@LostRuins (Owner)

This is good though; we should probably do it this way. It's just that the logic to trigger the read can be a very relaxed condition.

ratio = mem/(fsize*csmul*1.5)
if headcount > 0:
ratio = max(ratio, mem - (1.5*1024*1024*1024) - (layers*4*headkvlen*cs*4*1.25))/(fsize + (layers*headcount*headkvlen*cs*4))
layerlimit = min(int(ratio*layers), (layers + 3))
@LostRuins (Owner)

I like this, but I'm wondering if +3 may not be enough. I guess avoiding 200 is to make it seem more accurate?

My concern was that the layer count didn't match the actual "offloadable" layers due to the KV cache and whatever else a model may use. Maybe +5 might be better?

@henk717 (Author) commented Jul 21, 2024

+5 works if desired; I chose +3 because I can't remember any of the backends ever doing +5. The current value would be +1 and the historic value +3.
Ideally we want this to be the maximum the backends could possibly need, so that it's a proper visual indicator for the user. If you know of an instance of, for example, a historic CUDA build doing more than +3, we can bump it.

koboldcpp.py Outdated
headkvlen = (ggufmeta[2] if ggufmeta[2] > 0 else 128)
ratio = mem/(fsize*csmul*1.5)
if headcount > 0:
ratio = max(ratio, mem - (1.5*1024*1024*1024) - (layers*4*headkvlen*cs*4*1.25))/(fsize + (layers*headcount*headkvlen*cs*4))
@LostRuins (Owner)

ratio = max(ratio, mem - (1.5*1024*1024*1024) - (layers*4*headkvlen*cs*4*1.25))/(fsize + (layers*headcount*headkvlen*cs*4))

I think the brackets are wrong. Right now you are calculating based off the compute buffer divided by the fsize, and mem is now just a constant, which means your "ratio" is going to be way larger than 1, probably several hundred thousand. So it will be wrong. You probably misplaced the brackets.

(mem - (1.5*1024*1024*1024) - (layers*4*headkvlen*cs*4*1.25))/(fsize + (layers*headcount*headkvlen*cs*4)) should be approximately 1 or a similarly small number, as it's a ratio. Maybe like 0.1 to 3? Right now it will be a few hundred thousand.
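
As a quick back-of-the-envelope check with made-up numbers (an 8GB card and a 5GB model file), the intended grouping does land in that range:

mem = 8 * 1024**3                      # example: 8 GiB of VRAM
fsize = 5 * 1024**3                    # example: 5 GiB model file
fixed_offset = 1.5 * 1024**3           # reserved VRAM
compute_buffer = 400 * 1024**2         # example: ~400 MiB compute buffer
kv_cache = 800 * 1024**2               # example: ~800 MiB KV cache
ratio = (mem - fixed_offset - compute_buffer * 1.25) / (fsize + kv_cache)
print(round(ratio, 2))                 # ~1.04, a plausible ratio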

@henk717 (Author) commented Jul 21, 2024

I had instances of that before in things I didn't commit, but in the current builds I am getting correct outputs, which led me to assume it's correct.
Update: It does indeed seem wrong, but it is producing the correct results. It may not matter in practice, but I'll test whether the change behaves the same, since it would look cleaner.

@LostRuins (Owner) commented Jul 21, 2024

I'm guessing what you probably meant to do was

fixed_offset = (1.5*1024*1024*1024)
compute_buffer = (layers*4*headkvlen*cs*4)
ratio = max(ratio, (mem - fixed_offset - (compute_buffer*1.25))/fsize)

@LostRuins (Owner)

Let's try regrouping the numbers into additional variables for clarity, as above.

@henk717 (Author)

Regrouped, but as discussed on Discord it's not the same formula as you assumed, so the code will have differences from the above.

As requested, the different calculations in the algorithm now have their own sections and names, so it's easier to understand which parts are being used. This also fixes the typo that was caused by it being harder to read; the typo made no difference during execution, and the algorithm is confirmed to still work the same.
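
Purely as an illustration of that regrouping (hypothetical names; as noted above, the merged code differs from the owner's earlier guess, so treat this as a shape only):

fixed_offset = 1.5 * 1024 * 1024 * 1024                   # VRAM held back for the OS and other apps
kv_cache = layers * headcount * headkvlen * cs * 4        # KV cache size estimate
compute_buffer = layers * 4 * headkvlen * cs * 4 * 1.25   # working buffers, with headroom
ratio = max(ratio, (mem - fixed_offset - compute_buffer) / (fsize + kv_cache))
layerlimit = min(int(ratio * layers), layers + 3)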
@LostRuins (Owner) left a comment

Okay, in the interest of speeding things up for the next release I'll merge this as-is first, then test and modify as needed later.

@LostRuins LostRuins merged commit e493f14 into LostRuins:concedo_experimental Jul 22, 2024