HIP runtime memory issue for Llama 3.1 70B F16. #18864
Comments
132GB of allocated device memory is a lot - just because you have that much physical memory does not mean that all of it can be allocated. We never even get through loading parameters in that trace before it runs out. The path that may have better luck is using parameter slabs instead of loading individual parameters.
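For illustration only, a minimal C++/HIP sketch of the parameter-slab idea (the helper names, alignment, and omitted error handling are made up here, not IREE's actual loader): sum the aligned parameter sizes, make one device allocation, and hand out offsets into it instead of issuing one hipMallocAsync per tensor.

```cpp
// Hypothetical sketch of a parameter "slab": one big device allocation that
// individual parameters are suballocated out of, instead of N small ones.
#include <hip/hip_runtime.h>

#include <cstddef>
#include <vector>

struct ParamView {
  void* device_ptr;  // view into the shared slab
  size_t size;       // logical parameter size in bytes
};

static size_t AlignUp(size_t n, size_t a) { return (n + a - 1) & ~(a - 1); }

// `sizes` are the parameter byte sizes; returns one view per parameter, all
// backed by a single hipMallocAsync call. Error handling omitted for brevity.
std::vector<ParamView> AllocateParameterSlab(const std::vector<size_t>& sizes,
                                             hipStream_t stream) {
  constexpr size_t kAlign = 256;  // assumed device buffer alignment
  size_t total = 0;
  for (size_t s : sizes) total += AlignUp(s, kAlign);

  void* slab = nullptr;
  (void)hipMallocAsync(&slab, total, stream);  // one allocation for everything

  std::vector<ParamView> views;
  size_t offset = 0;
  for (size_t s : sizes) {
    views.push_back({static_cast<char*>(slab) + offset, s});
    offset += AlignUp(s, kAlign);
  }
  return views;
}
```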
The device I'm using has approximately 200GB of memory. I updated the filebin with two additional files. I halved the number of attention layers in the model so the model would use approximately half the memory. The Tracy capture shows it failing at exactly the same point in the memory loading. To me, this looks like a memory/allocation bug.
Added two more files to the filebin with just one attention layer, and it did finally run. I would think this puts an upper bound on the remaining additional memory required at the 25GB showing up as the maximum in the Tracy profile, plus some small percentage on top of that for imperfect allocation.
The problem when running up against physical memory limits is that it's not something you can reason about as a simple sum: you can almost never use all of the physical memory on a system, and the more small/odd-sized allocations you make, the less likely you are to get anywhere near the total. See https://en.wikipedia.org/wiki/Fragmentation_(computing). The way this model is built and compiled produces too many allocations that run too close to the system limits. The allocations need to be coalesced/combined, and even then it may still have issues. Unfortunately, when you fly this close to the limits and get out-of-memory errors, the next step is to fix your algorithm/inputs: no compiler/runtime is magic, and whether any given one works is usually a coin toss. HIP (and the various layers of the stack it goes through) makes it several coin tosses, and the chance of every toss landing heads - your input program, the compiler, the runtime, HIP, the three layers under HIP between it and the hardware, and the hardware itself - is low. The trick is usually to reduce the number of coins you need to toss. All this doesn't mean there aren't bugs, just that triaging is going to take digging into what your model is actually doing, what the layers beneath it are doing, and what you can improve :(
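As a toy illustration of the fragmentation point (all of the numbers below are invented), total free memory can comfortably exceed a request while no single contiguous hole can satisfy it:

```cpp
// Toy illustration of why "sum of free memory" is not "largest allocatable
// block". The free-hole sizes are invented for the example.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  // Hypothetical free regions left behind after many odd-sized allocations,
  // in GiB: plenty of total space, but chopped into separate holes.
  std::vector<uint64_t> free_holes_gib = {40, 35, 20, 5, 3, 2};
  uint64_t total_free = 0;
  for (uint64_t h : free_holes_gib) total_free += h;
  uint64_t largest_hole =
      *std::max_element(free_holes_gib.begin(), free_holes_gib.end());

  uint64_t request_gib = 60;  // e.g. one large transient buffer
  std::cout << "total free: " << total_free << " GiB, largest hole: "
            << largest_hole << " GiB\n";
  // 105 GiB free in total, yet a 60 GiB contiguous request still fails.
  std::cout << "request of " << request_gib << " GiB "
            << (request_gib <= largest_hole ? "fits" : "does not fit") << "\n";
  return 0;
}
```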
@AWoloszyn can you have a look? Something is wrong at the low level here and it may be an independent problem, but hard to say exactly.
The total list of allocations made to hipMallocAsync is here (nothing is ever freed). Before we even try to make the final failed allocation, we have:
The callstack for the allocation looks like:
heh, yeah, that'll be a problem :P That 5.5min hipHostRegister call is pretty crazy - I bet it's paging in the entire memory-mapped file as it is pinning the memory - kind of defeats the purpose of streaming, but good to know!
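A rough sketch of what is likely happening in that 5.5min call (the file name and omitted error handling are placeholders): pinning a lazily memory-mapped file forces every page to be faulted in from disk before the driver can lock it, so the whole file effectively gets read up-front.

```cpp
// Sketch of why pinning a memory-mapped parameter file is so slow: the file
// name below is a placeholder. hipHostRegister must lock every page, which
// forces the OS to fault the whole mapping in from disk first, so a
// "streaming" mmap turns into a full upfront read.
#include <hip/hip_runtime.h>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
  const char* path = "model.irpa";  // placeholder parameter file
  int fd = open(path, O_RDONLY);
  if (fd < 0) return 1;
  struct stat st;
  fstat(fd, &st);

  // Lazy mapping: no data is read from disk yet.
  void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  if (base == MAP_FAILED) return 1;

  // Pinning the mapping touches and locks every page, so the entire file is
  // paged in here - this is where the multi-minute stall shows up.
  hipError_t err = hipHostRegister(base, st.st_size, hipHostRegisterDefault);

  if (err == hipSuccess) hipHostUnregister(base);
  munmap(base, st.st_size);
  close(fd);
  return 0;
}
```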
Tracking this a little bit higher in the stack:
After that I lose the allocation. Based on:
It looks like we are allocating all possible transient data up-front, but it's hard to see.
yeah, we suballocate, produce a max value, and then allocate that - if you pass --mlir-print-ir-before=iree-stream-schedule-allocation / --mlir-print-ir-after=iree-stream-schedule-allocation it'll make it easier to see what's mapping to what.
So, following through the IR (--mlir-print-ir-before=iree-stream-schedule-allocation): %cst_1
But it really looks like this enormous transient size is more related to the fact that the initializer looks like:
So we have to allocate all of the memory up-front to hold all of these results (and the results look MOSTLY like transposes of parameters). So there are maybe two problems, but when our weights are 130GB by themselves, we really can't afford to have any copies of parameters around at all, even if we solved the transient buffer problem (which we should be able to do by serializing all of these and moving the transient to the final destination between each dispatch, or even just writing directly into the final location).
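A conceptual sketch of the serialize-and-move approach (the InitStep/dispatch callback shape is hypothetical, not IREE's scheduler): run the initializer steps one at a time through a staging buffer sized to the largest single result, copying each result into its final resting place before the next dispatch, so the transient footprint is one result instead of all of them.

```cpp
// Conceptual sketch only: serialize initializer dispatches through one small
// reusable staging buffer instead of sizing a transient pool for every result.
#include <hip/hip_runtime.h>

#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

struct InitStep {
  std::function<void(void* staging, hipStream_t)> dispatch;  // writes result into staging
  void* final_dst;  // where the result must end up
  size_t size;      // result size in bytes
};

void RunInitializersSerially(const std::vector<InitStep>& steps,
                             hipStream_t stream) {
  // The staging buffer only needs to fit the largest single result.
  size_t staging_size = 0;
  for (const auto& s : steps) staging_size = std::max(staging_size, s.size);

  void* staging = nullptr;
  (void)hipMallocAsync(&staging, staging_size, stream);

  for (const auto& s : steps) {
    s.dispatch(staging, stream);                  // compute into staging
    hipMemcpyAsync(s.final_dst, staging, s.size,  // move to the final location
                   hipMemcpyDeviceToDevice, stream);
  }
  hipFreeAsync(staging, stream);
  hipStreamSynchronize(stream);
}
```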
Nice, you've found it - that's what I suspected. As you note, when models get this big (though I'd argue for anything deployed of any size) we need to be baking out initializers into new parameter files and not doing this at runtime. I've got that on my TODO list. I believe in this case we are lucking out.
Baking out the parameter pack would be good. But in this case, the intent at the model level was to not have any parameter transpositions -- even if the compiler did it, data movement of this size is expensive. So the modeling tools make an effort to minimize that. Of course, we may have gotten it precisely backwards. Or broken it in some other way. I need to get debug info fixed so this is all less opaque.
Yes there is:
But 1) it is significantly smaller (768 bytes) and 2) SEEMS to be used only in a very small subset of the initialization dispatches.
In our input MLIR we have this:
Which (as far as I can tell) is hoisted out.
Am I taking crazy pills? I could have sworn we were being smarter than this. This is pretty basic... Ok, wait. So the parameter at rest is already in an ideal layout, but it is being transposed to feed into a regular mm. Really, that op should (somehow) become an mm with a transposed rhs. In this case, we should be folding that transpose into the mm and then not hoisting anything. This is literally the most common sequence in ML inference and it just needs to be right. I have a feeling that this folding is only happening as part of fusion after hoisting or something.
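A tiny plain-C++ illustration of the transposed-RHS idea (not IREE codegen): the weight stays in its at-rest [N x K] layout and the matmul simply indexes it as if it were transposed, so no transposed copy of the parameter is ever materialized.

```cpp
// Sketch of an mm with a transposed RHS: the "transpose" is folded into the
// access pattern rather than materialized as a separate buffer.
#include <cstddef>
#include <vector>

// C[M x N] = A[M x K] * B^T, where B is stored row-major as [N x K].
void MatmulTransposedB(const std::vector<float>& A, const std::vector<float>& B,
                       std::vector<float>& C, size_t M, size_t N, size_t K) {
  for (size_t m = 0; m < M; ++m) {
    for (size_t n = 0; n < N; ++n) {
      float acc = 0.f;
      for (size_t k = 0; k < K; ++k) {
        // Index B as [n, k]: no [K x N] copy of the weight is ever created.
        acc += A[m * K + k] * B[n * K + k];
      }
      C[m * N + n] = acc;
    }
  }
}
```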
This looks like some kind of folding issue. That transpose should never become "unglued" and hoisted separately.
That's great news :) Thinking ahead to when cases worse than this arise, something we should do is add an analysis that forces stream partitioning to min-peak-memory when execution is happening transitively within an initializer. We want concurrency if we can get it (like here) but don't want to increase memory consumption more than required in the startup phase. This is controllable with an attribute today:
It would be good for this to work even without the folder, though, because we'll be reaching for (almost) exactly this pattern with data tiling.
Could you try with these flags:
They should help with some unnecessary broadcasts/transposes, etc. These are at this point effectively the default for the ROCm backend.
Same general error when running with those flags, at least on 405b. @aviator19941 said he would try out 70b.
@aviator19941 the default should fix it now, it seems - has anyone tried 70B on the main branch and seen the issue resolved?
I am still hitting this issue with 70B and 405B fp16 during the iree-run-module/benchmark command (for generating the benchmark/Tracy profile).
For generating MLIR:
Compilation:
Runtime:
IREE version: IREE compiler version 3.1.0rc20241218 @ 8ae1b54
@pdhirajkumarprasad I haven't tried 405b and I'm not sure if it has ever worked. I could be wrong, but I think 405b is too large to run unsharded on a single MI300X. I retried 70b with IREE at 83af679 and shark-ai at 7862ff8aef1cbc0ab5ceea48afebabef00402c09 and was able to get successful benchmark results.
I changed it to use 70b inputs/vmfb:
I ran this on an SPX machine. CPX has one-eighth the memory - possibly that's causing the OOM? Or maybe there were other processes eating up VRAM? I'm not sure what could be causing this discrepancy.
What happened?
When running with ROCm/HIP on an MI300X, I am encountering the following error:
When I examine the Tracy profile, the memory in use at the time of the crash is about 66% of the system memory, so it is rather confusing that it runs out of memory.
Steps to reproduce your issue
Using 70b_f16.mlir obtained from the SHARK-Platform model export process.
What component(s) does this issue relate to?
Runtime
Version information
Commit hash: 1e155cc
Additional context
Llama 3.1 8B F16 and Llama 3.1 70B Q4_1 both seem to not run into this issue.
MLIR and Tracy profile available here.