HIP runtime memory issue for Llama 3.1 70B F16. #18864

Open
KyleHerndon opened this issue Oct 21, 2024 · 26 comments
Assignees
Labels
bug 🐞 Something isn't working

Comments

@KyleHerndon
Contributor

KyleHerndon commented Oct 21, 2024

What happened?

When running with ROCM/HIP on an MI300x, I am encountering the following error:

RESOURCE_EXHAUSTED; HIP driver error 'hipErrorOutOfMemory' (2): out of memory; while invoking native function hal.device.queue.alloca; while calling import;

When I examine the tracy profile, the memory in use at the time of the crash is about 66% of the system memory, so it is confusing that it runs out of memory.

Steps to reproduce your issue

Using 70b_f16.mlir obtained from the SHARK-Platform model export process

iree-compile 70b_f16.mlir \
  --iree-hal-target-backends=rocm \
  --iree-hal-executable-debug-level=3 \
  --iree-hip-target=gfx942 \
  -o 70b_f16.vmfb

ROCR_VISIBLE_DEVICES=0 iree-run-module \
  --device=hip://0 \
  --hip_use_streams=true \
  --hip_allow_inline_execution=true \
  --device_allocator=caching \
  --module=./70b_f16.vmfb \
  --parameters=model=./llama70b_f16.irpa \
  --function=prefill_bs4 \
  --input=4x16xsi64 \
  --input=4xsi64 \
  --input=4x1xsi64 \
  --input=128x2621440xf16

What component(s) does this issue relate to?

Runtime

Version information

Commit hash: 1e155cc

Additional context

Neither Llama 3.1 8B F16 nor Llama 3.1 70B Q4_1 seems to run into this issue.
MLIR and tracy profile available here

@KyleHerndon KyleHerndon added the bug 🐞 Something isn't working label Oct 21, 2024
@benvanik
Collaborator

132GB of allocated device memory is a lot - just because you have that much physical memory does not mean that all of it can be allocated. We never even get through loading parameters in that trace before it runs out. The path that may have better luck is using parameter slabs instead of loading individual parameters (--iree-stream-resource-memory-model=discrete). But if you don't have >132GB of device-addressable memory then this model is unlikely to work. There's a fallback that could be done by putting parameters in managed memory and making the device demand-load them, but that's usually 10-100x slower and probably not what you want for anything but proof-of-life.
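A minimal sketch of the reproduction compile command with that flag added, assuming it can simply be appended to the existing flags (untested):

iree-compile 70b_f16.mlir \
  --iree-hal-target-backends=rocm \
  --iree-hal-executable-debug-level=3 \
  --iree-hip-target=gfx942 \
  --iree-stream-resource-memory-model=discrete \
  -o 70b_f16.vmfb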

@KyleHerndon
Contributor Author

The device I'm using has approximately 200GB of memory. I updated the filebin with two additional files. I halved the number of attention layers in the model so the model would use approximately half the memory. The tracy capture shows it failing at exactly the same point in the memory loading.

To me, this looks like a memory/allocation bug.

@KyleHerndon
Contributor Author

I added two more files to the filebin with just one attention layer, and it did finally run. I would think this puts an upper bound on the remaining additional memory required at the 25GB showing up as the maximum in the tracy profile, plus some small percentage on top of that for imperfect allocation.

@benvanik
Collaborator

The problem when running up against physical memory limits is that it's not something you can reason about as a sum: you can almost never use all of the physical memory on a system and the more small/odd-sized allocations you make the less likely you are to reach towards the total. See https://en.wikipedia.org/wiki/Fragmentation_(computing).

The way this model is built and compiled has too many allocations that run too close to the system limits. The allocations need to be coalesced/combined, and even then it may still have issues. Unfortunately, when you fly so close to the limits and get out-of-memory errors, the next step is to fix your algorithm/inputs; no compiler/runtime is magic, and whether one configuration works and another doesn't is usually a coin toss. HIP (and the various layers of the stack it goes through) makes it several coin tosses, and the chance of them all landing heads - between your input program, the compiler, the runtime, HIP, the 3 layers under HIP between it and the hardware, and the hardware itself - is low. The trick is usually to reduce the number of coins you need to toss.

All this doesn't mean there aren't bugs, just that triaging is going to take digging into what your model is actually doing, what the layers beneath it are doing, and what you can improve :(

The 70b_f16-one-rocm-prefill.tracy is kinda scary - hopefully you had a breakpoint set in that run and it didn't actually take HIP 5.5mins to say "nah, can't give you 2GB" - though it does look like that. That's insane. Have you tried rebooting/formatting/reinstalling that machine? Nothing on a computer nowadays should take 5.5mins to say "no"; to me that indicates a HIP bug.

@stellaraccident
Collaborator

@AWoloszyn can you have a look? Something is wrong at the low level here; it may be an independent problem, but it's hard to say exactly.

@AWoloszyn AWoloszyn self-assigned this Oct 25, 2024
@AWoloszyn
Contributor

The full list of allocations made through hipMallocAsync is here (nothing is ever freed).

Before we even try the final, failing allocation, we already have 141110050816 bytes (131GB) successfully allocated. However, the NEXT allocation requested (the one that fails) is 278010004096 bytes (260GB) by itself.

@AWoloszyn
Contributor

The callstack for the allocation looks like:

#0  iree_hal_hip_memory_pools_allocate (pools=0x555555a36300, stream=0x555555a857f0, pool=0, params=..., allocation_size=278010004096, out_buffer=0x7fffffff86f8) at /home/awoloszy/Development/iree/runtime/src/iree/hal/drivers/hip/memory_pools.c:232
#1  0x000055555564b6a0 in iree_hal_hip_device_queue_alloca (base_device=0x555555a361f0, queue_affinity=18446744073709551615, wait_semaphore_list=..., signal_semaphore_list=..., pool=0, params=..., allocation_size=278010004096, out_buffer=0x7fffffff86f8)     at /home/awoloszy/Development/iree/runtime/src/iree/hal/drivers/hip/hip_device.c:951
#2  0x0000555555605579 in iree_hal_device_queue_alloca (device=0x555555a361f0, queue_affinity=18446744073709551615, wait_semaphore_list=..., signal_semaphore_list=..., pool=0, params=..., allocation_size=278010004096, out_buffer=0x7fffffff86f8) at /home/awoloszy/Development/iree/runtime/src/iree/hal/device.c:97
#3  0x000055555579d090 in iree_hal_module_device_queue_alloca (stack=0x7fffffffb4d0, module=0x5555560e9e80, state=0x5555560ea010, args=0x7fffffff8a90, rets=0x7fffffff8a80) at /home/awoloszy/Development/iree/runtime/src/iree/modules/hal/module.c:1153
#4  0x00005555557bc9ff in iree_vm_shim_rIrriiiI_r (stack=0x7fffffffb4d0, flags=1, args_storage=..., rets_storage=..., target_fn=0x55555579ced0 <iree_hal_module_device_queue_alloca>, module=0x5555560e9e80, module_state=0x5555560ea010) at /home/awoloszy/Development/iree/runtime/src/iree/vm/shims.c:72
#5  0x00005555557b6e04 in iree_vm_native_module_issue_call (module=0x5555560e9e80, stack=0x7fffffffb4d0, callee_frame=0x5555560f0e10, flags=1, args_storage=..., rets_storage=...) at /home/awoloszy/Development/iree/runtime/src/iree/vm/native_module.c:364
#6  0x00005555557b69b7 in iree_vm_native_module_begin_call (self=0x5555560e9e80, stack=0x7fffffffb4d0, call=...) at /home/awoloszy/Development/iree/runtime/src/iree/vm/native_module.c:420
#7  0x000055555571b660 in iree_vm_bytecode_issue_import_call (stack=0x7fffffffb4d0, call=..., cconv_results=..., dst_reg_list=0x7ffff076bc5a, out_caller_frame=0x7fffffffb1a8, out_caller_registers=0x7fffffffb1c0) at /home/awoloszy/Development/iree/runtime/src/iree/vm/bytecode/dispatch.c:452
#8  0x0000555555719198 in iree_vm_bytecode_call_import (stack=0x7fffffffb4d0, module_state=0x5555560ea090, import_ordinal=2147483661, caller_registers=..., src_reg_list=0x7ffff076bc48, dst_reg_list=0x7ffff076bc5a, out_caller_frame=0x7fffffffb1a8, out_caller_registers=0x7fffffffb1c0) at /home/awoloszy/Development/iree/runtime/src/iree/vm/bytecode/dispatch.c:568
#9  0x000055555570eedb in iree_vm_bytecode_dispatch (stack=0x7fffffffb4d0, module=0x555555a36b40, current_frame=0x5555560ebef8, regs=..., call_results=...) at /home/awoloszy/Development/iree/runtime/src/iree/vm/bytecode/dispatch.c:1685
#10 0x0000555555703fe4 in iree_vm_bytecode_dispatch_begin (stack=0x7fffffffb4d0, module=0x555555a36b40, call=..., cconv_arguments=..., cconv_results=...) at /home/awoloszy/Development/iree/runtime/src/iree/vm/bytecode/dispatch.c:636
#11 0x00005555556fe8fa in iree_vm_bytecode_module_begin_call (self=0x555555a36b40, stack=0x7fffffffb4d0, call=...) at /home/awoloszy/Development/iree/runtime/src/iree/vm/bytecode/module.c:845
#12 0x00005555557ac6ee in iree_vm_context_run_function (context=0x5555562d0cb0, stack=0x7fffffffb4d0, module=0x555555a36b40, function_name=...) at /home/awoloszy/Development/iree/runtime/src/iree/vm/context.c:91
#13 0x00005555557ab790 in iree_vm_context_register_modules (context=0x5555562d0cb0, module_count=3, modules=0x7fffffffd648) at /home/awoloszy/Development/iree/runtime/src/iree/vm/context.c:596
#14 0x00005555557ab06a in iree_vm_context_create_with_modules (instance=0x555555a36910, flags=0, module_count=3, modules=0x7fffffffd648, allocator=..., out_context=0x7fffffffd610) at /home/awoloszy/Development/iree/runtime/src/iree/vm/context.c:340
#15 0x00005555555ff2b7 in iree_tooling_create_context_from_flags (instance=0x555555a36910, user_module_count=1, user_modules=0x7fffffffd980, default_device_uri=..., host_allocator=..., out_context=0x7fffffffd918, out_device=0x7fffffffd910, out_device_allocator=0x7fffffffd908) at /home/awoloszy/Development/iree/runtime/src/iree/tooling/context_util.c:625
#16 0x000055555560cd86 in iree_tooling_create_run_context (instance=0x555555a36910, default_device_uri=..., module_contents=..., host_allocator=..., out_context=0x7fffffffdc58, out_function=0x7fffffffdc48, out_device=0x7fffffffdc40, out_device_allocator=0x7fffffffdc38) at /home/awoloszy/Development/iree/runtime/src/iree/tooling/run_module.c:151
#17 0x000055555560c8c7 in iree_tooling_run_module_with_data (instance=0x555555a36910, default_device_uri=..., module_contents=..., host_allocator=..., out_exit_code=0x7fffffffdd34) at /home/awoloszy/Development/iree/runtime/src/iree/tooling/run_module.c:404
#18 0x000055555560c7fb in iree_tooling_run_module_from_flags (instance=0x555555a36910, host_allocator=..., out_exit_code=0x7fffffffdd34) at /home/awoloszy/Development/iree/runtime/src/iree/tooling/run_module.c:387
#19 0x00005555555f766b in main (argc=1, argv=0x7fffffffde88) at /home/awoloszy/Development/iree/tools/iree-run-module-main.c:43

@benvanik
Collaborator

benvanik commented Oct 25, 2024

heh, yeah, that'll be a problem :P
I'm going to bet that it's some hoisted initializers that are transposing every single parameter or something ridiculous (260=2*130, probably two copies of everything as it does stuff). It's running during __init after all parameters are loaded. May also be data tiling/packing, as that usually does a pad + a dispatch and would need 2 copies.

That 5.5min hipHostRegister call is pretty crazy - I bet it's paging in the entire memory mapped file as it is pinning the memory - kind of defeats the purpose of streaming, but good to know!

@AWoloszyn
Contributor

I have not seen the excessive hipHostRegister call; I can get to the error in a "reasonable" amount of time. It looks like the tracy profile also (helpfully) pointed this out for us.

I do know that if you do NOT specify ROCR_VISIBLE_DEVICES=<one device>, hipHostRegister can take a long time when there are many devices on the system (see the sketch at the end of this comment).

(tracy also calls out the allocation sizes, which I didn't notice before I broke out gdb)
[Image: tracy capture showing the allocation list and sizes]
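As an aside, a minimal sketch of pinning the run to a single device as mentioned above; the device index and the module/parameter paths reuse the reproduction command from this issue and are assumptions about the local setup:

ROCR_VISIBLE_DEVICES=0 iree-run-module \
  --device=hip://0 \
  --hip_use_streams=true \
  --device_allocator=caching \
  --module=./70b_f16.vmfb \
  --parameters=model=./llama70b_f16.irpa \
  --function=prefill_bs4 \
  --input=4x16xsi64 --input=4xsi64 --input=4x1xsi64 --input=128x2621440xf16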

@AWoloszyn AWoloszyn removed their assignment Oct 25, 2024
@AWoloszyn
Contributor

Tracking this a little bit higher in the stack:
With: --dump-compilation-phases-to=

//module.12.vm.mlir
      %ref_970 = vm.call @hal.device.queue.alloca(%__device_0, %c-1_1, %null_0, %ref_969, %zero, %c48, %c527363, %c278010004096) : (!vm.ref<!hal.device>, i64, !vm.ref<!hal.fence>, !vm.ref<!hal.fence>, i32, i32, i32, i64) -> !vm.ref<!hal.buffer>      
module @module {
//module.11.hal.mlir
  util.initializer {
     ...
   %1:723 - io_parameters.load<> ...... {
   }
    %fence_0 = hal.fence.create device(%__device_0 : !hal.device) flags("None") : !hal.fence
    %transient_buffer = hal.device.queue.alloca<%__device_0 : !hal.device> affinity(%c-1_i64) wait(%0) signal(%fence_0) pool(%c0_i64) type("DeviceVisible|DeviceLocal") usage("TransferSource|TransferTarget|Transfer|DispatchStorageRead|DispatchStorageWrite|DispatchStorage|SharingImmutable") : !hal.buffer{%c278010004096}
    %status = hal.fence.await until([%fence_0]) timeout_millis(%c-1_i32) : i32
//module.10.executable-targets.mlir
//module.9.executable-configurations.mlir
//module.8.executable-sources.mlir
//module.7.stream.mlir
module @module attributes {stream.affinity.default = #hal.device.affinity<@__device_0>} {
  ...
  util.initializer {
  ...
  %results:723, %result_timepoint = stream.parameter.load on(#hal.device.affinity<@__device_0>) {
  }
   %result, %result_timepoint_0 = stream.resource.alloca uninitialized on(#hal.device.affinity<@__device_0>) : !stream.resource<constant>{%c278010004096} => !stream.timepoint

After that I lose the allocation. Based on:

 %transient_buffer = hal.device.queue.alloca<%__device_0 : !hal.device> affinity(%c-1_i64) wait(%0) signal(%fence_0) pool(%c0_i64) type("DeviceVisible|DeviceLocal") 

It looks like we are allocating all possible transient data up-front, but it's hard to see.
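For reference, a sketch of how the per-phase dumps quoted above might be produced; the dump directory here is a placeholder:

mkdir -p /tmp/70b-phases
iree-compile 70b_f16.mlir \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  --dump-compilation-phases-to=/tmp/70b-phases \
  -o 70b_f16.vmfb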

@benvanik
Collaborator

yeah, we suballocate, produce a max value, and then allocate that - if you pass --mlir-print-ir-before=iree-stream-schedule-allocation / --mlir-print-ir-after=iree-stream-schedule-allocation it'll be easier to see what's mapping to what
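A sketch of what that might look like on the reproduction compile command, assuming the pass printers write to stderr as in stock MLIR:

iree-compile 70b_f16.mlir \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  --mlir-print-ir-before=iree-stream-schedule-allocation \
  --mlir-print-ir-after=iree-stream-schedule-allocation \
  -o 70b_f16.vmfb 2> schedule-allocation-ir.mlir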

@AWoloszyn
Contributor

AWoloszyn commented Oct 28, 2024

So, following %cst_1 through the IR dump (before=iree-stream-schedule-allocation): we grab the parameter from the parameter pack, transpose it, and then put it elsewhere in memory. That is one place where we are eating up quite a bit of memory: one copy for the parameter in the parameter pack, one for the temporary, and then the final location in memory.

module @module attributes {stream.affinity.default = #hal.device.affinity<@__device_0>} {
  ...
    %results:1291, %result_timepoint = stream.async.execute on(#hal.device.affinity<@__device_0>) with() -> ... {
      ..
      %cst_1 = stream.async.constant : !stream.resource<constant>{%c134217728} = #stream.parameter.named<"model"::"blk.0.attn_q.weight"> : tensor<8192x8192xf16>
      %1:5 = stream.async.concurrent with(....  %cst_1 as %arg1: !stream.resource<constant>{%c134217728} ... ) {
        ...
        %7 = stream.async.dispatch @_initializer_2_dispatch_0::@_initializer_2_dispatch_0_transpose_8192x8192_f16(%arg1[%c0 to %c134217728 for %c134217728]) : (!stream.resource<constant>{%c134217728}) -> !stream.resource<constant>{%c134217728}
        ...
       stream.yield .. %7 ... : ... !stream.resource<constant>{%c134217728}, ...
      }
     ...
     stream.yield %cst, %cst_0, .... %1#1 ...
  }
  util.global.store %results#164, @__hoisted_tensor_8192x8192xf16 : !stream.resource<constant>

But it really looks like this enormous transient size is more related to the fact that the initializer looks like

%result:1291 ... {
  %1:5 = stream.async.concurrent ....
  %3:561 = stream.async.concurrent ....
  %5:564 = stream.async.concurrent ...
  stream.yield (1291 values)
}

So we have to allocate all of the memory up-front to hold all of these results (and the results look MOSTLY like transposes of parameters).

So there are maybe 2 problems, but when our weights are 130GB by themselves, we really can't afford to keep any copies of parameters around at all, even if we solved the transient buffer problem (which we should be able to do by serializing all of these and moving the transient data to the final destination between each dispatch, or even just writing directly into the final location).

@benvanik
Collaborator

Nice, you've found it - that's what I suspected. As you note when models get this big (though I'd argue for anything deployed of any size) we need to be baking out initializers into new parameter files and not doing this at runtime. I've got that on my TODO list. I believe in this case we are lucking out - if the !stream.resource<constant> is the only allocation then that is transposing them into the target. Is there another transient (!stream.resource<transient>) allocation?

@stellaraccident
Collaborator

Baking out the parameter pack would be good. But in this case, the intent at the model level was to not have any parameter transpositions -- even if the compiler did it, data movement of this size is expensive. So the modeling tools make an effort to minimize that.

Of course, we may have gotten it precisely backwards. Or broken it in some other way.

I need to get debug info fixed so this is all less opaque.

@AWoloszyn
Contributor

AWoloszyn commented Oct 28, 2024

> Nice, you've found it - that's what I suspected. As you note when models get this big (though I'd argue for anything deployed of any size) we need to be baking out initializers into new parameter files and not doing this at runtime. I've got that on my TODO list. I believe in this case we are lucking out - if the !stream.resource<constant> is the only allocation then that is transposing them into the target. Is there another transient (!stream.resource<transient>) allocation?

Yes there is:

module.7.stream.mlir
%result_1, %result_timepoint_2 = stream.resource.alloca uninitialized on(#hal.device.affinity<@__device_0>) : !stream.resource<transient>{%c768} => !stream.timepoint

But 1) it is significantly smaller (768 bytes) and 2) SEEMS to be used only in a very small subset of the initialization dispatches.

@AWoloszyn
Contributor

> Baking out the parameter pack would be good. But in this case, the intent at the model level was to not have any parameter transpositions -- even if the compiler did it, data movement of this size is expensive. So the modeling tools make an effort to minimize that.
>
> Of course, we may have gotten it precisely backwards. Or broken it in some other way.
>
> I need to get debug info fixed so this is all less opaque.

In our input MLIR we have this:

util.func public @prefill_bs4$async(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view, %arg3: !hal.buffer_view, %arg4: !hal.fence, %arg5: !hal.fence) -> !hal.buffer_view attributes {inlining_policy = #util.inline.never, iree.abi.model = "coarse-fences", iree.abi.stub} {
...
%transposed = linalg.transpose ins(%__auto.blk.0.attn_q.weight : tensor<8192x8192xf16>) outs(%52 : tensor<8192x8192xf16>) permutation = [1, 0] 
...
%56 = linalg.matmul ins(%collapsed, %transposed : tensor<?x8192xf16>, tensor<8192x8192xf16>) outs(%55 : tensor<?x8192xf32>) -> tensor<?x8192xf32>

Which (as far as I can tell) is hoisted out in the HoistIntoGlobalsPass.

@stellaraccident
Collaborator

Am I taking crazy pills? I could have sworn we were being smarter than this. This is pretty basic...

Ok, wait. So the parameter at rest is already in an ideal layout, but it is being transposed to feed into a regular mm. Really, that op should (somehow) become an mm with a transposed rhs.

In this case, we should be folding that transpose into the mm and then not hoisting anything. This is literally the most common sequence in ml inference and just needs to be right. I have a feeling that this folding is only happening as part of fusion after hoisting or something.

@stellaraccident
Collaborator

This looks like some kind of folding issue. That transpose should never become "unglued" and hoisted separately.

@benvanik
Collaborator

That's great news :)

Thinking ahead to when cases worse than this arise: something we should do is add an analysis that forces stream partitioning to min-peak-memory when execution is happening transitively within an initializer. We want concurrency if we can get it (like here) but don't want to increase memory consumption more than required in the startup phase. This is controllable with an attribute today - stream.partitioning = #stream.partitioning_config<min-peak-memory> - and it can be put on any region op to influence everything nested within. We've needed an analysis that tracks function reachability for a while, and if we had it we could have a pass that goes and adds those annotations prior to ScheduleExecutionPass/ScheduleConcurrencyPass.

@qedawkins
Contributor

qedawkins commented Oct 28, 2024

It would be good for this to work even without the folder, though, because we'll be reaching for (almost) exactly this pattern with data tiling.

@MaheshRavishankar
Contributor

Could you try with these flags?

                --iree-dispatch-creation-enable-aggressive-fusion=true \
                --iree-global-opt-propagate-transposes=true \
                --iree-opt-aggressively-propagate-transposes=true \
                --iree-opt-data-tiling=false \
                --iree-preprocessing-pass-pipeline="builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))" \

They should help with some unnecessary broadcasts/transposes, etc. These are at this point effectively the defaults for the ROCm backend; I would use them all the time. (We will try to turn them on by default, but that will need work.) See the example command below.
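For example, applied to the reproduction command from this issue (a sketch, not a verified configuration):

iree-compile 70b_f16.mlir \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-data-tiling=false \
  --iree-preprocessing-pass-pipeline="builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))" \
  -o 70b_f16.vmfb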

@KyleHerndon
Contributor Author

Same general error when running with those flags, at least on 405b. @aviator19941 said he would try out 70b.

@IanWood1
Contributor

Adding --iree-opt-strip-assertions (which should be made the default soon) along with #19014 seems to resolve the issue for 70b.
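For example (a sketch assuming the flag is simply appended to the compile command used earlier in this thread):

iree-compile 70b_f16.mlir \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  --iree-opt-strip-assertions \
  -o 70b_f16.vmfb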

@kumardeepakamd
Contributor

@aviator19941 it seems the default should fix it now; has anyone tried 70B on the main branch and confirmed the issue is resolved?

@pdhirajkumarprasad

pdhirajkumarprasad commented Dec 19, 2024

While generating the benchmark/tracy profile, I am still hitting this issue with 70B and 405B fp16 during iree-run-module/iree-benchmark-module.

command:

for generating MLIR

python3 -m sharktank.examples.export_paged_llm_v1 \
  --bs=4 \
  --irpa-file=/data/llama3.1/weights/70b/fp16/llama3.1_70b_f16.irpa \
  --output-mlir=70b_fp16_prefill_nondecomposed.mlir \
  --output-config=70b_fp16_prefill_nondecomposed.json \
  --skip-decode

compilation

iree-compile 70b_fp16_prefill_nondecomposed.mlir \
  --iree-hip-target=gfx942 \
  -o=prefill_70b.vmfb \
  --iree-hal-target-device=hip \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-data-tiling=false \
  --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
  --iree-hal-indirect-command-buffers=true \
  --iree-stream-resource-memory-model=discrete \
  --iree-hip-legacy-sync=false \
  --iree-hal-memoization=true \
  --iree-opt-strip-assertions 

runtime

iree-benchmark-module \
  --hip_use_streams=true \
  --device_allocator=caching \
  --module=prefill_405b.vmfb \
  --parameters=model=/data/llama3.1/weights/405b/fp16/llama3.1_405b_fp16.irpa \
  --device=hip://4 \
  --function=prefill_bs4 \
  --input=@/data/llama3.1/weights/70b/prefill_args/tokens.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args/seq_lens.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args/seq_block_ids.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args/cache_state_f16.npy

iree version : IREE compiler version 3.1.0rc20241218 @ 8ae1b54

@IanWood1
Contributor

IanWood1 commented Dec 20, 2024

@pdhirajkumarprasad I haven't tried 405b and I'm not sure if it has ever worked. I could be wrong, but I think 405b is too large to run unsharded on a single mi300x.


I retried 70b with IREE at 83af679 and shark-ai at 7862ff8aef1cbc0ab5ceea48afebabef00402c09 and was able to get successful benchmark results with the same iree-compile command. However, I did have to change the iree-benchmark-module command:

iree-benchmark-module \
  --hip_use_streams=true \
  --device_allocator=caching \
  --module=prefill_405b.vmfb \
  --parameters=model=/data/llama3.1/weights/405b/fp16/llama3.1_405b_fp16.irpa \
  --device=hip://4 \
  --function=prefill_bs4 \
  --input=@/data/llama3.1/weights/70b/prefill_args/tokens.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args/seq_lens.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args/seq_block_ids.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args/cache_state_f16.npy

I changed it to use 70b inputs/vmfb:

iree-benchmark-module \
  --hip_use_streams=true \
  --device_allocator=caching \
  --module=prefill_70b.vmfb \
  --parameters=model=/data/llama3.1/weights/70b/fp16/llama3.1_70b_f16.irpa \
  --device=hip://2 \
  --function=prefill_bs4 \
  --input=@/data/llama3.1/weights/70b/prefill_args_bs4_128_stride_32/tokens.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args_bs4_128_stride_32/seq_lens.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args_bs4_128_stride_32/seq_block_ids.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args_bs4_128_stride_32/cs_f16.npy

I ran this on an SPX machine. CPX has one-eighth the memory, which could explain the OOM? Or maybe there were other processes eating up VRAM? I'm not sure what is causing this discrepancy.
