HIP runtime memory issue for Llama 3.1 70B F16. #18864

Open
KyleHerndon opened this issue Oct 21, 2024 · 26 comments
Assignees
Labels
bug 🐞 Something isn't working

Comments

@KyleHerndon
Contributor

KyleHerndon commented Oct 21, 2024

What happened?

When running with ROCM/HIP on an MI300x, I am encountering the following error:

RESOURCE_EXHAUSTED; HIP driver error 'hipErrorOutOfMemory' (2): out of memory; while invoking native function hal.device.queue.alloca; while calling import;

When I examine the tracy profile, the memory in use at the time of the crash is about 66% of the system memory, so it is confusing that it runs out of memory.

Steps to reproduce your issue

Using 70b_f16.mlir obtained from the SHARK-Platform model export process

iree-compile 70b_f16.mlir \
  --iree-hal-target-backends=rocm \
  --iree-hal-executable-debug-level=3 \
  --iree-hip-target=gfx942 \
  -o 70b_f16.vmfb

ROCR_VISIBLE_DEVICES=0 iree-run-module \
  --device=hip://0 \
  --hip_use_streams=true \
  --hip_allow_inline_execution=true \
  --device_allocator=caching \
  --module=./70b_f16.vmfb \
  --parameters=model=./llama70b_f16.irpa \
  --function=prefill_bs4 \
  --input=4x16xsi64 \
  --input=4xsi64 \
  --input=4x1xsi64 \
  --input=128x2621440xf16

What component(s) does this issue relate to?

Runtime

Version information

Commit hash: 1e155cc

Additional context

Neither Llama 3.1 8B F16 nor Llama 3.1 70B Q4_1 seems to run into this issue.
MLIR and tracy profile available here

@KyleHerndon KyleHerndon added the bug 🐞 Something isn't working label Oct 21, 2024
@benvanik
Collaborator

132GB of allocated device memory is a lot - just because you have that much physical memory does not mean that all of it can be allocated. We never even get through loading parameters in that trace before it runs out. The path that may have better luck is using parameter slabs instead of loading individual parameters (--iree-stream-resource-memory-model=discrete). But if you don't have >132GB of device-addressable memory then this model is unlikely to work. There's a fallback that could be done by putting parameters in managed memory and making the device demand-load them, but that's usually 10-100x slower and probably not what you want for anything but proof-of-life.
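A minimal sketch of the reproduction compile command with that flag added, assuming it can simply be appended to the existing flags (untested):

iree-compile 70b_f16.mlir \
  --iree-hal-target-backends=rocm \
  --iree-hal-executable-debug-level=3 \
  --iree-hip-target=gfx942 \
  --iree-stream-resource-memory-model=discrete \
  -o 70b_f16.vmfb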

@KyleHerndon
Contributor Author

The device I'm using has approximately 200GB of memory. I updated the filebin with two additional files. I halved the number of attention layers in the model so the model would use approximately half the memory. The tracy capture shows it failing at exactly the same point in the memory loading.

To me, this looks like a memory/allocation bug.

@KyleHerndon
Contributor Author

I added two more files to the filebin with just one attention layer, and it did finally run. I would think this puts an upper bound on the remaining additional memory required at the 25GB showing up as the maximum in the tracy profile, plus some small percentage on top of that for imperfect allocation.

@benvanik
Collaborator

The problem when running up against physical memory limits is that it's not something you can reason about as a sum: you can almost never use all of the physical memory on a system and the more small/odd-sized allocations you make the less likely you are to reach towards the total. See https://en.wikipedia.org/wiki/Fragmentation_(computing).

The way this model is built and compiled has too many allocations that run too close to the system limits. The allocations need to be coalesced/combined, and even then it may still have issues. Unfortunately, when you fly so close to the limits and get out-of-memory errors, the next step is to fix your algorithm/inputs; no compiler/runtime is magic, and whether one configuration works and another doesn't is usually a coin toss. HIP (and the various layers of the stack it goes through) makes it several coin tosses, and the chance of them all landing heads - between your input program, the compiler, the runtime, HIP, the 3 layers under HIP between it and the hardware, and the hardware itself - is low. The trick is usually to reduce the number of coins you need to toss.

All this doesn't mean there aren't bugs, just that triaging is going to take digging into what your model is actually doing, what the layers beneath it are doing, and what you can improve :(

The 70b_f16-one-rocm-prefill.tracy is kinda scary - hopefully you had a breakpoint set in that run and it didn't actually take HIP 5.5mins to say "nah, can't give you 2GB" - though it does look like that. That's insane. Have you tried rebooting/formatting/reinstalling that machine? Nothing on a computer nowadays should take 5.5mins to say "no"; to me that indicates a HIP bug.

@stellaraccident
Collaborator

@AWoloszyn can you have a look? Something is wrong at the low level here; it may be an independent problem, but it's hard to say exactly.

@AWoloszyn AWoloszyn self-assigned this Oct 25, 2024
@AWoloszyn
Contributor

The full list of allocations made through hipMallocAsync is here (nothing is ever freed).

Before we even try the final, failing allocation, we already have 141110050816 bytes (131GB) successfully allocated. However, the NEXT allocation requested (the one that fails) is 278010004096 bytes (260GB) by itself.

@AWoloszyn
Contributor

The callstack for the allocation looks like:

#0  iree_hal_hip_memory_pools_allocate (pools=0x555555a36300, stream=0x555555a857f0, pool=0, params=..., allocation_size=278010004096, out_buffer=0x7fffffff86f8) at /home/awoloszy/Development/iree/runtime/src/iree/hal/drivers/hip/memory_pools.c:232
#1  0x000055555564b6a0 in iree_hal_hip_device_queue_alloca (base_device=0x555555a361f0, queue_affinity=18446744073709551615, wait_semaphore_list=..., signal_semaphore_list=..., pool=0, params=..., allocation_size=278010004096, out_buffer=0x7fffffff86f8)     at /home/awoloszy/Development/iree/runtime/src/iree/hal/drivers/hip/hip_device.c:951
#2  0x0000555555605579 in iree_hal_device_queue_alloca (device=0x555555a361f0, queue_affinity=18446744073709551615, wait_semaphore_list=..., signal_semaphore_list=..., pool=0, params=..., allocation_size=278010004096, out_buffer=0x7fffffff86f8) at /home/awoloszy/Development/iree/runtime/src/iree/hal/device.c:97
#3  0x000055555579d090 in iree_hal_module_device_queue_alloca (stack=0x7fffffffb4d0, module=0x5555560e9e80, state=0x5555560ea010, args=0x7fffffff8a90, rets=0x7fffffff8a80) at /home/awoloszy/Development/iree/runtime/src/iree/modules/hal/module.c:1153
#4  0x00005555557bc9ff in iree_vm_shim_rIrriiiI_r (stack=0x7fffffffb4d0, flags=1, args_storage=..., rets_storage=..., target_fn=0x55555579ced0 <iree_hal_module_device_queue_alloca>, module=0x5555560e9e80, module_state=0x5555560ea010) at /home/awoloszy/Development/iree/runtime/src/iree/vm/shims.c:72
#5  0x00005555557b6e04 in iree_vm_native_module_issue_call (module=0x5555560e9e80, stack=0x7fffffffb4d0, callee_frame=0x5555560f0e10, flags=1, args_storage=..., rets_storage=...) at /home/awoloszy/Development/iree/runtime/src/iree/vm/native_module.c:364
#6  0x00005555557b69b7 in iree_vm_native_module_begin_call (self=0x5555560e9e80, stack=0x7fffffffb4d0, call=...) at /home/awoloszy/Development/iree/runtime/src/iree/vm/native_module.c:420
#7  0x000055555571b660 in iree_vm_bytecode_issue_import_call (stack=0x7fffffffb4d0, call=..., cconv_results=..., dst_reg_list=0x7ffff076bc5a, out_caller_frame=0x7fffffffb1a8, out_caller_registers=0x7fffffffb1c0) at /home/awoloszy/Development/iree/runtime/src/iree/vm/bytecode/dispatch.c:452
#8  0x0000555555719198 in iree_vm_bytecode_call_import (stack=0x7fffffffb4d0, module_state=0x5555560ea090, import_ordinal=2147483661, caller_registers=..., src_reg_list=0x7ffff076bc48, dst_reg_list=0x7ffff076bc5a, out_caller_frame=0x7fffffffb1a8, out_caller_registers=0x7fffffffb1c0) at /home/awoloszy/Development/iree/runtime/src/iree/vm/bytecode/dispatch.c:568
#9  0x000055555570eedb in iree_vm_bytecode_dispatch (stack=0x7fffffffb4d0, module=0x555555a36b40, current_frame=0x5555560ebef8, regs=..., call_results=...) at /home/awoloszy/Development/iree/runtime/src/iree/vm/bytecode/dispatch.c:1685
#10 0x0000555555703fe4 in iree_vm_bytecode_dispatch_begin (stack=0x7fffffffb4d0, module=0x555555a36b40, call=..., cconv_arguments=..., cconv_results=...) at /home/awoloszy/Development/iree/runtime/src/iree/vm/bytecode/dispatch.c:636
#11 0x00005555556fe8fa in iree_vm_bytecode_module_begin_call (self=0x555555a36b40, stack=0x7fffffffb4d0, call=...) at /home/awoloszy/Development/iree/runtime/src/iree/vm/bytecode/module.c:845
#12 0x00005555557ac6ee in iree_vm_context_run_function (context=0x5555562d0cb0, stack=0x7fffffffb4d0, module=0x555555a36b40, function_name=...) at /home/awoloszy/Development/iree/runtime/src/iree/vm/context.c:91
#13 0x00005555557ab790 in iree_vm_context_register_modules (context=0x5555562d0cb0, module_count=3, modules=0x7fffffffd648) at /home/awoloszy/Development/iree/runtime/src/iree/vm/context.c:596
#14 0x00005555557ab06a in iree_vm_context_create_with_modules (instance=0x555555a36910, flags=0, module_count=3, modules=0x7fffffffd648, allocator=..., out_context=0x7fffffffd610) at /home/awoloszy/Development/iree/runtime/src/iree/vm/context.c:340
#15 0x00005555555ff2b7 in iree_tooling_create_context_from_flags (instance=0x555555a36910, user_module_count=1, user_modules=0x7fffffffd980, default_device_uri=..., host_allocator=..., out_context=0x7fffffffd918, out_device=0x7fffffffd910, out_device_allocator=0x7fffffffd908) at /home/awoloszy/Development/iree/runtime/src/iree/tooling/context_util.c:625
#16 0x000055555560cd86 in iree_tooling_create_run_context (instance=0x555555a36910, default_device_uri=..., module_contents=..., host_allocator=..., out_context=0x7fffffffdc58, out_function=0x7fffffffdc48, out_device=0x7fffffffdc40, out_device_allocator=0x7fffffffdc38) at /home/awoloszy/Development/iree/runtime/src/iree/tooling/run_module.c:151
#17 0x000055555560c8c7 in iree_tooling_run_module_with_data (instance=0x555555a36910, default_device_uri=..., module_contents=..., host_allocator=..., out_exit_code=0x7fffffffdd34) at /home/awoloszy/Development/iree/runtime/src/iree/tooling/run_module.c:404
#18 0x000055555560c7fb in iree_tooling_run_module_from_flags (instance=0x555555a36910, host_allocator=..., out_exit_code=0x7fffffffdd34) at /home/awoloszy/Development/iree/runtime/src/iree/tooling/run_module.c:387
#19 0x00005555555f766b in main (argc=1, argv=0x7fffffffde88) at /home/awoloszy/Development/iree/tools/iree-run-module-main.c:43

@benvanik
Collaborator

benvanik commented Oct 25, 2024

heh, yeah, that'll be a problem :P
I'm going to bet that it's some hoisted initializers that are transposing every single parameter or something ridiculous (260=2*130, probably two copies of everything as it does stuff). It's running during __init after all parameters are loaded. May also be data tiling/packing, as that usually does a pad + a dispatch and would need 2 copies.

That 5.5min hipHostRegister call is pretty crazy - I bet it's paging in the entire memory mapped file as it is pinning the memory - kind of defeats the purpose of streaming, but good to know!

@AWoloszyn
Contributor

I have not seen the excessive hipHostRegister call; I can get to the error in a "reasonable" amount of time. It looks like the tracy profile also (helpfully) pointed this out for us.

I do know that if you do NOT specify ROCR_VISIBLE_DEVICES=<one device>, hipHostRegister can take a long time when there are many devices on the system (see the sketch at the end of this comment).

(tracy also calls out the allocation sizes, which I didn't notice before I broke out gdb)
[Image: tracy capture showing the allocation list and sizes]
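As an aside, a minimal sketch of pinning the run to a single device as mentioned above; the device index and the module/parameter paths reuse the reproduction command from this issue and are assumptions about the local setup:

ROCR_VISIBLE_DEVICES=0 iree-run-module \
  --device=hip://0 \
  --hip_use_streams=true \
  --device_allocator=caching \
  --module=./70b_f16.vmfb \
  --parameters=model=./llama70b_f16.irpa \
  --function=prefill_bs4 \
  --input=4x16xsi64 --input=4xsi64 --input=4x1xsi64 --input=128x2621440xf16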

@AWoloszyn AWoloszyn removed their assignment Oct 25, 2024
@AWoloszyn
Contributor

Tracking this a little bit higher in the stack:
With: --dump-compilation-phases-to=

//module.12.vm.mlir
      %ref_970 = vm.call @hal.device.queue.alloca(%__device_0, %c-1_1, %null_0, %ref_969, %zero, %c48, %c527363, %c278010004096) : (!vm.ref<!hal.device>, i64, !vm.ref<!hal.fence>, !vm.ref<!hal.fence>, i32, i32, i32, i64) -> !vm.ref<!hal.buffer>      
module @module {
//module.11.hal.mlir
  util.initializer {
     ...
   %1:723 - io_parameters.load<> ...... {
   }
    %fence_0 = hal.fence.create device(%__device_0 : !hal.device) flags("None") : !hal.fence
    %transient_buffer = hal.device.queue.alloca<%__device_0 : !hal.device> affinity(%c-1_i64) wait(%0) signal(%fence_0) pool(%c0_i64) type("DeviceVisible|DeviceLocal") usage("TransferSource|TransferTarget|Transfer|DispatchStorageRead|DispatchStorageWrite|DispatchStorage|SharingImmutable") : !hal.buffer{%c278010004096}
    %status = hal.fence.await until([%fence_0]) timeout_millis(%c-1_i32) : i32
//module.10.executable-targets.mlir
//module.9.executable-configurations.mlir
//module.8.executable-sources.mlir
//module.7.stream.mlir
module @module attributes {stream.affinity.default = #hal.device.affinity<@__device_0>} {
  ...
  util.initializer {
  ...
  %results:723, %result_timepoint = stream.parameter.load on(#hal.device.affinity<@__device_0>) {
  }
   %result, %result_timepoint_0 = stream.resource.alloca uninitialized on(#hal.device.affinity<@__device_0>) : !stream.resource<constant>{%c278010004096} => !stream.timepoint

After that I lose the allocation. Based on:

 %transient_buffer = hal.device.queue.alloca<%__device_0 : !hal.device> affinity(%c-1_i64) wait(%0) signal(%fence_0) pool(%c0_i64) type("DeviceVisible|DeviceLocal") 

It looks like we are allocating all possible transient data up-front, but it's hard to see.
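For reference, a sketch of how the per-phase dumps quoted above might be produced; the dump directory here is a placeholder:

mkdir -p /tmp/70b-phases
iree-compile 70b_f16.mlir \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  --dump-compilation-phases-to=/tmp/70b-phases \
  -o 70b_f16.vmfb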

@benvanik
Collaborator

yeah, we suballocate, produce a max value, and then allocate that - if you pass --mlir-print-ir-before=iree-stream-schedule-allocation / --mlir-print-ir-after=iree-stream-schedule-allocation it'll be easier to see what's mapping to what
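A sketch of what that might look like on the reproduction compile command, assuming the pass printers write to stderr as in stock MLIR:

iree-compile 70b_f16.mlir \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  --mlir-print-ir-before=iree-stream-schedule-allocation \
  --mlir-print-ir-after=iree-stream-schedule-allocation \
  -o 70b_f16.vmfb 2> schedule-allocation-ir.mlir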

@AWoloszyn
Contributor

AWoloszyn commented Oct 28, 2024

So, following %cst_1 through the IR dump (before=iree-stream-schedule-allocation): we grab the parameter from the parameter pack, transpose it, and then put it elsewhere in memory. That is one place where we are eating up quite a bit of memory: one copy for the parameter in the parameter pack, one for the temporary, and then the final location in memory.

module @module attributes {stream.affinity.default = #hal.device.affinity<@__device_0>} {
  ...
    %results:1291, %result_timepoint = stream.async.execute on(#hal.device.affinity<@__device_0>) with() -> ... {
      ..
      %cst_1 = stream.async.constant : !stream.resource<constant>{%c134217728} = #stream.parameter.named<"model"::"blk.0.attn_q.weight"> : tensor<8192x8192xf16>
      %1:5 = stream.async.concurrent with(....  %cst_1 as %arg1: !stream.resource<constant>{%c134217728} ... ) {
        ...
        %7 = stream.async.dispatch @_initializer_2_dispatch_0::@_initializer_2_dispatch_0_transpose_8192x8192_f16(%arg1[%c0 to %c134217728 for %c134217728]) : (!stream.resource<constant>{%c134217728}) -> !stream.resource<constant>{%c134217728}
        ...
       stream.yield .. %7 ... : ... !stream.resource<constant>{%c134217728}, ...
      }
     ...
     stream.yield %cst, %cst_0, .... %1#1 ...
  }
  util.global.store %results#164, @__hoisted_tensor_8192x8192xf16 : !stream.resource<constant>

But it really looks like this enormous transient size is more related to the fact that the initializer looks like

%result:1291 ... {
  %1:5 = stream.async.concurrent ....
  %3:561 = stream.async.concurrent ....
  %5:564 = stream.async.concurrent ...
  stream.yield (1291 values)
}

So we have to allocate all of the memory up-front to hold all of these results (and the results look MOSTLY like transposes of parameters).

So there are maybe 2 problems, but when our weights are 130GB by themselves, we really can't afford to keep any copies of parameters around at all, even if we solved the transient buffer problem (which we should be able to do by serializing all of these and moving the transient data to the final destination between each dispatch, or even just writing directly into the final location).

@benvanik
Collaborator

Nice, you've found it - that's what I suspected. As you note when models get this big (though I'd argue for anything deployed of any size) we need to be baking out initializers into new parameter files and not doing this at runtime. I've got that on my TODO list. I believe in this case we are lucking out - if the !stream.resource<constant> is the only allocation then that is transposing them into the target. Is there another transient (!stream.resource<transient>) allocation?

@stellaraccident
Collaborator

Baking out the parameter pack would be good. But in this case, the intent at the model level was to not have any parameter transpositions -- even if the compiler did it, data movement of this size is expensive. So the modeling tools make an effort to minimize that.

Of course, we may have gotten it precisely backwards. Or broken it in some other way.

I need to get debug info fixed so this is all less opaque.

@AWoloszyn
Contributor

AWoloszyn commented Oct 28, 2024

> Nice, you've found it - that's what I suspected. As you note when models get this big (though I'd argue for anything deployed of any size) we need to be baking out initializers into new parameter files and not doing this at runtime. I've got that on my TODO list. I believe in this case we are lucking out - if the !stream.resource<constant> is the only allocation then that is transposing them into the target. Is there another transient (!stream.resource<transient>) allocation?

Yes there is:

module.7.stream.mlir
%result_1, %result_timepoint_2 = stream.resource.alloca uninitialized on(#hal.device.affinity<@__device_0>) : !stream.resource<transient>{%c768} => !stream.timepoint

But 1) it is significantly smaller (768 bytes) and 2) SEEMS to be used only in a very small subset of the initialization dispatches.

@AWoloszyn
Contributor

> Baking out the parameter pack would be good. But in this case, the intent at the model level was to not have any parameter transpositions -- even if the compiler did it, data movement of this size is expensive. So the modeling tools make an effort to minimize that.
>
> Of course, we may have gotten it precisely backwards. Or broken it in some other way.
>
> I need to get debug info fixed so this is all less opaque.

In our input MLIR we have this:

util.func public @prefill_bs4$async(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view, %arg3: !hal.buffer_view, %arg4: !hal.fence, %arg5: !hal.fence) -> !hal.buffer_view attributes {inlining_policy = #util.inline.never, iree.abi.model = "coarse-fences", iree.abi.stub} {
...
%transposed = linalg.transpose ins(%__auto.blk.0.attn_q.weight : tensor<8192x8192xf16>) outs(%52 : tensor<8192x8192xf16>) permutation = [1, 0] 
...
%56 = linalg.matmul ins(%collapsed, %transposed : tensor<?x8192xf16>, tensor<8192x8192xf16>) outs(%55 : tensor<?x8192xf32>) -> tensor<?x8192xf32>

Which (as far as I can tell) is hoisted out in the HoistIntoGlobalsPass.

@stellaraccident
Collaborator

Am I taking crazy pills? I could have sworn we were being smarter than this. This is pretty basic...

Ok, wait. So the parameter at rest is already in an ideal layout, but it is being transposed to feed into a regular mm. Really, that op should (somehow) become an mm with a transposed rhs.

In this case, we should be folding that transpose into the mm and then not hoisting anything. This is literally the most common sequence in ml inference and just needs to be right. I have a feeling that this folding is only happening as part of fusion after hoisting or something.

@stellaraccident
Collaborator

This looks like some kind of folding issue. That transpose should never become "unglued" and hoisted separately.

@benvanik
Collaborator

That's great news :)

Thinking ahead to when cases worse than this arise: something we should do is add an analysis that forces stream partitioning to min-peak-memory when execution is happening transitively within an initializer. We want concurrency if we can get it (like here) but don't want to increase memory consumption more than required in the startup phase. This is controllable with an attribute today - stream.partitioning = #stream.partitioning_config<min-peak-memory> - and it can be put on any region op to influence everything nested within. We've needed an analysis that tracks function reachability for a while, and if we had it we could have a pass that goes and adds those annotations prior to ScheduleExecutionPass/ScheduleConcurrencyPass.

@qedawkins
Contributor

qedawkins commented Oct 28, 2024

It would be good for this to work even without the folder, though, because we'll be reaching for (almost) exactly this pattern with data tiling.

@MaheshRavishankar
Contributor

Could you try with these flags?

                --iree-dispatch-creation-enable-aggressive-fusion=true \
                --iree-global-opt-propagate-transposes=true \
                --iree-opt-aggressively-propagate-transposes=true \
                --iree-opt-data-tiling=false \
                --iree-preprocessing-pass-pipeline="builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))" \

They should help with some unnecessary broadcasts/transposes, etc. These are at this point effectively the defaults for the ROCm backend; I would use them all the time. (We will try to turn them on by default, but that will need work.) See the example command below.
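For example, applied to the reproduction command from this issue (a sketch, not a verified configuration):

iree-compile 70b_f16.mlir \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-data-tiling=false \
  --iree-preprocessing-pass-pipeline="builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))" \
  -o 70b_f16.vmfb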

@KyleHerndon
Contributor Author

Same general error when running with those flags, at least on 405b. @aviator19941 said he would try out 70b.

@IanWood1
Contributor

Adding --iree-opt-strip-assertions (which should be made the default soon) along with #19014 seems to resolve the issue for 70b.
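For example (a sketch assuming the flag is simply appended to the compile command used earlier in this thread):

iree-compile 70b_f16.mlir \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  --iree-opt-strip-assertions \
  -o 70b_f16.vmfb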

@kumardeepakamd
Contributor

@aviator19941 it seems the default should fix it now; has anyone tried 70B on the main branch and confirmed the issue is resolved?

@pdhirajkumarprasad

pdhirajkumarprasad commented Dec 19, 2024

While generating the benchmark/tracy profile, I am still hitting this issue with 70B and 405B fp16 during iree-run-module/iree-benchmark-module.

command:

for generating MLIR

python3 -m sharktank.examples.export_paged_llm_v1 \
  --bs=4 \
  --irpa-file=/data/llama3.1/weights/70b/fp16/llama3.1_70b_f16.irpa \
  --output-mlir=70b_fp16_prefill_nondecomposed.mlir \
  --output-config=70b_fp16_prefill_nondecomposed.json \
  --skip-decode

compilation

iree-compile 70b_fp16_prefill_nondecomposed.mlir \
  --iree-hip-target=gfx942 \
  -o=prefill_70b.vmfb \
  --iree-hal-target-device=hip \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-data-tiling=false \
  --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
  --iree-hal-indirect-command-buffers=true \
  --iree-stream-resource-memory-model=discrete \
  --iree-hip-legacy-sync=false \
  --iree-hal-memoization=true \
  --iree-opt-strip-assertions 

runtime

iree-benchmark-module \
  --hip_use_streams=true \
  --device_allocator=caching \
  --module=prefill_405b.vmfb \
  --parameters=model=/data/llama3.1/weights/405b/fp16/llama3.1_405b_fp16.irpa \
  --device=hip://4 \
  --function=prefill_bs4 \
  --input=@/data/llama3.1/weights/70b/prefill_args/tokens.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args/seq_lens.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args/seq_block_ids.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args/cache_state_f16.npy

iree version : IREE compiler version 3.1.0rc20241218 @ 8ae1b54

@IanWood1
Contributor

IanWood1 commented Dec 20, 2024

@pdhirajkumarprasad I haven't tried 405b and I'm not sure if it has ever worked. I could be wrong, but I think 405b is too large to run unsharded on a single mi300x.


I retried 70b with IREE at 83af679 and shark-ai at 7862ff8aef1cbc0ab5ceea48afebabef00402c09 and was able to get successful benchmark results with the same iree-compile command. However, I did have to change the iree-benchmark-module command:

iree-benchmark-module \
  --hip_use_streams=true \
  --device_allocator=caching \
  --module=prefill_405b.vmfb \
  --parameters=model=/data/llama3.1/weights/405b/fp16/llama3.1_405b_fp16.irpa \
  --device=hip://4 \
  --function=prefill_bs4 \
  --input=@/data/llama3.1/weights/70b/prefill_args/tokens.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args/seq_lens.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args/seq_block_ids.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args/cache_state_f16.npy

I changed it to use 70b inputs/vmfb:

iree-benchmark-module \
  --hip_use_streams=true \
  --device_allocator=caching \
  --module=prefill_70b.vmfb \
  --parameters=model=/data/llama3.1/weights/70b/fp16/llama3.1_70b_f16.irpa \
  --device=hip://2 \
  --function=prefill_bs4 \
  --input=@/data/llama3.1/weights/70b/prefill_args_bs4_128_stride_32/tokens.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args_bs4_128_stride_32/seq_lens.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args_bs4_128_stride_32/seq_block_ids.npy \
  --input=@/data/llama3.1/weights/70b/prefill_args_bs4_128_stride_32/cs_f16.npy

I ran this on an SPX machine. CPX has one-eighth the memory, which could explain the OOM? Or maybe there were other processes eating up VRAM? I'm not sure what is causing this discrepancy.
