SDXL sample app is broken on latest nightly #753

Open
amd-chrissosa opened this issue Jan 4, 2025 · 7 comments
@amd-chrissosa
Contributor

I ran through the user guide as part of kicking off the release. The server app starts up normally but throws an error on any client request:

[2025-01-04 00:59:21.078] [error] [service.py:384] Fatal error in image generation
Traceback (most recent call last):
File "/home/sosa/3.11.venv/lib/python3.11/site-packages/shortfin_apps/sd/components/service.py", line 373, in run
await self._decode(device=device0, requests=self.exec_requests)
File "/home/sosa/3.11.venv/lib/python3.11/site-packages/shortfin_apps/sd/components/service.py", line 657, in _decode
(image,) = await fn(latents, fiber=self.fiber)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: ValueError: shortfin_iree-src/runtime/src/iree/hal/drivers/hip/event_semaphore.c:359: ABORTED; while calling import; while invoking native function hal.device.queue.dealloca;
[ 0] bytecode compiled_vae.decode$async:484 genfiles/sdxl/stable_diffusion_xl_base_1_0_vae_bs1_1024x1024_fp16.mlir:142:3
[2025-01-04 00:59:21.079] [info] [metrics.py:51] Completed inference process (batch size 1) in 1058ms
[2025-01-04 00:59:21] 127.0.0.1:39728 - "POST /generate HTTP/1.1" 200

I tried different device IDs, to no avail, since I thought I might be contending with other processes on the same machine. The Llama sample app works fine on the same machine.

@monorimet
Contributor

My first guess is an IREE runtime regression. We can probably avoid this for now by turning off async allocations, but ideally we fix or revert IREE before release, since that workaround will impact performance.

@monorimet
Contributor

monorimet commented Jan 6, 2025

Perhaps a separate issue -- I notice the SDXL test has been failing as shown:
https://github.com/nod-ai/shark-ai/actions/runs/12606573635/job/35136865226#step:7:265

free(): double free detected in tcache 2
Error sending the request: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

This furthers my suspicion that there has been an IREE runtime regression. We had some significant changes land in IREE main that may need a bit of attention to smooth out downstream wrinkles.
I am currently unable to reproduce the original issue here. @amd-chrissosa can you share the shortfin/IREE versions and server command used? I tested with various server topologies (notably with all canonical permutations of allocation options etc.) and did not encounter issues.

@AWoloszyn
Contributor

The double free should be fixed by iree-org/iree#19583

@AWoloszyn
Contributor

If you run with the AMD_LOG_LEVEL=1 env var set, does anything interesting show up?
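
For anyone reproducing this from a script, here is a minimal sketch of launching a process with AMD_LOG_LEVEL set and capturing its output. The server command is not given in this thread, so the child process below is only a placeholder that echoes the variable back, and `run_with_amd_logging` is a hypothetical helper name, not part of shortfin.

```python
import os
import subprocess
import sys


def run_with_amd_logging(command: list[str], level: int = 1) -> subprocess.CompletedProcess:
    """Run `command` with AMD_LOG_LEVEL set, capturing stdout and stderr as text."""
    # Copy the current environment and add the variable only for the child process.
    env = dict(os.environ, AMD_LOG_LEVEL=str(level))
    return subprocess.run(command, env=env, capture_output=True, text=True)


# Placeholder command: replace it with the actual server invocation.
demo = run_with_amd_logging(
    [sys.executable, "-c", "import os; print(os.environ['AMD_LOG_LEVEL'])"]
)
print(demo.stdout.strip())
```

Setting the variable on the child only keeps the extra logging out of other processes started from the same shell.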

@monorimet
Contributor

monorimet commented Jan 7, 2025

I do encounter a segmentation fault, but only when the workers are under load and using async allocations.

If I switch on the caching allocator (or switch off async allocations), the segfault does not occur.

This is what is printed at the segfault with AMD_LOG_LEVEL=1:

:1:rocdevice.cpp            :2381: 387602872230 us: [pid:2107753 tid:0x7fa58c3e0640] Fail allocation local memory
:1:rocdevice.cpp            :2103: 387602872267 us: [pid:2107753 tid:0x7fa58c3e0640] Failed creating memory
:1:memory.cpp               :358 : 387602872270 us: [pid:2107753 tid:0x7fa58c3e0640] Video memory allocation failed!
:1:memory.cpp               :318 : 387602872273 us: [pid:2107753 tid:0x7fa58c3e0640] Can't allocate memory size - 0xA0200800 bytes!
:1:rocdevice.cpp            :2436: 387602872277 us: [pid:2107753 tid:0x7fa58c3e0640] failed to create a svm hidden buffer!
:1:memory.cpp               :1531: 387602872281 us: [pid:2107753 tid:0x7fa58c3e0640] Unable to allocate aligned memory
:1:hip_memory.cpp           :329 : 387602872291 us: [pid:2107753 tid:0x7fa58c3e0640] Allocation failed : Device memory : required :2686453760 | free :167772160 | total :206141652992
Segmentation fault (core dumped)
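
To put the numbers in that last log line in readable units, here is a small sketch that parses it and converts the byte counts to GiB. The log line is copied from the output above; the function name and the regular expression are illustrative assumptions, not part of any ROCm or shortfin tooling. It shows a request of roughly 2.5 GiB failing with only 160 MiB free out of about 192 GiB of device memory, and confirms that the 0xA0200800 size reported by memory.cpp is the same request.

```python
import re

# Last HIP log line from the report above, copied verbatim.
LOG_LINE = (
    ":1:hip_memory.cpp           :329 : 387602872291 us: "
    "[pid:2107753 tid:0x7fa58c3e0640] Allocation failed : Device memory : "
    "required :2686453760 | free :167772160 | total :206141652992"
)

GIB = 1024 ** 3


def parse_allocation_failure(line: str) -> dict[str, int]:
    """Extract the required/free/total byte counts from the log line."""
    fields = re.findall(r"(required|free|total)\s*:(\d+)", line)
    return {name: int(value) for name, value in fields}


sizes = parse_allocation_failure(LOG_LINE)
for name, value in sizes.items():
    print(f"{name:>8}: {value:>13} bytes = {value / GIB:8.3f} GiB")

# The hex size in the memory.cpp line is the same allocation request.
same_request = 0xA0200800 == sizes["required"]
print(f"hex size matches 'required': {same_request}")
```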

@monorimet
Contributor

Reopening as this should not have been closed.

@monorimet monorimet reopened this Jan 7, 2025
monorimet added a commit that referenced this issue Jan 7, 2025
… default. (#768)

This is a band-aid patch.

We encountered a regression in under-load behavior (tracked by #753).
This effectively uses a different allocator that, while less efficient
at multi-device inference execution, is more stable.
monorimet added a commit that referenced this issue Jan 8, 2025
… default. (#768)
@ScottTodd
Member

I think we can call this fixed? Is anything left to follow up on? Maybe turning off the caching allocator, @monorimet?
