Corrupted outputs with Marlin int4 kernels as parallelization increases #332
Comments
Could be related, but I noticed that the latest release of optimum-quanto (v0.2.5) corrupts transformer weights during qfloat8 quantization. Downgrading to 0.2.4 solved the issue. I am not sure what the exact cause is, but I will look into it.

Code that caused corruption in 0.2.5 but not in earlier versions:

```python
from diffusers import FluxPipeline
from optimum.quanto import freeze, qfloat8, quantize

pipe = FluxPipeline.from_pretrained(...)
quantize(pipe.transformer, weights=qfloat8)
freeze(pipe.transformer)
quantize(pipe.text_encoder, weights=qfloat8)
freeze(pipe.text_encoder)
quantize(pipe.text_encoder_2, weights=qfloat8)
freeze(pipe.text_encoder_2)
```
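For anyone else hitting this: until the root cause is identified, pinning the previous release (e.g. `pip install optimum-quanto==0.2.4`, assuming a PyPI install) sidesteps the corruption, per the downgrade noted above.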
Yeah, same here. I was confused at first because the generated image was just pure noise, so I downgraded to this version and it worked fine. (This was with 0.25.0.dev0.)
@inarikami @Leommm-byte this cannot be related, as the new Marlin kernel is only available for int4 weights.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
When using `MarlinInt4WeightQBitsTensor` and its associated optimized gemm kernel, the readback of the weights/scales/zero-points goes wrong as soon as parallelization increases. As a consequence, output features above 128 are corrupted once a sufficient number of inputs are processed in parallel.
Test to reproduce the issue:
`optimum-quanto/test/tensor/weights/optimized/test_marlin_int4_weight_qbits_tensor.py`, line 134 at commit 852bb9c
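For readers without the repository checked out, here is a minimal standalone sketch of the kind of check the test performs, using only the public API. This is an illustration, not the actual test: it assumes a CUDA device on which the Marlin int4 kernel is selected automatically for frozen qint4 weights, and the layer shapes and batch size are arbitrary choices meant to cross the 128-feature and parallelization thresholds.

```python
import torch
from optimum.quanto import freeze, qint4, quantize

torch.manual_seed(0)

# 256 output features: the reported corruption shows up above feature 128.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256, bias=False),
).to("cuda", torch.float16)

# A batch large enough to exercise the parallelized kernel path.
inputs = torch.randn(64, 256, device="cuda", dtype=torch.float16)
expected = model(inputs)

# Quantize the weights to int4 and freeze; on a suitable CUDA device this
# should route the matmul through the optimized Marlin kernel.
quantize(model, weights=qint4)
freeze(model)
actual = model(inputs)

# Quantization error should be roughly uniform across output features; with
# the bug, features >= 128 show much larger errors as the batch grows.
error = (actual - expected).abs().float().mean(dim=0)
print("mean abs error, features < 128:", error[:128].mean().item())
print("mean abs error, features >= 128:", error[128:].mean().item())
```

The actual test exercises `MarlinInt4WeightQBitsTensor` and its gemm kernel directly; the sketch above only approximates the same comparison through the module-level API.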