Does it still make sense to align structs for AVX? #1243

gotzmann · 2023-04-29T21:48:01Z

It seems that Q4 / Q8 weights are not aligned to memory boundaries or to the cache line size.

For example, here it is 4 bytes + 16 bytes:

#define QK4_0 32
typedef struct {
    float   d;          // delta
    uint8_t qs[QK4_0 / 2];  // nibbles / quants
} block_q4_0;

I used to think that it is better to align to 32/64 bytes for faster AVX2 / AVX512 (and there are special ops for working with aligned vectors).

So I'm not sure: maybe modern CPUs handle misaligned data easily, or maybe we lose some performance here?
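
To put a number on the concern above, here is a small sketch (not from the thread) that counts how many blocks in a contiguous array cross a cache-line boundary. It assumes the array starts at a line-aligned address and that `sizeof(block_q4_0)` is 20, as on common ABIs; `count_straddling_blocks` is a hypothetical helper introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QK4_0 32
typedef struct {
    float   d;              /* delta */
    uint8_t qs[QK4_0 / 2];  /* nibbles / quants */
} block_q4_0;

/* Assumption: 4-byte float, no padding, so each block is 20 bytes. */
static_assert(sizeof(block_q4_0) == 20, "unexpected block_q4_0 size");

/* Count how many of the first n blocks of a contiguous array touch more
 * than one cache line. The array base is assumed to be aligned to
 * line_size bytes. */
static size_t count_straddling_blocks(size_t n, size_t line_size) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        size_t first = i * sizeof(block_q4_0);            /* first byte offset */
        size_t last  = first + sizeof(block_q4_0) - 1;    /* last byte offset  */
        if (first / line_size != last / line_size) {
            count++;
        }
    }
    return count;
}
```

Under these assumptions, `count_straddling_blocks(16, 64)` gives 4, so one block in four spans two 64-byte lines. Whether that costs anything measurable on a given CPU is a separate question that only a benchmark can answer.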


prusnak · 2023-04-29T22:05:36Z

The struct is not packed, so its size is 20, not 18.
In other words, each field that has a length not divisible by 4 will get padded so that it becomes aligned to 4 bytes.
As a result, each struct is located at an address aligned to 4-byte boundaries.
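
The size and alignment discussed here can be checked at compile time. The sketch below assumes a typical ABI where `float` is 4 bytes with 4-byte alignment; on such a target the 20 bytes come from 4 + 16 with no padding inserted, because the total is already a multiple of the struct's alignment.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define QK4_0 32
typedef struct {
    float   d;              /* delta */
    uint8_t qs[QK4_0 / 2];  /* nibbles / quants */
} block_q4_0;

/* Assumption: typical ABI with a 4-byte, 4-byte-aligned float. */
static_assert(sizeof(float) == 4, "float is expected to be 4 bytes");

/* qs follows d directly: uint8_t has alignment 1, so no gap is needed. */
static_assert(offsetof(block_q4_0, qs) == 4, "qs should start at offset 4");

/* 4 + 16 = 20, already a multiple of 4, so there is no tail padding. */
static_assert(sizeof(block_q4_0) == 20, "block_q4_0 should be 20 bytes");

/* The struct inherits the alignment of its strictest member, the float. */
static_assert(alignof(block_q4_0) == alignof(float),
              "block_q4_0 should be aligned like float");
```

If any of these assertions fails on a particular compiler or target, the assumptions above do not hold there and the numbers in this thread need to be rechecked.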

Having said all that, I don't know much about AVX, so I'm not sure how aligning to 32/64 bytes would help.
However, if you were to align a 20-byte struct to 32 bytes, wouldn't the memory requirements increase to 160%?
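
The 160% figure can be reproduced with a hypothetical 32-byte-aligned variant of the struct. `block_q4_0_aligned32` and `memory_percent` are names made up for this sketch and are not part of the project; the sizes again assume a 4-byte `float`.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define QK4_0 32
typedef struct {
    float   d;              /* delta */
    uint8_t qs[QK4_0 / 2];  /* nibbles / quants */
} block_q4_0;

/* Hypothetical variant: forcing 32-byte alignment on the first member
 * raises the alignment of the whole struct, so the compiler pads its
 * size up to the next multiple of 32. */
typedef struct {
    alignas(32) float d;
    uint8_t qs[QK4_0 / 2];
} block_q4_0_aligned32;

static_assert(sizeof(block_q4_0) == 20, "original block should be 20 bytes");
static_assert(sizeof(block_q4_0_aligned32) == 32, "aligned block should be 32 bytes");

/* Memory use of the aligned variant relative to the original, in percent. */
static unsigned memory_percent(void) {
    return (unsigned)(100 * sizeof(block_q4_0_aligned32) / sizeof(block_q4_0));
}
```

With 32 bytes per block instead of 20, `memory_percent()` evaluates to 160, matching the estimate above: 12 of every 32 bytes would be padding.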

SlyEcho · 2023-04-29T22:10:52Z

float is 4 bytes.

gotzmann · 2023-04-30T09:03:36Z

> Having said all that, I don't know much about AVX, so I'm not sure how aligning to 32/64 bytes would help. However, if you were to align a 20-byte struct to 32 bytes, wouldn't the memory requirements increase to 160%?

I mean, we can store deltas in some other memory buffer to allow faster aligned AVX operations on the quants.
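
One way to picture this idea is a split layout with the deltas in one buffer and the quants in another, each allocated on a 32-byte boundary. This is only a sketch of the suggestion, not the layout the project uses: `q4_0_split`, `q4_0_split_alloc` and `q4_0_split_free` are hypothetical names, and C11 `aligned_alloc` is assumed to be available (it is not on every toolchain).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define QK4_0 32
#define QS_BYTES (QK4_0 / 2)  /* 16 bytes of nibbles per block */

/* Hypothetical split layout: deltas and quants live in separate buffers. */
typedef struct {
    size_t   n_blocks;
    float   *d;   /* n_blocks deltas */
    uint8_t *qs;  /* n_blocks * QS_BYTES quant bytes, base aligned to 32 */
} q4_0_split;

/* Round n up to a multiple of align. */
static size_t round_up(size_t n, size_t align) {
    return (n + align - 1) / align * align;
}

/* Allocate both buffers on 32-byte boundaries. Assumes n_blocks > 0.
 * Returns 0 on success, -1 on allocation failure. */
static int q4_0_split_alloc(q4_0_split *t, size_t n_blocks) {
    t->n_blocks = n_blocks;
    /* C11 aligned_alloc expects the size to be a multiple of the alignment. */
    t->d  = aligned_alloc(32, round_up(n_blocks * sizeof(float), 32));
    t->qs = aligned_alloc(32, round_up(n_blocks * QS_BYTES, 32));
    if (t->d == NULL || t->qs == NULL) {
        free(t->d);
        free(t->qs);
        t->d = NULL;
        t->qs = NULL;
        return -1;
    }
    return 0;
}

static void q4_0_split_free(q4_0_split *t) {
    free(t->d);
    free(t->qs);
    t->d = NULL;
    t->qs = NULL;
}
```

Because each block contributes exactly 16 quant bytes, the quants of two consecutive blocks fill one 32-byte vector, and every even-numbered block starts on a 32-byte boundary. The trade-off is that the delta for a block now comes from a different memory stream than its quants.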

I'm not an expert in AVX either, so maybe if there is a pattern of sequential access (like a matmul pipeline reading one struct after another), there is no cache penalty at all on modern CPUs.

There definitely was a penalty on older CPUs, hence the different AVX instructions for aligned / unaligned vectors.

github-actions · 2024-04-09T01:09:43Z

This issue was closed because it has been inactive for 14 days since being marked as stale.

gjmulder added the enhancement (New feature or request) and performance (Speed related topics) labels May 2, 2023

sw mentioned this issue May 12, 2023

Why Q4 much faster than Q8 ? #1239

Closed

github-actions bot added the stale label Mar 25, 2024

github-actions bot closed this as completed Apr 9, 2024
