
Does it still make sense to align structs for AVX? #1243

Closed
gotzmann opened this issue Apr 29, 2023 · 4 comments
Labels
enhancement New feature or request performance Speed related topics stale

Comments

@gotzmann

gotzmann commented Apr 29, 2023

It seems the Q4 / Q8 weight blocks are not aligned to memory bounds / cache-line size.

Like here, 4 bytes + 16 bytes:

#define QK4_0 32
typedef struct {
    float   d;          // delta
    uint8_t qs[QK4_0 / 2];  // nibbles / quants
} block_q4_0;

I used to think it was better to align to 32/64 bytes for faster AVX2 / AVX512 (and there are special ops for working with aligned vectors).

So I'm not sure: maybe modern CPUs handle misaligned data easily, or maybe we lose some performance here?

@prusnak
Collaborator

prusnak commented Apr 29, 2023

The struct is not packed, so its size is 20, not 18.
In other words, each field that has a length not divisible by 4 will get padded so that it becomes aligned to 4 bytes.
As a result, each struct is located at an address aligned to 4-byte boundaries.

Having said all that, I don't know much about AVX, so I'm not sure how aligning to 32/64 bytes would help.
However, if you were to align a 20-byte struct to 32 bytes, wouldn't the memory requirements increase to 160%?

@SlyEcho
Collaborator

SlyEcho commented Apr 29, 2023

float is 4 bytes.

@gotzmann
Author

Having said all that, I don't know much about AVX, so I'm not sure how aligning to 32/64 bytes would help. However, if you were to align a 20-byte struct to 32 bytes, wouldn't the memory requirements increase to 160%?

I mean, we could store the deltas in some other memory buffer to allow faster aligned AVX operations on the quants.

I'm not an expert in AVX either, so maybe with a sequential access pattern (like a matmul pipeline reading one struct after another) there is no cache penalty at all on modern CPUs.

There definitely was a penalty on older CPUs, hence the separate AVX instructions for aligned / unaligned vectors.

@gjmulder gjmulder added enhancement New feature or request performance Speed related topics labels May 2, 2023
@github-actions github-actions bot added the stale label Mar 25, 2024
Contributor

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 9, 2024