argon2: add workaround for big 64-byte aligned allocations? #573

Open
newpavlov opened this issue Mar 5, 2025 · 5 comments

@newpavlov
Member

newpavlov commented Mar 5, 2025

It was previously discussed in #566.

The claim is that for big allocations we can get higher performance by allocating len + 64 bytes with 1-byte alignment and then manually constructing a 64-byte aligned region of size len inside the allocated memory, rather than by directly allocating len bytes with 64-byte alignment.
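
A minimal sketch of the over-allocation idea described above, using only `std::alloc`. The `AlignedBuf` type and its method names are hypothetical and are not part of the argon2 crate; the sketch only shows how a 64-byte aligned region can be carved out of a 1-byte aligned, zeroed allocation, and makes no claim about the resulting performance.

```rust
use std::alloc::{alloc_zeroed, dealloc, handle_alloc_error, Layout};
use std::ptr::NonNull;
use std::slice;

/// Alignment wanted for the usable region.
const ALIGN: usize = 64;

/// Hypothetical helper (not part of the argon2 crate): a zeroed byte buffer
/// whose usable region starts on a 64-byte boundary.
pub struct AlignedBuf {
    base: NonNull<u8>, // start of the raw, 1-byte aligned allocation
    layout: Layout,    // layout used for the raw allocation (needed by dealloc)
    offset: usize,     // bytes skipped to reach the first 64-byte boundary
    len: usize,        // length of the usable, aligned region
}

impl AlignedBuf {
    pub fn new_zeroed(len: usize) -> Self {
        // Request len + 64 bytes with alignment 1, as described in the issue.
        let total = len.checked_add(ALIGN).expect("length overflow");
        let layout = Layout::from_size_align(total, 1).expect("invalid layout");
        // SAFETY: total >= ALIGN > 0, so the layout has a non-zero size.
        let raw = unsafe { alloc_zeroed(layout) };
        let base = NonNull::new(raw).unwrap_or_else(|| handle_alloc_error(layout));
        // Distance from the base address to the next multiple of ALIGN (0..=63).
        let offset = (base.as_ptr() as usize).wrapping_neg() & (ALIGN - 1);
        Self { base, layout, offset, len }
    }

    pub fn as_slice(&self) -> &[u8] {
        // SAFETY: offset < ALIGN, so offset + len stays within the allocation,
        // and alloc_zeroed initialized every byte.
        unsafe { slice::from_raw_parts(self.base.as_ptr().add(self.offset), self.len) }
    }

    pub fn as_mut_slice(&mut self) -> &mut [u8] {
        // SAFETY: same bounds as as_slice; &mut self guarantees exclusive access.
        unsafe { slice::from_raw_parts_mut(self.base.as_ptr().add(self.offset), self.len) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        // SAFETY: base and layout are exactly what alloc_zeroed was given.
        unsafe { dealloc(self.base.as_ptr(), self.layout) }
    }
}
```

The original base pointer and layout have to be kept around, because deallocation must use them rather than the aligned pointer.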

cc @jonasmalacofilho

@jonasmalacofilho

> than by directly allocating len bytes with 16-byte alignment.

Small correction: the issue is with allocations using alignment greater than 16 (or, more generally, more than the maximum alignment supported by calloc).

In particular, the case we care about is allocating Blocks, which are 64-byte aligned.
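
To illustrate the case above: a type marked `#[repr(align(64))]` makes any array of it carry a 64-byte alignment requirement, which is above the 16-byte threshold mentioned. The `Block` below is a stand-in written for this sketch (assuming the 1 KiB block size from the Argon2 specification), not the crate's actual definition, and `block_array_layout` is a hypothetical helper.

```rust
use std::alloc::Layout;

/// Stand-in for a 64-byte aligned Argon2 block (assumption: 128 u64 words,
/// i.e. 1 KiB). This is not the crate's real type.
#[repr(align(64))]
#[derive(Clone, Copy)]
pub struct Block(pub [u64; 128]);

/// Hypothetical helper: the layout an array of `n` blocks requires.
pub fn block_array_layout(n: usize) -> Layout {
    Layout::array::<Block>(n).expect("layout overflow")
}
```

A `Vec<Block>` requests a layout of this shape from the global allocator, so the alignment passed along with the request is 64 rather than 16 or less.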

@newpavlov
Member Author

Oh, you are right. Fixed.

BTW do we really need the 64-byte alignment in the first place? IIUC this alignment is stricter than what SIMD vectors need, and it looks like an optimization that accounts for cache line size.

newpavlov changed the title from "argon2: add workaround for big 16-byte aligned allocations?" to "argon2: add workaround for big 64-byte aligned allocations?" on Mar 5, 2025
@jonasmalacofilho

jonasmalacofilho commented Mar 5, 2025

Yes, (I think) it's more about cache line size and, specifically, preventing false sharing. It gives a ~5% improvement over 16-byte alignment, if I recall correctly.

I also tried 128-byte alignment, which in theory makes sense for modern 64-bit architectures (including x86-64), and is the value adopted by most general solutions for false sharing (e.g. crossbeam::utils::CachePadded) on these platforms, but any further improvements were offset by more instructions being generated, at least when I tested it (c0ce7f9).

It probably also matters that in Argon2 not every block can be falsely shared, only blocks on the boundaries of the slices. False sharing here isn't as much of an issue as it can be in other cases.

@newpavlov
Member Author

Have you tried to directly mmap the memory?

@jonasmalacofilho

No, I haven't.
