Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always use cmpxchg16b on x64 for atomic_ref<16 bytes> #4751

Merged
merged 20 commits into from
Jul 5, 2024

Conversation

StephanTLavavej
Copy link
Member

@StephanTLavavej StephanTLavavej commented Jun 28, 2024

Fixes #4481.

📜 Overview

The cmpxchg16b instruction (aka the _InterlockedCompareExchange128 intrinsic) was missing from the earliest x64 processors, but was quickly added and is now required by all modern OSes. This intrinsic (emitting an equivalent instruction sequence) has always been supported by ARM64.

Our implementation of C++11 atomic didn't attempt to take advantage of the then-fairly-new cmpxchg16b instruction. By the time our ABI was frozen in VS 2015, cmpxchg16b was more widespread, but we were still learning about concurrency and hadn't realized the potential for optimization here. Shortly after the ABI freeze, we added some (highly unusual for our conventions) preprocessor-disabled logic, indicating how we could take advantage of cmpxchg16b when we could break ABI, which still hasn't happened yet. So, while this disabled logic appears in atomic, it currently never uses _InterlockedCompareExchange128 (for either x64 or ARM64), and atomic<16 bytes> is always locking.

Adding to the complexity, our codebase referred to "DCAS", Double Compare-And-Swap, i.e. double the size of a pointer. For 32-bit architectures, that's a 64-bit compare-and-swap, which is always available. So the codebase said that DCAS was always available on x86, ARM32, and ARM64, but only sometimes available on x64.

When we merged #843 implementing C++20 atomic_ref on 2020-08-12 for VS 2019 16.8, cmpxchg16b was even more widespread. I now believe that we should have unconditionally used cmpxchg16b, giving the feature a processor requirement (similar to how shared_mutex and C++20 <chrono> had OS requirements). At the time, it seemed tempting to implement fallback logic. So, on 64-bit systems, atomic_ref<16 bytes> doesn't store any space for a lock, but if at runtime the STL discovers that it's running on x64 without cmpxchg16b, it uses a global lock.

All of this machinery makes our sources very complicated, and it also degrades atomic_ref<16 bytes> performance on 64-bit systems. I believe it's finally time to rip out this machinery, simplifying our codebase and improving performance. I've talked to @MahmoudGSaleh and he agrees.

🎯 Impact

I carefully performed these changes step-by-step, although they're easier to review as a whole. These changes preserve bincompat, in the sense that old and new object files can be freely mixed. (We always require that the final link be performed by the newest toolset and that the VCRedist is also the newest version, but this PR isn't exercising those requirements.)

As previously mentioned, while atomic's preprocessor-disabled logic is being cleaned up, atomic is physically unaffected by this PR. atomic<16 bytes> is always locking (with a per-object lock), and will remain so for the stable ABI era.

On 64-bit systems, C++20 atomic_ref<16 bytes> now uses _InterlockedCompareExchange128 header-only. This improves performance by eliminating a separately compiled function call (for both x64 and ARM64), and by eliminating a cached runtime support check (for x64).

The old dllexports are preserved for bincompat, but they now unconditionally use cmpxchg16b. (This affects the VCRedist, which is unlocked for VS 2022 17.12, but we don't require close coordination between the header and DLL changes.)

Together, these changes theoretically break one scenario, but it is so vanishingly improbable that I don't believe we need to worry about it now (and Mahmoud agreed). Because this is using a processor instruction, not an OS API, the code has to actually be executed on an old processor to cause problems - merely loading the STL DLL won't be blocked.

🥔 New Code + Ancient OS + Potato Chip = Won't Happen

Win8.1 / Server 2012 R2 require cmpxchg16b on x64.

Win7 / Server 2008 R2 are no longer supported by the STL after #4742.

Win8.0 client is no longer supported by Microsoft. As Wikipedia mentions, support for Win8.0 ended in Jan 2016, with the exception of Windows Embedded 8 Standard which ended in Jul 2023.

That leaves Server 2012 classic, which is receiving (paid) extended security updates until Oct 2026.

Consider how many stars would have to align for this PR to be a problem:

  • A machine would have to be running Server 2012 classic.
    • It was released in Sep 2012 and was the latest version for only one year, before Server 2012 R2 was released in Oct 2013.
  • It would have to be 64-bit.
    • 32-bit servers, while nonsensical today, were a thing back then.
  • It would have to be running on an absolute potato chip, old even when Server 2012 classic was new.
    • Support for cmpxchg16b appeared in Intel Nehalem released in Nov 2008, and AMD Bulldozer released in Oct 2011. (It may have appeared in even earlier processor architectures, I didn't do exhaustive research here.)
  • That hardware would need to still be working today - most such machines have simply failed by now.
  • Then, that ancient OS running on a potato chip would need to be getting bleeding-edge software deployments, with new code using:
    • /std:c++20 (not a common option yet),
    • atomic_ref (not a commonly used feature, even in C++20 codebases),
    • atomic_ref<16 bytes> specifically (no other size is affected).

In this highly specific case, this PR will attempt to unconditionally use the cmpxchg16b instruction, which will fail.

I believe that this scenario is sufficiently implausible that we don't need to worry about it (i.e. I don't believe anyone is actually doing this), even with the caveat that deploying an updated VCRedist would be sufficient to cause it, because it would still require C++20 atomic_ref<16 bytes>-using code to have been deployed to the server. (And remember, such code would have gotten terrible performance from the fallback due to the global mutex.)

As previously mentioned, because this PR is using a new instruction and not a new OS API, its impact is limited to this specific scenario only. In general, Server 2012 classic will still be supported by the STL, for the 2 years that this OS has remaining.

⚙️ Source Changes

My individual commits have detailed changelogs, but as previously mentioned it's simpler to review the diff as a whole, and look for these notable things:

<yvals_core.h>

  • _STL_WIN32_WINNT_WINBLUE (i.e. Win8.1) will be unused after these changes.

<yvals.h>

  • _STD_ATOMIC_ALWAYS_USE_CMPXCHG16B is going away - effectively, it will be 1 iff we're 64-bit.
    • Note that despite the name mentioning CMPXCHG16B, it also applied to ARM64.
    • This is where we compared against _STL_WIN32_WINNT_WINBLUE.

<atomic>

  • _STD_COMPARE_EXCHANGE_128 is going away - we're going to use _InterlockedCompareExchange128 directly in the header.
  • __std_atomic_compare_exchange_128() and __std_atomic_has_cmpxchg16b() are being preserved for bincompat, so we don't need to declare them anymore.
    • They're still dllexported by stl/src/msvcp_atomic_wait.src.
    • Observe that their declarations were guarded by #ifdef _WIN64, so 32-bit code never saw these declarations. This will be relevant below.
    • Their declarations were also guarded by _STD_ATOMIC_ALWAYS_USE_CMPXCHG16B == 0, so ARM64 never saw these declarations either.
  • _ATOMIC_HAS_DCAS is going away - effectively, it will always be 1.
    • Note that the comment "Controls whether atomic::is_always_lock_free triggers for sizeof(void *) or 2 * sizeof(void *)" was inaccurate - it affected atomic only in the preprocessor-disabled "break ABI" codepaths. It actively affected atomic_ref, though.
  • By default, Clang requires the -mcx16 compiler option to use the _InterlockedCompareExchange128 intrinsic. Fortunately, there's a way to make this work without requiring users to alter their command lines. We can apply the target attribute to functions, making them effectively compiled with -mcx16. We could manually mark each usage this way, but there's an even more convenient technique - Clang has a push/pop #pragma clang attribute that will apply the attribute to all functions in a region. For hygiene, I'm inventing a _STD_ATOMIC_HEADER "namespace" for this.
  • 1632-1635: As previously mentioned, only the "break ABI" codepaths for atomic are being changed, with DCAS always being used.
  • 2154-2156: Now that we never query for cmpxchg16b's availability at runtime, the current "don't break ABI" and future "break ABI" paths for is_lock_free() report the same answer as is_always_lock_free (which varies above), so this all collapses away. Again, this leaves the current ABI behavior unaffected (it's true for exactly 1, 2, 4, and 8).
  • 2307, 2318-2325: For atomic_ref, "potentially lock-free" becomes "always lock-free", greatly simplifying this code.
  • 2998, 3004: Drop #undefs for removed macros.
  • 3005-3008: Pop the Clang pragma.

atomic_wait.cpp

  • 8: We were already calling abort() in this file, but let's be good kitties and include <cstdlib> for it.
  • 94: Drop the fallback implementation. This was in an unnamed namespace, so no ABI impact.
  • 235-256: Mark __std_atomic_compare_exchange_128() and __std_atomic_has_cmpxchg16b() as preserved for bincompat, and simplify their implementations.
    • They were never declared for 32-bit users, so we can simply abort().
    • They were also never declared for _M_ARM64. However, I don't understand ARM64EC well enough to be absolutely confident that we could further restrict the implementations here to _M_X64, and there would be little value in doing so.

P0019R8_atomic_ref/env.lst

  • No more mutants modes.

P0019R8_atomic_ref/test.cpp

  • We don't need to skip this test in certain modes anymore.
  • atomic_ref<two_pointers_t> is always lock-free now.
  • Drop a comment about our tests assuming Win8+.

…e for 64-bit Win8.1 / Server 2012 R2.

https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-isprocessorfeaturepresent
> PF_COMPARE_EXCHANGE128
> The atomic compare and exchange 128-bit operation (cmpxchg16b) is available.

https://support.microsoft.com/en-us/windows/system-requirements-2f327e5a-2bae-4011-8848-58180a4353a7
> To install a 64-bit OS on a 64-bit PC, your processor needs to support CMPXCHG16b, PrefetchW, and LAHF/SAHF

As mentioned by
https://en.wikipedia.org/wiki/Windows_8
and confirmed by
https://learn.microsoft.com/en-us/lifecycle/products/windows-8
https://learn.microsoft.com/en-us/lifecycle/products/windows-embedded-8-standard
support for Win8.0 client ended in Jan 2016, with the exception of Windows Embedded 8 Standard which ended in Jul 2023.

The one we have to worry about is Server 2012 classic, which is receiving extended security updates through Oct 2026:
https://learn.microsoft.com/en-us/lifecycle/products/windows-server-2012
But only C++20 `atomic_ref<16 bytes>` is physically affected, and only on potato chips, and only for x64. (32-bit servers would be unaffected.) The potential impact is nil.

Server 2012 classic was released in Sep 2012 and was the latest version for only one year, before Server 2012 R2 was released in Oct 2013.

cmpxchg16b support appeared even earlier: in Intel Nehalem released in Nov 2008, and AMD Bulldozer released in Oct 2011.
… should `abort` for 32-bit.

We can do this because they were declared in `<atomic>` for 64-bit only;
there were never any 32-bit callers. (They have to remain dllexported for 32-bit, but that's all.)

Include `<cstdlib>` for `abort` (which we were already calling in this file).

We can drop `__std_atomic_compare_exchange_128_fallback` without ABI impact because it was defined in an unnamed namespace.
Drop the declaration, mark the definition as preserved for bincompat. It's still dllexported by `stl/src/msvcp_atomic_wait.src`.

Directly use the `_InterlockedCompareExchange128` intrinsic.
Now that `__std_atomic_has_cmpxchg16b()` always returns `true`, this collapses into the `_ATOMIC_HAS_DCAS` case.
…in "current ABI" `atomic<T>::is_lock_free()`.
We don't need to replicate the signature, nor does `constexpr bool _Result` improve clarity.
…lways_lock_free`).

`atomic<T>::is_lock_free()` collapsed into the `_ATOMIC_HAS_DCAS` case, so this does too.
`__std_atomic_has_cmpxchg16b()` always returns `true` now.
For `!_ATOMIC_HAS_DCAS`:
`is_always_lock_free` (power of two, at most 1 pointer) is a strict subset of `_Is_potentially_lock_free` (power of two, at most 2 pointers).
This describes what `atomic_ref::is_lock_free()` now returns.
Again, this all collapses into the `_ATOMIC_HAS_DCAS` case.
Drop the declaration, mark the definition as preserved for bincompat. It's still dllexported by `stl/src/msvcp_atomic_wait.src`.
It's now unused by product code, so we can drop its definition and consistency checks.

Then, `P0019R8_atomic_ref` can assume a single mode.
@StephanTLavavej StephanTLavavej added the performance Must go faster label Jun 28, 2024
@StephanTLavavej StephanTLavavej requested a review from a team as a code owner June 28, 2024 22:18
Copy link
Member

@BillyONeal BillyONeal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the original author of all these polyfills,🔥 them, 🔥 them to the ground

🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥

@AlexGuteniev

This comment was marked as resolved.

@AlexGuteniev

This comment was marked as resolved.

@StephanTLavavej StephanTLavavej self-assigned this Jul 4, 2024
@StephanTLavavej
Copy link
Member Author

I'm speculatively mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

Copy link
Member

@MahmoudGSaleh MahmoudGSaleh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@StephanTLavavej StephanTLavavej merged commit ef1d621 into microsoft:main Jul 5, 2024
39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

<atomic>: Avoid checks for cmpxchg16b availability
4 participants