Consider adding a LOCK CMPXCHG16B intrinsic method #28711

gdkchan · 2019-02-17T20:09:58Z

The CMPXCHG16B instruction is required to perform a CAS or an atomic read of a 128-bit value in memory. Currently, atomic 64-bit reads and CAS are supported on .NET through Interlocked.CompareExchange and Interlocked.Read; however, the same operations are not supported for 128-bit values.

I believe the main problem is that the required instruction (CMPXCHG16B) is not supported on all CPUs; for example, some very old AMD CPUs lack it. However, it is a requirement for running Windows 8.1 and 10, so I believe the number of CPUs where this instruction is not supported is very small.

Due to the above limitation, I believe the best way to support it is through an intrinsic, so that the user can check whether the instruction is supported on the current CPU, much like the IsSupported properties exposed on the other ISA classes. The API would be something like this:

namespace System.Runtime.Intrinsics.X86
{
    public static class Cx16
    {
        // Cx16 flag check using the CPUID instruction (cached).
        public static bool IsSupported { get; }
        
        // Returns the old value at destination.
        public static Int128 InterlockedCompareExchange16Bytes(Int128* destination, Int128 value, Int128 comparand) { throw new PlatformNotSupportedException(); }
        
        // Returns true if the store was successful (*destination == comparand), and false otherwise.
        public static bool InterlockedCompareExchange16BytesEqual(Int128* destination, Int128 value, Int128 comparand) { throw new PlatformNotSupportedException(); }
    }
}

It uses an Int128 type that is not yet available, but AFAIK work is being done to add it (dotnet/corefxlab#2635).
Another alternative is passing the value as two 64-bit values (the low and high parts of the 128-bit value). After all, the instruction uses two 64-bit registers. I believe the main problem with this solution is returning the 128-bit value.
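
To make the two-halves alternative concrete, here is a minimal sketch of splitting a 128-bit value into low and high 64-bit parts and joining them back. It assumes the Int128 type that later shipped in .NET 7; the Halves class and its Split and Join methods are hypothetical names used only for illustration, and returning a tuple is just one possible way around the return-value problem.

```csharp
using System;

public static class Halves
{
    // Hypothetical helper: splits a 128-bit value into its low and high 64-bit parts.
    public static (ulong Low, ulong High) Split(Int128 value)
    {
        unchecked
        {
            // The explicit conversion keeps the low 64 bits; the shift exposes the high 64 bits.
            return ((ulong)value, (ulong)(value >> 64));
        }
    }

    // Hypothetical helper: rebuilds the 128-bit value. Int128's constructor takes (upper, lower).
    public static Int128 Join(ulong low, ulong high) => new Int128(high, low);
}
```

The low and high parts correspond to the two 64-bit registers mentioned above.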

The CMPXCHG16B instruction sets the zero flag if the value at destination and the comparand are equal, and clears it otherwise. So I included a method that returns bool (it would basically just return the ZF value), since it should have better codegen when the user only wants to know whether the two values were equal and the store succeeded. In some cases, the value currently at destination is needed (for example, when the user just wants to do an atomic 128-bit read); in that case, the method returning an Int128 can be used (an example is provided below, with the AtomicRead128 method). The method returning bool can be replaced with the one returning an Int128 by comparing the returned value with the comparand; this has slightly worse codegen, but the same end result.

It's also worth noting that this instruction has an alignment requirement: the address must be 16-byte aligned. I believe the LoadAligned SSE intrinsic method had a similar problem, so perhaps this can be handled in a similar way?
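
As an illustration of the alignment requirement, the following sketch obtains a 16-byte-aligned slot from unmanaged memory, where the address cannot move. It assumes .NET 7 or later for NativeMemory.AlignedAlloc, NativeMemory.Clear and NativeMemory.AlignedFree; the AlignedSlot class and its method names are hypothetical, and this is only one way a caller might satisfy the requirement, not part of the proposed API.

```csharp
using System;
using System.Runtime.InteropServices;

public static unsafe class AlignedSlot
{
    // Hypothetical helper: allocates one 16-byte slot on a 16-byte boundary.
    public static void* Allocate()
    {
        void* p = NativeMemory.AlignedAlloc((nuint)16, (nuint)16);
        NativeMemory.Clear(p, (nuint)16); // start from a known all-zero value
        return p;
    }

    // True when the address is a multiple of 16.
    public static bool IsAligned16(void* p) => ((nuint)p & (nuint)15) == 0;

    // Memory from AlignedAlloc must be released with AlignedFree.
    public static void Free(void* p) => NativeMemory.AlignedFree(p);
}
```

A pointer obtained this way could be passed as the destination argument of the proposed methods.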

Example usage, an atomic 128-bit increment, just for illustration purposes:

public static Int128 AtomicIncrement128(Int128* destination)
{
    Int128 oldValue, newValue;
    
    do
    {
        oldValue = AtomicRead128(destination);
        newValue = oldValue + 1;
    }
    while (!InterlockedCompareExchange16BytesEqual(destination, newValue, oldValue));

    return oldValue;
}

private static Int128 AtomicRead128(Int128* source)
{
    // Note: Will cause an access violation for read-only mapped regions,
    // because CMPXCHG16B always performs a write, even if the store "fails".
    return InterlockedCompareExchange16Bytes(source, Int128.Zero, Int128.Zero);
}

It may be worth noting (in case an implementation on Interlocked is desired) that it's also possible to implement this on ARM64, by using LDAXP/CMP/STLXP instruction sequences with two 64-bit registers.

jduncanator · 2019-02-17T22:50:04Z

Thinking about this a little, a more appropriate signature might be one using an out parameter for the old value, something like:

bool InterlockedCompareExchange16Bytes(Int128* destination, Int128 value, Int128 comparand, out Int128 oldValue);

This would consolidate both InterlockedCompareExchange16Bytes and InterlockedCompareExchange16BytesEqual into a single method that covered both use cases.

This would be fairly trivial to implement, as both the comparand and the "old value" (read from destination on failure) are stored in RDX:RAX, so an unconditional copy from RDX:RAX to oldValue at the end of the function call would be enough to handle setting oldValue.
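
To show how the out-parameter shape reads at a call site, here is a sketch that uses the existing 64-bit Interlocked.CompareExchange as a stand-in for the 128-bit operation. CasSketch, TryCompareExchange and Increment are hypothetical names; only the Interlocked calls are real APIs.

```csharp
using System.Threading;

public static class CasSketch
{
    // Stand-in for the proposed shape: returns whether the store happened
    // and always reports the value that was observed at destination.
    public static bool TryCompareExchange(ref long destination, long value, long comparand, out long oldValue)
    {
        oldValue = Interlocked.CompareExchange(ref destination, value, comparand);
        return oldValue == comparand;
    }

    // Atomic increment built on the shape above; returns the value before the increment.
    public static long Increment(ref long destination)
    {
        long expected = Interlocked.Read(ref destination);
        long observed;
        while (!TryCompareExchange(ref destination, expected + 1, expected, out observed))
        {
            // On failure, observed already holds the current value, so no separate re-read is needed.
            expected = observed;
        }
        return expected;
    }
}
```

With this shape, a retry loop gets both the success flag and the fresh value from a single call.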

tannergooding · 2020-03-25T18:06:37Z

We can't support this without either taking in a set of void* or exposing a new Int128 type

jkotas · 2020-03-25T18:10:38Z

This is related to, or a duplicate of, #31911, which is a more general proposal.

tannergooding · 2020-03-25T18:16:44Z

I thought we couldn't support CMPXCHG16B on 32-bit systems due to tearing and so it needed to be an intrinsic?

jkotas · 2020-03-25T19:06:03Z

#31911 avoids that problem by targeting pairs of pointer-sized items only.

jkotas · 2020-03-25T19:08:58Z

But you are right that there are still problems with alignment with #31911 that would need to be solved (on both 32-bit and 64-bit platforms).

omariom · 2021-04-30T12:18:57Z

@jkotas

> there are still problems with alignment

Then maybe it should go in Unsafe?
Or a special unsafe subset of intrinsics.

ayende · 2021-07-22T18:25:28Z

How about handling it with:

bool InterlockedCompareDCAS(Span<nint> destination, ReadOnlySpan<nint> value, ReadOnlySpan<nint> comparand, Span<nint> oldValue);

With the idea that nint will sort it out?

This can be really useful for many scenarios, and the other option is:
https://gist.github.com/jduncanator/ab17e4e476300d3eb0b7c19f6f38429a

tannergooding · 2022-07-30T00:45:15Z

> But you are right that there are still problems with alignment with #31911

There is also the consideration that while CMPXCHG8B is a "baseline" instruction, CMPXCHG16B is a newer instruction and not guaranteed to be available (similar for Arm64 since Atomics aren't required).

We'd end up needing some IsSupported API to avoid any issues with tearing.

gdkchan · 2022-07-31T03:15:11Z

> (similar for Arm64 since Atomics aren't required).

It's possible to implement this without the newer Armv8.1 atomic instructions, using the LDAXP/STLXP instructions as I noted at the end of my first post, so all Arm64 CPUs support this. The CASPAL instruction is still preferred if supported. You can see what clang generates for a 128-bit atomic compare and swap on Arm64 here: https://godbolt.org/z/vra484Yab

GSPP · 2022-10-09T21:47:28Z

Shall this API take a pointer or a ref Int128? The ref version plays better with the GC but it kind of hides the alignment requirements.

colejohnson66 · 2024-07-18T11:26:34Z

Could an extension of #65184 (one with an IsSupported<T>()) cover this?

neon-sunset · 2025-02-18T19:14:01Z

Not having access to CMPXCHG16B continues to be a challenge for writing high-performance concurrent data storage primitives, which is completely self-inflicted given .NET's stance of providing atomics and platform-specific intrinsics.

The existing InterlockedCompareExchange128 from Win32 could have been a great starting point to copy 1:1.

There are also now generic overloads for Interlocked.CompareExchange. Support for 128-bit operations could be provided down-level for structs which are <= 16B and satisfy the unmanaged constraint.

timcassell · 2025-02-18T19:47:21Z

> There are also now generic overloads for Interlocked.CompareExchange. Support for 128-bit operations could be provided down-level for structs which are <= 16B and satisfy the unmanaged constraint.

I asked about that in the PR where the constraint was removed. The response #104558 (comment) prompted me to open #105054 for a more general solution.

tannergooding · 2025-02-18T19:49:28Z

16-byte compare exchange continues being difficult to expose in a cross platform manner due to it not being portable, having various strict requirements, etc.

It might be feasible to expose platform-specific intrinsics, but someone would need to open such a proposal. It would also be required to take pointers (not ref) and likely deal with a tuple (not an Int128 or similar) due to how the underlying instructions actually work.

colejohnson66 · 2025-02-18T19:54:43Z

Why would it need pointers? Aren't refs just safe/managed pointers under the hood?

tannergooding · 2025-02-18T19:56:00Z

Because cmpxchg16b and similar require strict alignment, which isn't something you can guarantee given a ref T, since the GC can relocate said memory and nothing supports 16 byte GC alignment today.

It's fine for long because that has a natural alignment of 8 bytes on 64-bit and is emulated (via an expensive loop) on 32-bit if the platform doesn't guarantee 8-byte alignment for 64-bit data types.

timcassell · 2025-02-18T20:21:36Z

> Because cmpxchg16b and similar require strict alignment, which isn't something you can guarantee given a ref T, since the GC can relocate said memory and nothing supports 16 byte GC alignment today.

For a new type BitwiseAtomic<T> (from #105054), given T is struct RefPair { public object ref1, ref2; }, couldn't BitwiseAtomic<RefPair> have its alignment set to 16? I imagine there could be some algorithm for the runtime to determine the alignment of BitwiseAtomic<T> based on the T, like

// pseudo
sizeof(T) > 16 ? max(alignof(T), alignof(SpinLock)) // Falls back to SpinLock, so the 16B alignment is unnecessary
    : sizeof(T) > 8 ? 16 // 16B alignment for cmpxchg16b
    : 8 // 8B alignment for cmpxchg
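
The pseudocode above can be written out as a small runnable function. This is only a sketch of the selection rule as stated in this comment: AtomicLayout and RequiredAlignment are hypothetical names, the sizes and alignments are passed in as plain integers, and lockAlignment stands in for alignof(SpinLock). Nothing here reflects an existing runtime API.

```csharp
using System;

public static class AtomicLayout
{
    // Picks an alignment for a hypothetical BitwiseAtomic<T> from the size of T.
    public static int RequiredAlignment(int size, int naturalAlignment, int lockAlignment)
    {
        if (size > 16)
        {
            // Too large for a single instruction: fall back to a lock,
            // so the 16-byte alignment is unnecessary.
            return Math.Max(naturalAlignment, lockAlignment);
        }

        // 16-byte alignment for cmpxchg16b, 8-byte alignment for cmpxchg.
        return size > 8 ? 16 : 8;
    }
}
```

For the RefPair example on a 64-bit platform (size 16, natural alignment 8), this rule selects 16.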

tannergooding · 2025-02-18T21:11:55Z

> couldn't BitwiseAtomic<RefPair> have its alignment set to 16?

The GC has no support for such alignment today. If it did, then Int128, Vector128<T>, and some other types (which are ABI primitives and have a natural alignment of 16) would all be sufficient. Today, we only pack these types correctly (thus, struct S { byte x; Int128 y; } has a size of 32, but is only 8-byte aligned on a typical 64-bit computer).

The GC adding such support is complex and is tracked by various other issues, all of which have been indicated by the GC team to be very complex and potentially not "pay for play". To my knowledge the GC is also not in a position to be able to even test the benefit vs drawbacks of supporting new alignments or changing the default alignment.

colejohnson66 · 2025-02-19T14:06:19Z

If I'm understanding correctly, a "simple" alignment attribute like below would be very difficult to implement, especially considering GC interactions like compaction?

namespace System.Runtime.InteropServices;

[AttributeUsage(AttributeTargets.Class | AttributeTargets.Struct | AttributeTargets.Field)]
public sealed class AlignmentAttribute(int alignment) : Attribute
{
    public int Alignment { get; } = alignment;
}

jkotas · 2025-02-19T14:16:10Z

> a "simple" alignment attribute like below would be very difficult to implement, especially considering GC interactions like compaction?

#22990 (comment) is a discussion about what it would take.

msftgits transferred this issue from dotnet/corefx Feb 1, 2020

msftgits added this to the Future milestone Feb 1, 2020

maryamariyan added the untriaged label Feb 23, 2020

tannergooding added the api-suggestion and arch-x64 labels and removed the untriaged label Mar 25, 2020

eiriktsarpalis mentioned this issue Sep 30, 2022

JsonDocument.GetString implementation is not thread safe #76440

Closed

stephentoub modified the milestones: Future, 8.0.0 Oct 23, 2022

dakersnar mentioned this issue Nov 29, 2022

System.Runtime.Intrinsics work planned for .NET 8 #79005

Closed

13 tasks

tannergooding modified the milestones: 8.0.0, Future Jul 24, 2023
