Improve APIs for Intel string handling intrinsics #957

Open
CarolEidt opened this issue Jun 13, 2018 · 14 comments
Labels
arch-x64, area-System.Runtime.Intrinsics, help wanted (good issue for external contributors)
Milestone
Future
Comments

@CarolEidt
Contributor

See dotnet/coreclr#17637 for some discussion and context.
The fundamental issues revolve around how much of the behavior should be controlled by enums passed into the intrinsic method, versus separating the different behaviors into differently-named intrinsics.

The current API includes both StringComparisonMode (https://github.com/dotnet/corefx/blob/master/src/System.Runtime.Intrinsics/ref/System.Runtime.Intrinsics.cs#L864) and ResultsFlag (https://github.com/dotnet/corefx/blob/master/src/System.Runtime.Intrinsics/ref/System.Runtime.Intrinsics.cs#L875).

Some of the open questions are:

  • Should the ResultsFlag, which determines the comparison results for which a true result is returned, become part of the method name?
  • How should these be named? According to the instruction documentation (e.g. CFlag, as in the current API), or something more descriptive like CompareNonZeroMask?
  • Should these be combined into a single enum or remain separate? (An illustrative sketch of the two shapes under discussion follows below.)
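
For illustration, the two directions roughly look like the following. This is a hypothetical sketch only; the method names and the enum member shown are placeholders, not proposed API (StringComparisonMode and ResultsFlag are the existing enums linked above):

// Hypothetical sketch only; none of these method names are proposed API.

// Shape A: a single intrinsic, with the behavior selected entirely by enum arguments.
bool matchA = Sse42.CompareImplicitLength(left, right,
    StringComparisonMode.EqualOrdered,   // placeholder member name
    ResultsFlag.CFlag);                  // which comparison result a 'true' return reports

// Shape B: the result selection folded into the method name, leaving only the
// comparison mode as an enum argument.
bool matchB = Sse42.CompareImplicitLengthNonZeroMask(left, right,
    StringComparisonMode.EqualOrdered);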
@CarolEidt
Contributor Author

@eerhardt @fiigii @tannergooding @4creators - This is intended to be the follow-on discussion from dotnet/coreclr#17637. Please feel free to modify the description or title, and add any context I've missed.

@scalablecory
Contributor

I think it'd be simpler to control everything via enums rather than have a large number of methods for arbitrary combinations of flags -- otherwise I'll have to keep a mental translation of which combination maps to which instruction.

Is this still part of the 3.0 milestone? Very keen to start using these.

@tannergooding
Member

@CarolEidt, @terrajobst, @eerhardt, @jkotas

I've moved this to future as the APIs we plan to expose here still need more thought/design/etc.

I think someone will need to sit down and take a much deeper look at the instructions and how we should expose them, but I don't think I'll have time to do this before 3.0, so I've marked it as up-for-grabs in case someone from the community wants to have a go.

@maryamariyan transferred this issue from dotnet/corefx Dec 16, 2019
@Dotnet-GitSync-Bot added the area-System.Runtime and untriaged labels Dec 16, 2019
@maryamariyan added the area-System.Runtime.Intrinsics and help wanted labels and removed the area-System.Runtime label Dec 16, 2019
@maryamariyan added this to the Future milestone Dec 16, 2019
@tannergooding removed the untriaged label Feb 10, 2020
@maryamariyan added the untriaged label Feb 23, 2020
@tannergooding removed the untriaged label Mar 4, 2020
@JimBobSquarePants

Heads up... the CoreFX links at the top of the page are broken. I attempted to see if I could add canonical links here, but wasn't sure which branch they should now point to.

@HighPerfDotNet

Hi all,

First of all, I want to say a big thanks for all the hard work on hardware intrinsics. I've been keeping an eye on this progress for many years now, and clearly the work was not as easy as I assumed it would be.

But I just want to raise this specific issue again: string comparison intrinsics are still missing. This is something I've been waiting on for nearly 10 years; it was very disappointing not to see them in .NET 5, and after some digging it now seems unlikely they will appear at all.

From my point of view, a functionally available version of these is far better than the status quo (nothing!). My answer to the 3 open questions is: choose anything, and I'll be happy with it as long as I can use string intrinsics in .NET code.

Thank you.

@saucecontrol
Member

Is there something specific you'd like to implement that you're having trouble with? Perhaps we can help with a workaround.

Aside from coming up with a .NET-like API for these instructions, there are other issues. Namely:

  1. These instructions have never been fully accelerated in any Intel or AMD processor -- they are always implemented in microcode rather than dedicated hardware. This means that an equivalent series of multiple SSE2 instructions is, in practice, no slower than the single SSE4.2 instruction.
  2. These instructions were not extended to wider versions in AVX2 or AVX-512, meaning they have essentially been abandoned and exist only for backward compatibility. If you want to get maximum performance on newer processors, you'll need to write code that uses the wider instructions that do exist, and that will mirror the code that is already possible to write with the SSE2-SSE4.1 surface area exposed in .NET.

@HighPerfDotNet

I've got a lot of long-string comparison code where I'd like to use PcmpEstrI - it does exactly what I need, comparing 128 bits at once and giving me the byte where the strings differ.

@saucecontrol
Member

In general, such a comparison can be performed with Sse2.CompareEqual followed by Sse2.MoveMask.

You can see an example of my second point above by looking at popular projects that used pcmpestri and then wanted to expand the same algorithm to 256-bit vectors with AVX2. Here's some SSE4.2 code, and the same logic implemented with AVX2, which instead uses vpcmpeqb followed by vpmovmskb to do the same thing 256 bits at a time.

In that code, there's really no benefit to using pcmpestri in the 128-bit version over pcmpeqb + pmovmskb because of my first point above. In fact, using only the SSE2 instructions simply makes the solution available to more processors and existing .NET [Core] runtimes.
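
For the original ask above (finding the byte at which two 16-byte blocks differ), a minimal sketch of the CompareEqual + MoveMask approach could look like the following. This is illustrative only; it assumes Sse2.IsSupported, with BitOperations coming from System.Numerics:

// Illustrative sketch: returns the index (0..15) of the first byte where the
// two blocks differ, or 16 if they are identical.
public static int FirstDifference(Vector128<byte> left, Vector128<byte> right)
{
    // Bit i of equalMask is set when byte i of 'left' equals byte i of 'right'.
    int equalMask = Sse2.MoveMask(Sse2.CompareEqual(left, right));
    int diffMask = ~equalMask & 0xFFFF;
    return diffMask == 0 ? 16 : BitOperations.TrailingZeroCount(diffMask);
}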

@HighPerfDotNet

OK, thank you, I'll give AVX2 a go. I wasn't keen on it since it can affect clock speeds, but it's worth a shot since there isn't any easier option.

@saucecontrol
Member

To be clear, you don't have to use AVX2 to work around the absence of the SSE4.2 instructions. You can use SSE2 instructions to build the same algorithm 128 bits at a time if you find that AVX throttling causes performance issues.

I was mostly using the AVX2 example as evidence the string comparison instructions aren't strictly necessary to accelerate string comparisons and that they don't have a future -- otherwise they would exist as 256-bit instructions in AVX2 as well.

@HighPerfDotNet

Makes sense, thank you for the explanation, I might benchmark both and see which one is best for me.

MichalStrehovsky added a commit to MichalStrehovsky/runtime that referenced this issue Jul 1, 2021
@KingRikkie

KingRikkie commented Nov 5, 2021

In general, such a comparison can be performed with Sse2.CompareEqual followed by Sse2.MoveMask.

Of course, it can be done, but I struggle to see how this can be similar in performance to the actual instruction itself.

Take for example _mm_cmpestri(a, al, b, 16, _SIDD_CMP_EQUAL_ORDERED). This gets the index of a part of vector a in vector b while also matching partially at the very end (in case the unmatched part of vector a resides at the beginning of the next vector chunk which isn't part of b yet).

An example:

const unsigned char needleBytes1[16] = { 0x70, 0x80, 0x90, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 };
const unsigned char hayBytes[16]     = { 0x00, 0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80, 0x90, 0xA0, 0xB0, 0x10, 0x20, 0x30, 0x40 };

__m128i needle = _mm_loadu_si128((const __m128i*)needleBytes1);
__m128i hay    = _mm_loadu_si128((const __m128i*)hayBytes);

std::cout << _mm_cmpestri(needle, 3, hay, 16, _SIDD_CMP_EQUAL_ORDERED) << std::endl; // returns 7 -> can match on 1-byte boundaries

const unsigned char needleBytes2[16] = { 0x30, 0x40, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 };
needle = _mm_loadu_si128((const __m128i*)needleBytes2);
std::cout << _mm_cmpestri(needle, 4, hay, 16, _SIDD_CMP_EQUAL_ORDERED) << std::endl; // returns 14 -> partial match at trailing bytes

const unsigned char needleBytes3[16] = { 0x20, 0x30, 0x50, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 };
needle = _mm_loadu_si128((const __m128i*)needleBytes3);
std::cout << _mm_cmpestri(needle, 5, hay, 16, _SIDD_CMP_EQUAL_ORDERED) << std::endl; // returns 16 -> not found

I recreated this behaviour using a loop and multiple shifts in C#:

// Requires: using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
public static int GetIndexFromVector(Vector128<byte> needle, Vector128<byte> hay, int needleLen = 16)
{
    if (needleLen == 16)
        return Sse2.MoveMask(Sse2.CompareEqual(needle, hay)) == 0xFFFF ? 0 : 16;
    else
    {
        // Mask selecting the low 'needleLen' bytes; the needle is blended into the
        // haystack under this mask, and the blend compares equal to the haystack
        // only if every selected needle byte matches.
        var blendMsk = Sse2.ShiftRightLogical128BitLane(Vector128<byte>.AllBitsSet, (byte)(16 - needleLen));
        var shiftNeedle = needle;
        for (int i = 0; i < 16; i++)
        {
            if (Sse2.MoveMask(Sse2.CompareEqual(hay, Sse41.BlendVariable(hay, shiftNeedle, blendMsk))) == 0xFFFF)
                return i;
            // Slide the comparison window up by one byte position for the next iteration.
            blendMsk = Sse2.ShiftLeftLogical128BitLane(blendMsk, (byte)1);
            shiftNeedle = Sse2.ShiftLeftLogical128BitLane(shiftNeedle, (byte)1);
        }
    }
    return 16;
}

Which is nowhere near native speed:
[benchmark results screenshot]

Maybe I am missing an obvious solution?

@tannergooding
Member

Of course, it can be done, but I struggle to see how this can be similar in performance to the actual instruction itself.

The actual instruction performs fairly poorly: it is a macro-op converted into a large number of micro-ops, has fairly high latency, and ties up a large number of ports:
[instruction latency/port-usage table screenshot]

The single instruction actually represents several different "APIs" (ranging from UTF-8 to UTF-16 to various kinds of specialized masking), and that's part of why the underlying instruction is expensive in comparison.

Generally speaking, you can emulate the behavior with compare, move mask, and lzcnt (or similar), depending on exactly what you need.

  • The length given in EAX/EDX is typically only needed in the trailing case and can be handled via masking; i.e. if you only have 15 valid bytes left, you mask off the 16th byte to 0x00, typically using Sse2.And(data, mask) (see the sketch after this list)
    • In particular, both lengths are only needed when comparing two "arbitrary length" strings; if you are searching for a given pattern or index, only one of the lengths is really needed
    • Likewise, this could alternatively be handled in the constructed MoveMask instead
  • The instruction can return either the most significant or least significant set bit of the comparison result, which is just MoveMask followed by LZCNT or TZCNT, respectively.
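
As a concrete sketch of the masking idea from the first bullet (illustrative only; MaskTail is a hypothetical helper name, and this is just one way to build the mask):

// Hypothetical helper: zero the bytes past 'remaining' so a trailing partial block
// can be compared as if it were a full 16-byte block (assumes 0 < remaining <= 16).
static Vector128<byte> MaskTail(Vector128<byte> data, int remaining)
{
    // All-ones in the low 'remaining' bytes, zeros above.
    Vector128<byte> mask = Sse2.ShiftRightLogical128BitLane(
        Vector128<byte>.AllBitsSet, (byte)(16 - remaining));
    return Sse2.And(data, mask);
}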

If you specifically need _SIDD_CMP_EQUAL_ORDERED, that's 0b0000_1100. This represents:

  • Source Data Format: Unsigned bytes
  • Aggregation Operation: Equal ordered
  • Polarity: Positive Polarity (+)
  • Output Selection: Least Significant index

Equal ordered does an operation that is effectively "substring search":

UpperBound = imm8[0] ? 7 : 15;
IntRes1 = imm8[0] ? FFH : FFFFH;
For j = 0 to UpperBound, j++
    For i = 0 to UpperBound-j, k=j to UpperBound, k++, i++
        IntRes1[j] AND= overrideIfDataInvalid(BoolRes[k,i])

So, the following logic is generally what you want:

// Inside a loop over the input, 16 bytes at a time; 'offset' tracks the position so far.
var mask = Sse2.MoveMask(Sse2.CompareEqual(left, right));

if (mask != 0)
{
    // e.g. BitOperations.TrailingZeroCount: index of the first matching byte in this chunk.
    return (offset + TrailingZeroCount(mask));
}

offset += Vector128<byte>.Count;

It's similar to the implementation of Span<byte>.IndexOf which we have already accelerated here: https://source.dot.net/#System.Private.CoreLib/SpanHelpers.Byte.cs,253f89d386c4359c,references

@HighPerfDotNet

Just wanted to post here that AVX2 worked very nicely, thank you for that suggestion. Hope AVX512 will be added soon, keeping an eye on those threads...
