Improve APIs for Intel string handling intrinsics #957

Open
CarolEidt opened this issue Jun 13, 2018 · 14 comments
Labels
arch-x64, area-System.Runtime.Intrinsics, help wanted (good issue for external contributors)
Milestone
Future
Comments

@CarolEidt
Contributor

See dotnet/coreclr#17637 for some discussion and context.
The fundamental issues revolve around how much of the behavior should be controlled by enums passed into the intrinsic method, versus separating the different behaviors into differently-named intrinsics.

The current API includes both StringComparisonMode (https://github.com/dotnet/corefx/blob/master/src/System.Runtime.Intrinsics/ref/System.Runtime.Intrinsics.cs#L864) and ResultsFlag (https://github.com/dotnet/corefx/blob/master/src/System.Runtime.Intrinsics/ref/System.Runtime.Intrinsics.cs#L875).

Some of the open questions are:

  • Should the ResultsFlag, which determines the comparison results for which a true result is returned, become part of the method name?
  • How should these be named? According to the instruction documentation (e.g. CFlag, as in the current API), or something more descriptive like CompareNonZeroMask?
  • Should these be combined into a single enum or remain separate? (An illustrative sketch of the two shapes under discussion follows below.)
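
For illustration, the two directions roughly look like the following. This is a hypothetical sketch only; the method names and the enum member shown are placeholders, not proposed API (StringComparisonMode and ResultsFlag are the existing enums linked above):

// Hypothetical sketch only; none of these method names are proposed API.

// Shape A: a single intrinsic, with the behavior selected entirely by enum arguments.
bool matchA = Sse42.CompareImplicitLength(left, right,
    StringComparisonMode.EqualOrdered,   // placeholder member name
    ResultsFlag.CFlag);                  // which comparison result a 'true' return reports

// Shape B: the result selection folded into the method name, leaving only the
// comparison mode as an enum argument.
bool matchB = Sse42.CompareImplicitLengthNonZeroMask(left, right,
    StringComparisonMode.EqualOrdered);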
@CarolEidt
Contributor Author

@eerhardt @fiigii @tannergooding @4creators - This is intended to be the follow-on discussion from dotnet/coreclr#17637. Please feel free to modify the description or title, and add any context I've missed.

@scalablecory
Contributor

I think it'd be simpler to control everything via enums rather than have a large number of methods for arbitrary combinations of flags -- otherwise I'll have to keep a mental translation of which combination maps to which instruction.

Is this still part of the 3.0 milestone? Very keen to start using these.

@tannergooding
Member

@CarolEidt, @terrajobst, @eerhardt, @jkotas

I've moved this to future as the APIs we plan to expose here still need more thought/design/etc.

I think someone will need to sit down and take a much deeper look at the instructions and how we should expose them, but I don't think I'll have time to do this before 3.0, so I've marked it as up-for-grabs in case someone from the community wants to have a go.

@maryamariyan transferred this issue from dotnet/corefx Dec 16, 2019
@Dotnet-GitSync-Bot added the area-System.Runtime and untriaged labels Dec 16, 2019
@maryamariyan added the area-System.Runtime.Intrinsics and help wanted labels and removed the area-System.Runtime label Dec 16, 2019
@maryamariyan added this to the Future milestone Dec 16, 2019
@tannergooding removed the untriaged label Feb 10, 2020
@maryamariyan added the untriaged label Feb 23, 2020
@tannergooding removed the untriaged label Mar 4, 2020
@JimBobSquarePants

Heads up... the CoreFX links at the top of the page are broken. I attempted to see if I could add canonical links here, but wasn't sure which branch they should now point to.

@HighPerfDotNet

Hi all,

First of all, I want to say a big thanks for all the hard work on hardware intrinsics. I've been keeping an eye on this progress for many years now, and clearly the work was not as easy as I assumed it would be.

But I just want to raise this specific issue again: string comparison intrinsics are still missing. This is something I've been waiting on for nearly 10 years; it was very disappointing not to see them in .NET 5, and after some digging it now seems unlikely they will appear at all.

From my point of view, a functionally available version of these is far better than the status quo (nothing!). My answer to the 3 open questions is: choose anything, and I'll be happy with it as long as I can use string intrinsics in .NET code.

Thank you.

@saucecontrol
Member

Is there something specific you'd like to implement that you're having trouble with? Perhaps we can help with a workaround.

Aside from coming up with a .NET-like API for these instructions, there are other issues. Namely:

  1. These instructions have never been fully accelerated in any Intel or AMD processor -- they are always implemented in microcode rather than dedicated hardware. This means that an equivalent series of multiple SSE2 instructions is, in practice, no slower than the single SSE4.2 instruction.
  2. These instructions were not extended to wider versions in AVX2 or AVX-512, meaning they have essentially been abandoned and exist only for backward compatibility. If you want to get maximum performance on newer processors, you'll need to write code that uses the wider instructions that do exist, and that will mirror the code that is already possible to write with the SSE2-SSE4.1 surface area exposed in .NET.

@HighPerfDotNet

I've got a lot of long-string comparison code where I'd like to use PcmpEstrI - it does exactly what I need, comparing 128 bits at once and giving me the byte where the strings differ.

@saucecontrol
Member

In general, such a comparison can be performed with Sse2.CompareEqual followed by Sse2.MoveMask.

You can see an example of my second point above by looking at popular projects that used pcmpestri and then wanted to expand the same algorithm to 256-bit vectors with AVX2. Here's some SSE4.2 code, and the same logic implemented with AVX2, which instead uses vpcmpeqb followed by vpmovmskb to do the same thing 256 bits at a time.

In that code, there's really no benefit to using pcmpestri in the 128-bit version over pcmpeqb + pmovmskb because of my first point above. In fact, using only the SSE2 instructions simply makes the solution available to more processors and existing .NET [Core] runtimes.
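
For the original ask above (finding the byte at which two 16-byte blocks differ), a minimal sketch of the CompareEqual + MoveMask approach could look like the following. This is illustrative only; it assumes Sse2.IsSupported, with BitOperations coming from System.Numerics:

// Illustrative sketch: returns the index (0..15) of the first byte where the
// two blocks differ, or 16 if they are identical.
public static int FirstDifference(Vector128<byte> left, Vector128<byte> right)
{
    // Bit i of equalMask is set when byte i of 'left' equals byte i of 'right'.
    int equalMask = Sse2.MoveMask(Sse2.CompareEqual(left, right));
    int diffMask = ~equalMask & 0xFFFF;
    return diffMask == 0 ? 16 : BitOperations.TrailingZeroCount(diffMask);
}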

@HighPerfDotNet

OK, thank you, I'll give AVX2 a go. I wasn't keen on it since it can affect clock speeds, but it's worth a shot since there isn't any easier option.

@saucecontrol
Member

To be clear, you don't have to use AVX2 to work around the absence of the SSE4.2 instructions. You can use SSE2 instructions to build the same algorithm 128 bits at a time if you find that AVX throttling causes performance issues.

I was mostly using the AVX2 example as evidence the string comparison instructions aren't strictly necessary to accelerate string comparisons and that they don't have a future -- otherwise they would exist as 256-bit instructions in AVX2 as well.

@HighPerfDotNet

Makes sense, thank you for the explanation, I might benchmark both and see which one is best for me.

MichalStrehovsky added a commit to MichalStrehovsky/runtime that referenced this issue Jul 1, 2021
@KingRikkie

KingRikkie commented Nov 5, 2021

In general, such a comparison can be performed with Sse2.CompareEqual followed by Sse2.MoveMask.

Of course, it can be done, but I struggle to see how this can be similar in performance to the actual instruction itself.

Take for example _mm_cmpestri(a, al, b, 16, _SIDD_CMP_EQUAL_ORDERED). This gets the index of a part of vector a in vector b while also matching partially at the very end (in case the unmatched part of vector a resides at the beginning of the next vector chunk which isn't part of b yet).

An example:

const unsigned char needleBytes1[16] = { 0x70, 0x80, 0x90, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 };
const unsigned char hayBytes[16]     = { 0x00, 0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80, 0x90, 0xA0, 0xB0, 0x10, 0x20, 0x30, 0x40 };

__m128i needle = _mm_loadu_si128((const __m128i*)needleBytes1);
__m128i hay    = _mm_loadu_si128((const __m128i*)hayBytes);

std::cout << _mm_cmpestri(needle, 3, hay, 16, _SIDD_CMP_EQUAL_ORDERED) << std::endl; // returns 7 -> can match on 1-byte boundaries

const unsigned char needleBytes2[16] = { 0x30, 0x40, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 };
needle = _mm_loadu_si128((const __m128i*)needleBytes2);
std::cout << _mm_cmpestri(needle, 4, hay, 16, _SIDD_CMP_EQUAL_ORDERED) << std::endl; // returns 14 -> partial match at trailing bytes

const unsigned char needleBytes3[16] = { 0x20, 0x30, 0x50, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 };
needle = _mm_loadu_si128((const __m128i*)needleBytes3);
std::cout << _mm_cmpestri(needle, 5, hay, 16, _SIDD_CMP_EQUAL_ORDERED) << std::endl; // returns 16 -> not found

I recreated this behaviour using a loop and multiple shifts in C#:

// Requires: using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
public static int GetIndexFromVector(Vector128<byte> needle, Vector128<byte> hay, int needleLen = 16)
{
    if (needleLen == 16)
        return Sse2.MoveMask(Sse2.CompareEqual(needle, hay)) == 0xFFFF ? 0 : 16;
    else
    {
        // Mask selecting the low 'needleLen' bytes; the needle is blended into the
        // haystack under this mask, and the blend compares equal to the haystack
        // only if every selected needle byte matches.
        var blendMsk = Sse2.ShiftRightLogical128BitLane(Vector128<byte>.AllBitsSet, (byte)(16 - needleLen));
        var shiftNeedle = needle;
        for (int i = 0; i < 16; i++)
        {
            if (Sse2.MoveMask(Sse2.CompareEqual(hay, Sse41.BlendVariable(hay, shiftNeedle, blendMsk))) == 0xFFFF)
                return i;
            // Slide the comparison window up by one byte position for the next iteration.
            blendMsk = Sse2.ShiftLeftLogical128BitLane(blendMsk, (byte)1);
            shiftNeedle = Sse2.ShiftLeftLogical128BitLane(shiftNeedle, (byte)1);
        }
    }
    return 16;
}

Which is nowhere near native speed:
[benchmark results screenshot]

Maybe I am missing an obvious solution?

@tannergooding
Member

Of course, it can be done, but I struggle to see how this can be similar in performance to the actual instruction itself.

The actual instruction performs fairly poorly: it is a macro-op converted into a large number of micro-ops, has fairly high latency, and ties up a large number of ports:
[instruction latency/port-usage table screenshot]

The single instruction actually represents several different "APIs" (ranging from UTF-8 to UTF-16 to various kinds of specialized masking), and that's part of why the underlying instruction is expensive in comparison.

Generally speaking, you can emulate the behavior with compare, move mask, and lzcnt (or similar), depending on exactly what you need.

  • The length given in EAX/EDX is typically only needed in the trailing case and can be handled via masking; i.e. if you only have 15 valid bytes left, you mask off the 16th byte to 0x00, typically using Sse2.And(data, mask) (see the sketch after this list)
    • In particular, both lengths are only needed when comparing two "arbitrary length" strings; if you are searching for a given pattern or index, only one of the lengths is really needed
    • Likewise, this could alternatively be handled in the constructed MoveMask instead
  • The instruction can return either the most significant or least significant set bit of the comparison result, which is just MoveMask followed by LZCNT or TZCNT, respectively.
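
As a concrete sketch of the masking idea from the first bullet (illustrative only; MaskTail is a hypothetical helper name, and this is just one way to build the mask):

// Hypothetical helper: zero the bytes past 'remaining' so a trailing partial block
// can be compared as if it were a full 16-byte block (assumes 0 < remaining <= 16).
static Vector128<byte> MaskTail(Vector128<byte> data, int remaining)
{
    // All-ones in the low 'remaining' bytes, zeros above.
    Vector128<byte> mask = Sse2.ShiftRightLogical128BitLane(
        Vector128<byte>.AllBitsSet, (byte)(16 - remaining));
    return Sse2.And(data, mask);
}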

If you specifically need _SIDD_CMP_EQUAL_ORDERED, that's 0b0000_1100. This represents:

  • Source Data Format: Unsigned bytes
  • Aggregation Operation: Equal ordered
  • Polarity: Positive Polarity (+)
  • Output Selection: Least Significant index

Equal ordered does an operation that is effectively "substring search":

UpperBound = imm8[0] ? 7 : 15;
IntRes1 = imm8[0] ? FFH : FFFFH;
For j = 0 to UpperBound, j++
    For i = 0 to UpperBound-j, k=j to UpperBound, k++, i++
        IntRes1[j] AND= overrideIfDataInvalid(BoolRes[k,i])

So, the following logic is generally what you want:

// Inside a loop over the input, 16 bytes at a time; 'offset' tracks the position so far.
var mask = Sse2.MoveMask(Sse2.CompareEqual(left, right));

if (mask != 0)
{
    // e.g. BitOperations.TrailingZeroCount: index of the first matching byte in this chunk.
    return (offset + TrailingZeroCount(mask));
}

offset += Vector128<byte>.Count;

It's similar to the implementation of Span<byte>.IndexOf which we have already accelerated here: https://source.dot.net/#System.Private.CoreLib/SpanHelpers.Byte.cs,253f89d386c4359c,references

@HighPerfDotNet

Just wanted to post here that AVX2 worked very nicely, thank you for that suggestion. Hope AVX512 will be added soon, keeping an eye on those threads...
