Vectorize IndexOfAnyExcept<T>(T value) #73488

Merged: stephentoub merged 3 commits into dotnet:main on Aug 10, 2022

stephentoub
Member

Contributes to #67942

[Params(1, 4, 16, 64, 256, 1024)]
public int Length { get; set; }

private byte[] _zeros;

[GlobalSetup]
public void Setup() => _zeros = new byte[Length];

[Benchmark]
public bool AllZeroTrue() => _zeros.AsSpan().IndexOfAnyExcept((byte)0) < 0;

| Method | Toolchain | Length | Mean | Ratio | Code Size |
|---|---|---:|---:|---:|---:|
| AllZeroTrue | \main\corerun.exe | 1 | 0.6482 ns | 1.00 | 72 B |
| AllZeroTrue | \pr\corerun.exe | 1 | 1.7013 ns | 2.62 | 242 B |
| AllZeroTrue | \main\corerun.exe | 4 | 2.6035 ns | 1.00 | 72 B |
| AllZeroTrue | \pr\corerun.exe | 4 | 3.5931 ns | 1.38 | 242 B |
| AllZeroTrue | \main\corerun.exe | 16 | 4.8521 ns | 1.00 | 72 B |
| AllZeroTrue | \pr\corerun.exe | 16 | 2.3925 ns | 0.49 | 242 B |
| AllZeroTrue | \main\corerun.exe | 64 | 18.6816 ns | 1.00 | 72 B |
| AllZeroTrue | \pr\corerun.exe | 64 | 3.5615 ns | 0.19 | 242 B |
| AllZeroTrue | \main\corerun.exe | 256 | 81.8503 ns | 1.00 | 72 B |
| AllZeroTrue | \pr\corerun.exe | 256 | 10.3610 ns | 0.13 | 242 B |
| AllZeroTrue | \main\corerun.exe | 1024 | 310.6106 ns | 1.00 | 72 B |
| AllZeroTrue | \pr\corerun.exe | 1024 | 37.1116 ns | 0.12 | 242 B |

The one thing that makes me a tad hesitant is the case where a value other than the excluded one is found at the very beginning, in which case this does incur a (very small) penalty even for really long inputs, e.g.

[Params(1024)]
public int Length { get; set; }

private byte[] _allBitsSet;

[GlobalSetup]
public void Setup() => _allBitsSet = Enumerable.Repeat((byte)0xFF, Length).ToArray();

[Benchmark]
public bool AllBitsSetLookingForNon0() => _allBitsSet.AsSpan().IndexOfAnyExcept((byte)0) < 0;

| Method | Toolchain | Length | Mean | Ratio | RatioSD | Code Size |
|---|---|---:|---:|---:|---:|---:|
| AllBitsSetLookingForNon0 | \main\corerun.exe | 1024 | 0.4979 ns | 1.00 | 0.00 | 72 B |
| AllBitsSetLookingForNon0 | \pr\corerun.exe | 1024 | 1.7421 ns | 3.51 | 0.11 | 242 B |

I'm not sure what if anything to do about it.
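
For readers less familiar with the API, here is a small sketch of what the two benchmarks above actually ask. It assumes .NET 7 or later, where `MemoryExtensions.IndexOfAnyExcept` is available; the `IndexOfAnyExceptDemo` class and its method names are hypothetical and exist only for illustration.

```csharp
using System;

public static class IndexOfAnyExceptDemo
{
    // Returns the index of the first byte that is NOT zero, or -1 if every byte is zero.
    public static int FirstNonZero(byte[] data) =>
        data.AsSpan().IndexOfAnyExcept((byte)0);

    // Same predicate the benchmarks use: a negative result means "nothing but zeros".
    public static bool IsAllZero(byte[] data) => FirstNonZero(data) < 0;
}
```

With an all-zero array the search has to scan the whole input and returns -1, which is the case that benefits from vectorization. With an array filled with 0xFF the very first element already differs from 0, so the call returns 0 immediately and any fixed setup cost of the vectorized path shows up directly in the measurement.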

@stephentoub stephentoub added this to the 7.0.0 milestone Aug 5, 2022
@ghost ghost assigned stephentoub Aug 5, 2022
@stephentoub stephentoub changed the title from "Vectorize IndexOfAnyException<T>(T value)" to "Vectorize IndexOfAnyExcept<T>(T value)" Aug 5, 2022
@ghost

ghost commented Aug 5, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Author: stephentoub
Assignees: -
Labels:

area-System.Memory

Milestone: 7.0.0

@stephentoub
Member Author

Failure is #73247

@stephentoub
Member Author

@adamsitnik, can you please review this? Thanks.

Member

@adamsitnik adamsitnik left a comment

LGTM, thank you @stephentoub ! I really like the namings you use + lack of gotos ;)

@stephentoub
Member Author

Thanks for reviewing, Adam.

@tannergooding
Member

> I'm not sure what if anything to do about it.

I think this is an acceptable tradeoff. The measured difference is just over 1 ns, which would be about 2-4 clock cycles on most modern computers. This is likely representative of the jump to IndexOfAnyExceptValueType and the branch cost of checking the length.

In practice, the spilling of args, the call itself, and general memory latency cost more than this, so most users likely won't see a difference. It's also a general area where we might want to look at improved JIT support, so that the cost of such checks can be minimized.

One thing that might help would be using a pattern like the following and marking the containing method as AggressiveInlining:

if (length < Vector128<T>.Count)
{
    ScalarImpl();
}
else
{
    VectorImpl();
}

This would theoretically allow the JIT to hoist, constant fold, or otherwise mitigate the branch for various input kinds.
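
A minimal sketch of that split, assuming byte elements and the .NET 7+ `Vector128` APIs. `DispatchSketch`, `ScalarImpl`, and `VectorImpl` are hypothetical names used for illustration; this is not the code merged in this PR.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class DispatchSketch
{
    // Tiny entry point: only the length check lives here, so inlining it lets the JIT
    // fold the branch away when the caller's length is a known constant.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int IndexOfAnyExcept(ReadOnlySpan<byte> span, byte value)
    {
        if (!Vector128.IsHardwareAccelerated || span.Length < Vector128<byte>.Count)
        {
            return ScalarImpl(span, value);
        }
        return VectorImpl(span, value);
    }

    private static int ScalarImpl(ReadOnlySpan<byte> span, byte value)
    {
        for (int i = 0; i < span.Length; i++)
        {
            if (span[i] != value) return i;
        }
        return -1;
    }

    private static int VectorImpl(ReadOnlySpan<byte> span, byte value)
    {
        ref byte start = ref MemoryMarshal.GetReference(span);
        Vector128<byte> target = Vector128.Create(value);
        int lastVectorStart = span.Length - Vector128<byte>.Count;
        int i = 0;

        // Compare one full vector at a time; a set bit in the mask marks a mismatching lane.
        for (; i <= lastVectorStart; i += Vector128<byte>.Count)
        {
            Vector128<byte> chunk = Vector128.LoadUnsafe(ref start, (nuint)i);
            uint mask = (~Vector128.Equals(chunk, target)).ExtractMostSignificantBits();
            if (mask != 0) return i + BitOperations.TrailingZeroCount(mask);
        }

        // Fewer than one vector of elements remain; finish them one by one.
        for (; i < span.Length; i++)
        {
            if (span[i] != value) return i;
        }
        return -1;
    }
}
```

The sketch handles the tail with a scalar loop for simplicity; a production implementation could instead process one final overlapping vector.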

{
    for (int i = 0; i < length; i++)
    {
        if (!Unsafe.Add(ref searchSpace, i).Equals(value0))
Member

Just wondering, do you have the disassembly the JIT emits for this?

I'd expect it emits a cmp reg, [addr] (where reg contains value0), but I know there were some issues with Unsafe.Add being converted to an [addr] before.

@stephentoub stephentoub merged commit 301d8e0 into dotnet:main Aug 10, 2022
@stephentoub stephentoub deleted the vectorizeindexofany1 branch August 10, 2022 20:55
@stephentoub
Member Author

> this is likely representative of the jump to IndexOfAnyExceptValueType and branch cost for checking the length.

To be clear, I wasn't referring to the case where there are fewer than Count elements, rather the case where there are more than Count elements and the very first element doesn't match the specified value. I expect that to be a fairly common occurrence with some uses of this method.

@ghost ghost locked as resolved and limited conversation to collaborators Sep 10, 2022
@EgorBo
Member

EgorBo commented Sep 13, 2022

Hm.. looks like this regressed System.Collections.Concurrent.IsEmpty<String>.Dictionary(Size: 512), but no regressions were filed

https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_ubuntu%2018.04%2fSystem.Collections.Concurrent.IsEmpty(String).Dictionary(Size%3a%20512).html

(I blamed this PR because ConcurrentDictionary uses IndexOfAnyExcept for AreAllBucketsEmpty)
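
For context, a check of that shape might look like the sketch below. It assumes .NET 7 or later; `BucketCheckSketch` and the `countPerLock` parameter are hypothetical stand-ins for illustration, not the actual ConcurrentDictionary source.

```csharp
using System;

public static class BucketCheckSketch
{
    // Hypothetical stand-in: one element count per lock.
    // The collection is empty exactly when no count differs from 0.
    public static bool AreAllBucketsEmpty(int[] countPerLock) =>
        countPerLock.AsSpan().IndexOfAnyExcept(0) < 0;
}
```

When the dictionary is not empty, a non-zero count tends to appear early in the array, so the search exits almost immediately. That is the early-match case from the PR description, where the vectorized version pays a small fixed cost.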

@stephentoub
Member Author

> looks like this regressed System.Collections.Concurrent.IsEmpty<String>.Dictionary(Size: 512), but no regressions were filed

By a nanosecond or two, right? That's largely expected. I'd expect IsEmpty to get a tad slower when the dictionary isn't empty and faster when it is empty; basically an extra branch or two in the check.

@EgorBo
Member

EgorBo commented Sep 13, 2022

I totally agree and didn't file an issue for it; it's just needed for the perf study report.
