Performance regression of System.MemoryExtensions.SequenceEqual measured on .NET 8.0 Intel Core i7-7700HQ #95346

Closed
ymalich opened this issue Nov 28, 2023 · 6 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue

Comments

@ymalich

ymalich commented Nov 28, 2023

Description

Hello guys,
I've tested the function System.MemoryExtensions.SequenceEqual with BenchmarkDotNet v0.13.10 on my laptop with an Intel Core i7-7700HQ CPU and got about a 35% performance regression on .NET 8.0 compared to .NET 6.0 and .NET 7.0 with 4K data buffers.

I've run the test several times, the issue is reproducible.

Configuration

BenchmarkDotNet v0.13.10, Windows 10 (10.0.19045.3693/22H2/2022Update)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK 8.0.100
[Host] : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
.NET 6.0 : .NET 6.0.25 (6.0.2523.51912), X64 RyuJIT AVX2
.NET 7.0 : .NET 7.0.14 (7.0.1423.51910), X64 RyuJIT AVX2
.NET 8.0 : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
.NET Framework 4.8 : .NET Framework 4.8.1 (4.8.9181.0), X64 RyuJIT VectorSize=256

Data

| Method | Job | Runtime | N | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|---|---|---|---|---|---|---|---|---|---|
| SequenceEqual | .NET 6.0 | .NET 6.0 | 4096 | 101.7 ns | 2.08 ns | 3.35 ns | 1.00 | 0.00 | 407 B |
| SequenceEqual | .NET 7.0 | .NET 7.0 | 4096 | 100.6 ns | 2.05 ns | 4.10 ns | 0.99 | 0.05 | 384 B |
| SequenceEqual | .NET 8.0 | .NET 8.0 | 4096 | 138.0 ns | 2.79 ns | 3.53 ns | 1.36 | 0.06 | 394 B |
| SequenceEqual | .NET Framework 4.8 | .NET Framework 4.8 | 4096 | 116.8 ns | 2.36 ns | 5.42 ns | 1.15 | 0.06 | 463 B |
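
As a sanity check on the "about 35%" figure, the Ratio column can be recomputed from the Mean column. The snippet below is only an illustrative sketch in Python, not part of the attached benchmark; the helper names `ratios` and `slowdown_percent` are made up for this example.

```python
# Mean times in nanoseconds, copied from the results table.
mean_ns = {
    ".NET 6.0": 101.7,
    ".NET 7.0": 100.6,
    ".NET 8.0": 138.0,
    ".NET Framework 4.8": 116.8,
}


def ratios(means, baseline=".NET 6.0"):
    """Return each job's mean divided by the baseline mean, rounded to 2 places."""
    base = means[baseline]
    return {job: round(mean / base, 2) for job, mean in means.items()}


def slowdown_percent(means, job, baseline):
    """Return how much slower `job` is than `baseline`, in percent."""
    return (means[job] / means[baseline] - 1.0) * 100.0


if __name__ == "__main__":
    print(ratios(mean_ns))
    print(slowdown_percent(mean_ns, ".NET 8.0", ".NET 6.0"))
    print(slowdown_percent(mean_ns, ".NET 8.0", ".NET 7.0"))
```

Against either baseline, the .NET 8.0 slowdown comes out in the mid-to-high 30% range, which is well outside the reported Error and StdDev values.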

I've attached the code and BenchmarkDotNet.Artifacts here
SequenceEqualBench.zip

Analysis

I've checked the asm listings.
.NET 6.0, 7.0 and 8.0 use different VMOV instructions to load the data into ymm registers in the AVX loop:
.NET 6.0 - vmovupd
.NET 7.0 - vmovdqu
.NET 8.0 - vmovups
So I wonder whether this could explain the issue.

| .NET 6.0 | .NET 7.0 | .NET 8.0 |
|---|---|---|
| M01_L01: | M01_L01: | M01_L01: |
| vmovupd ymm0,[rcx+rax] | vmovdqu ymm0,ymmword ptr [rcx+rax] | vmovups ymm0,[rcx+rax] |
| vmovupd ymm1,[rdx+rax] | | |
| vpcmpeqb ymm0,ymm0,ymm1 | vpcmpeqb ymm0,ymm0,[rdx+rax] | vpcmpeqb ymm0,ymm0,[rdx+rax] |
| vpmovmskb r9d,ymm0 | vpmovmskb r9d,ymm0 | vpmovmskb r10d,ymm0 |
| cmp r9d,0FFFFFFFF | cmp r9d,0FFFFFFFF | cmp r10d,0FFFFFFFF |
| jne short M01_L08 | jne short M01_L06 | jne near ptr M01_L14 |
| add rax,20 | add rax,20 | add rax,20 |
| cmp r8,rax | cmp r8,rax | cmp r8,rax |
| ja short M01_L01 | ja short M01_L01 | ja M01_L01 |
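
To put the measured times in terms of the loop above, here is a rough per-iteration estimate in Python. It is a sketch under two assumptions that the issue does not state explicitly: that N = 4096 counts bytes, and that essentially all of the time is spent in this loop (setup and tail handling are ignored). The immediate in `add rax,20` is hexadecimal, so each iteration advances 32 bytes.

```python
STRIDE_BYTES = 0x20   # "add rax,20" in the listing: 0x20 = 32 bytes per iteration
BUFFER_BYTES = 4096   # assumption: N = 4096 is a byte count


def ns_per_iteration(mean_ns, buffer_bytes=BUFFER_BYTES, stride=STRIDE_BYTES):
    """Approximate time per loop iteration, ignoring work outside the loop."""
    iterations = buffer_bytes // stride
    return mean_ns / iterations


if __name__ == "__main__":
    for job, mean in [(".NET 6.0", 101.7), (".NET 7.0", 100.6), (".NET 8.0", 138.0)]:
        print(job, ns_per_iteration(mean))
```

Under these assumptions the difference between .NET 7.0 and .NET 8.0 is a fraction of a nanosecond per iteration, so it is small per iteration but adds up over the 128 iterations.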
@ymalich ymalich added the tenet-performance Performance related issue label Nov 28, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 28, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Nov 28, 2023
@ghost

ghost commented Nov 28, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@EgorBo
Member

EgorBo commented Nov 28, 2023

image

On my machine (Ryzen 7950X), the results are the same with AVX-512 enabled and disabled.

@xtqqczze
Contributor

Unlikely to be related to the different vmov instructions on Kaby Lake (Skylake), see the following llvm-mca output:

https://www.diffchecker.com/lfs3T7Ik/

@tannergooding
Member

It could potentially be related to alignment or the jcc erratum (#93243).

On most modern hardware, the various movups/movupd/movdqu instructions are all treated the same and don't typically have different ports executing them, so there isn't often any form of penalty associated with using one over the other (as there may have been 20 years ago).

@xtqqczze
Contributor

xtqqczze commented Nov 29, 2023

This is very likely the jcc erratum.

In .NET 8.0, the branch back to the loop start crosses a 32-byte boundary when the loop is 32-byte aligned:

;; .NET 8.0
loopstart:  ;; offset=0x0000
vmovups ymm0, ymmword ptr[rcx+rax]
vpcmpeqb ymm0, ymm0, ymmword ptr[rdx+rax]
vpmovmskb r10d, ymm0
cmp r10d, -1
jne notequal
add rax, 32
cmp r8, rax
ja short loopstart
endofloop:  ;; offset=0x0021

;; .NET 7.0
loopstart:  ;; offset=0x0000
vmovdqu ymm0, ymmword ptr[rcx+rax]
vpcmpeqb ymm0, ymm0, ymmword ptr[rdx+rax]
vpmovmskb r9d, ymm0
cmp r9d, -1
jne short notequal
add rax, 32
cmp r8, rax
ja short loopstart
endofloop:  ;; offset=0x001d
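
The boundary crossing can be checked by adding up instruction lengths. The Python sketch below is illustrative only. The byte lengths are assumptions worked out by hand from the standard x86-64 encodings (they are not taken from the disassembly), although they do add up to the end offsets 0x1d and 0x21 quoted above. The check flags a jump that crosses or ends on a 32-byte boundary, which is the condition the jcc erratum mitigation is concerned with. For simplicity it looks only at the jump itself and does not model a macro-fused `cmp`/`jcc` pair as a single unit.

```python
BOUNDARY = 32

# (instruction, assumed encoded length in bytes), loop starting at offset 0.
NET7_LOOP = [
    ("vmovdqu ymm0, [rcx+rax]", 5),
    ("vpcmpeqb ymm0, ymm0, [rdx+rax]", 5),
    ("vpmovmskb r9d, ymm0", 4),
    ("cmp r9d, -1", 4),
    ("jne short notequal", 2),   # rel8 form
    ("add rax, imm8", 4),
    ("cmp r8, rax", 3),
    ("ja short loopstart", 2),
]

NET8_LOOP = [
    ("vmovups ymm0, [rcx+rax]", 5),
    ("vpcmpeqb ymm0, ymm0, [rdx+rax]", 5),
    ("vpmovmskb r10d, ymm0", 4),
    ("cmp r10d, -1", 4),
    ("jne notequal", 6),         # rel32 (near) form: 4 bytes longer
    ("add rax, imm8", 4),
    ("cmp r8, rax", 3),
    ("ja short loopstart", 2),
]


def affected_jumps(instructions, base=0):
    """Return (jumps that cross or end on a 32-byte boundary, end offset)."""
    offset = base
    hits = []
    for text, length in instructions:
        end = offset + length  # offset of the first byte after the instruction
        if text.startswith("j") and offset // BOUNDARY != end // BOUNDARY:
            hits.append((text, offset, end))
        offset = end
    return hits, offset


if __name__ == "__main__":
    print(affected_jumps(NET7_LOOP))
    print(affected_jumps(NET8_LOOP))
```

With these lengths, the .NET 7.0 loop ends at offset 0x1d and no jump touches a boundary. In the .NET 8.0 loop, the longer near `jne` pushes the final `ja` to offsets 31 to 32, so it straddles the 32-byte boundary.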

@EgorBo
Member

EgorBo commented Jan 4, 2024

Let's close it then since we already have an issue for JCC mitigation - #93243

@EgorBo EgorBo closed this as completed Jan 4, 2024
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jan 4, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Feb 4, 2024