Performance regression of System.MemoryExtensions.SequenceEqual measured on .NET 8.0 Intel Core i7-7700HQ #95346

ymalich · 2023-11-28T17:09:26Z

Description

Hello guys,
I've tested the function System.MemoryExtensions.SequenceEqual with BenchmarkDotNet v0.13.10 on my laptop with an Intel Core i7-7700HQ CPU and got about a 35% performance regression compared to .NET 6.0 and .NET 7.0 with 4K data buffers.

I've run the test several times, and the issue is reproducible.

Configuration

BenchmarkDotNet v0.13.10, Windows 10 (10.0.19045.3693/22H2/2022Update)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK 8.0.100
[Host] : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
.NET 6.0 : .NET 6.0.25 (6.0.2523.51912), X64 RyuJIT AVX2
.NET 7.0 : .NET 7.0.14 (7.0.1423.51910), X64 RyuJIT AVX2
.NET 8.0 : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
.NET Framework 4.8 : .NET Framework 4.8.1 (4.8.9181.0), X64 RyuJIT VectorSize=256

Data

Method	Job	Runtime	N	Mean	Error	StdDev	Ratio	RatioSD	Code Size
SequenceEqual	.NET 6.0	.NET 6.0	4096	101.7 ns	2.08 ns	3.35 ns	1.00	0.00	407 B
SequenceEqual	.NET 7.0	.NET 7.0	4096	100.6 ns	2.05 ns	4.10 ns	0.99	0.05	384 B
SequenceEqual	.NET 8.0	.NET 8.0	4096	138.0 ns	2.79 ns	3.53 ns	1.36	0.06	394 B
SequenceEqual	.NET Framework 4.8	.NET Framework 4.8	4096	116.8 ns	2.36 ns	5.42 ns	1.15	0.06	463 B
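
As a quick check of the Ratio column, the sketch below recomputes each ratio from the Mean values in the table above, using .NET 6.0 as the baseline. The dictionary and function names are illustrative only and are not part of the attached benchmark. BenchmarkDotNet derives Ratio from the measurement distributions, so a plain division of means is only an approximation.

```python
# Mean times in nanoseconds, copied from the table above (N = 4096).
MEAN_NS = {
    ".NET 6.0": 101.7,
    ".NET 7.0": 100.6,
    ".NET 8.0": 138.0,
    ".NET Framework 4.8": 116.8,
}


def ratio_to_baseline(means, baseline=".NET 6.0"):
    """Return each runtime's mean divided by the baseline runtime's mean."""
    base = means[baseline]
    return {runtime: mean / base for runtime, mean in means.items()}


ratios = ratio_to_baseline(MEAN_NS)
for runtime, ratio in ratios.items():
    # (ratio - 1) * 100 is the slowdown relative to the baseline, in percent.
    print(f"{runtime}: ratio {ratio:.2f}, {100 * (ratio - 1):+.1f}% vs .NET 6.0")
```

With these numbers .NET 8.0 comes out at a ratio of about 1.36, which matches the reported column and the "about 35%" figure in the description.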

I've attached the code and BenchmarkDotNet.Artifacts here
SequenceEqualBench.zip

Analysis

I've checked the asm listings.
.NET 6.0, 7.0, and 8.0 use different VMOV instructions to load the data into ymm registers in the AVX loop:
.NET 6.0 - vmovupd
.NET 7.0 - vmovdqu
.NET 8.0 - vmovups
So I wonder whether this could explain the issue.

.NET 6.0	.NET 7.0	.NET 8.0
M01_L01:	M01_L01:	M01_L01:
vmovupd ymm0,[rcx+rax]	vmovdqu ymm0,ymmword ptr [rcx+rax]	vmovups ymm0,[rcx+rax]
vmovupd ymm1,[rdx+rax]		
vpcmpeqb ymm0,ymm0,ymm1	vpcmpeqb ymm0,ymm0,[rdx+rax]	vpcmpeqb ymm0,ymm0,[rdx+rax]
vpmovmskb r9d,ymm0	vpmovmskb r9d,ymm0	vpmovmskb r10d,ymm0
cmp r9d,0FFFFFFFF	cmp r9d,0FFFFFFFF	cmp r10d,0FFFFFFFF
jne short M01_L08	jne short M01_L06	jne near ptr M01_L14
add rax,20	add rax,20	add rax,20
cmp r8,rax	cmp r8,rax	cmp r8,rax
ja short M01_L01	ja short M01_L01	ja M01_L01

ghost · 2023-11-28T17:09:34Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

EgorBo · 2023-11-28T18:10:08Z

On my machine (with AVX512 being enabled and disabled - results are the same) - Ryzen 7950X

xtqqczze · 2023-11-28T18:37:24Z

Unlikely to be related to the different vmov instructions on Kaby Lake (Skylake), see the following llvm-mca output:

https://www.diffchecker.com/lfs3T7Ik/

tannergooding · 2023-11-28T18:44:25Z

It could potentially be related to alignment or the jcc erratum (#93243)

On most modern hardware, the various movups/movupd/movdqu instructions are all treated the same and don't typically have different ports executing them, so there isn't often any form of penalty associated with using one over the other (as there may have been 20 years ago).

xtqqczze · 2023-11-29T00:24:50Z

This is very likely the jcc erratum.

In .NET 8.0, the branch back to the loop start straddles a 32-byte boundary when the loop is 32-byte aligned:

;; .NET 8.0
loopstart:  ;; offset=0x0000
vmovups ymm0, ymmword ptr[rcx+rax]
vpcmpeqb ymm0, ymm0, ymmword ptr[rdx+rax]
vpmovmskb r10d, ymm0
cmp r10d, -1
jne notequal
add rax, 32
cmp r8, rax
ja short loopstart
endofloop:  ;; offset=0x0021

;; .NET 7.0
loopstart:  ;; offset=0x0000
vmovdqu ymm0, ymmword ptr[rcx+rax]
vpcmpeqb ymm0, ymm0, ymmword ptr[rdx+rax]
vpmovmskb r9d, ymm0
cmp r9d, -1
jne short notequal
add rax, 32
cmp r8, rax
ja short loopstart
endofloop:  ;; offset=0x001d
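
The boundary check behind this observation can be sketched as follows. The per-instruction byte lengths are an assumption worked out from the usual VEX and legacy encodings; they do not appear in the listings, though they do sum to the 0x21 and 0x1d end offsets shown. The sketch also assumes the loop starts at offset 0 of a 32-byte aligned block, flags any jump that crosses or ends on a 32-byte boundary (the condition described for the JCC erratum), and ignores macro-fused cmp+jcc pairs for simplicity. All names are illustrative.

```python
BOUNDARY = 32  # the JCC erratum concerns 32-byte boundaries


def touches_boundary(start, length, boundary=BOUNDARY):
    """True if an instruction at [start, start + length) crosses a boundary
    or ends exactly on one."""
    end = start + length  # offset of the first byte after the instruction
    crosses = start // boundary != (end - 1) // boundary
    ends_on = end % boundary == 0
    return crosses or ends_on


def affected_jumps(instructions):
    """Walk (name, length, is_jump) tuples from offset 0 and return the jumps
    that touch a boundary, plus the total code size in bytes."""
    offset = 0
    hits = []
    for name, length, is_jump in instructions:
        if is_jump and touches_boundary(offset, length):
            hits.append((name, offset))
        offset += length
    return hits, offset


# Assumed encodings lengths in bytes; the near jne is 6 bytes, the short one 2.
NET8_LOOP = [
    ("vmovups ymm0, [rcx+rax]", 5, False),
    ("vpcmpeqb ymm0, ymm0, [rdx+rax]", 5, False),
    ("vpmovmskb r10d, ymm0", 4, False),
    ("cmp r10d, -1", 4, False),
    ("jne notequal (near)", 6, True),
    ("add rax, imm8", 4, False),
    ("cmp r8, rax", 3, False),
    ("ja short loopstart", 2, True),
]

NET7_LOOP = [
    ("vmovdqu ymm0, [rcx+rax]", 5, False),
    ("vpcmpeqb ymm0, ymm0, [rdx+rax]", 5, False),
    ("vpmovmskb r9d, ymm0", 4, False),
    ("cmp r9d, -1", 4, False),
    ("jne short notequal", 2, True),
    ("add rax, imm8", 4, False),
    ("cmp r8, rax", 3, False),
    ("ja short loopstart", 2, True),
]

for label, loop in ((".NET 8.0", NET8_LOOP), (".NET 7.0", NET7_LOOP)):
    hits, size = affected_jumps(loop)
    print(label, "size", hex(size), "affected jumps:", hits)
```

Under these assumptions the .NET 8.0 loop is 0x21 bytes and its final ja occupies offsets 0x1f and 0x20, so it crosses the boundary at 0x20; the .NET 7.0 loop is 0x1d bytes and none of its jumps touch a boundary. The 4 extra bytes come from the near form of jne.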

EgorBo · 2024-01-04T22:43:04Z

Let's close it then since we already have an issue for JCC mitigation - #93243

ymalich added the tenet-performance Performance related issue label Nov 28, 2023

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 28, 2023

ghost added the untriaged New issue has not been triaged by the area owner label Nov 28, 2023

EgorBo mentioned this issue Dec 6, 2023

ReadOnlySpan<byte> SequenceEqual method seems to have regressed 25-30% between .Net 6 and .Net 8 #95703

Closed

EgorBo closed this as completed Jan 4, 2024

ghost removed the untriaged New issue has not been triaged by the area owner label Jan 4, 2024

github-actions bot locked and limited conversation to collaborators Feb 4, 2024
