Slotted array vectorized #376

Scooletz · 2024-07-24T12:54:51Z

This PR changes the way SlottedArray uses vectorized search.

Previously, it used `Span.IndexOf` over the whole set of slots. That span contained both the hashes and the `Slot.Raw` values, so the search went through 2x more data than necessary. It could also produce false-positive hits (possible, but quite unlikely) whenever a `Slot.Raw` value was equal to the searched hash. Additionally, when a hash collision occurred, multiple calls to `Span.IndexOf` were required to move the search across the map.
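The false-positive problem of the old approach can be modeled in a few lines. This is a Python sketch, not the C# code from the PR, and it assumes that each slot was stored as two adjacent 16-bit values, the hash followed by its raw part, and searched as one flat span. `find_interleaved` is a hypothetical name; `list.index` stands in for `Span.IndexOf`. It only models restarts caused by a raw value equal to the hash; restarts after a genuine hash collision with a different key are left out.

```python
from typing import List, Optional, Tuple


def find_interleaved(words: List[int], hash_: int) -> Tuple[Optional[int], int]:
    """Model of the old search over interleaved (hash, raw) 16-bit pairs.

    Assumption: words[2*i] is the hash of slot i and words[2*i + 1] is its raw value.
    Returns (slot index or None, number of IndexOf-style scans performed).
    """
    start = 0
    scans = 0
    while True:
        scans += 1
        try:
            # list.index plays the role of Span.IndexOf on the remaining span
            pos = words.index(hash_, start)
        except ValueError:
            return None, scans
        if pos % 2 == 0:
            return pos // 2, scans  # even position: a genuine hash match
        # Odd position: a raw value that merely equals the hash (false positive),
        # so the scan must be restarted right after it.
        start = pos + 1


# Slot 0 has raw 0x0042, which equals slot 1's hash: two scans are needed.
print(find_interleaved([0x1234, 0x0042, 0x0042, 0x0007], 0x0042))
```

Every scan also touches the raw values, which is where the "2x more data" comes from.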

This PR aligns `SlottedArray` to the biggest vector supported on the given platform: `Vector256` on x64 and `Vector128` on ARM. It does so by introducing chunks made of vector-sized parts: one vector holding the hashes and one holding the raw part of the slots (the markers and the pointer within the page). On x64, this requires 32 bytes for hashes + 32 bytes for raw = 64 bytes per chunk. There is no other overhead apart from some slots potentially being unused (at most 60 bytes). This approach allows the hashes to be scanned in a cleanly vectorized way, without false positives and without checking whether a matched value sits at a hash offset. If the search is not finished, it jumps over the vector of slots to the next chunk.
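The chunk arithmetic above can be checked with a small sketch. It assumes a 16-bit hash and a 16-bit raw part per slot; the PR does not state these widths, but they are consistent with its figures (32 bytes of hashes, 64-byte chunks, at most 60 unused bytes). The function names are hypothetical, and the Vector128 numbers are derived from the same assumption rather than taken from the PR.

```python
HASH_BYTES = 2  # assumption: 16-bit hash per slot
RAW_BYTES = 2   # assumption: 16-bit raw part (markers + pointer in page)
SLOT_BYTES = HASH_BYTES + RAW_BYTES


def chunk_geometry(vector_bytes: int) -> dict:
    """Size one chunk: a vector of hashes followed by a vector of raw values."""
    slots_per_chunk = vector_bytes // HASH_BYTES            # hashes that fit in one vector
    chunk_bytes = slots_per_chunk * SLOT_BYTES              # hashes vector + raw vector
    max_unused_bytes = (slots_per_chunk - 1) * SLOT_BYTES   # last chunk holds a single used slot
    return {
        "slots_per_chunk": slots_per_chunk,
        "chunk_bytes": chunk_bytes,
        "max_unused_bytes": max_unused_bytes,
    }


def chunks_needed(slot_count: int, vector_bytes: int) -> int:
    """Whole chunks needed to hold slot_count slots (ceiling division)."""
    per_chunk = chunk_geometry(vector_bytes)["slots_per_chunk"]
    return -(-slot_count // per_chunk)


print(chunk_geometry(32))  # x64, Vector256 (32 bytes)
print(chunk_geometry(16))  # ARM, Vector128 (16 bytes)
```

Under these assumptions, x64 gives 16 slots per 64-byte chunk with up to 60 bytes unused, and ARM gives 8 slots per 32-byte chunk with up to 28 bytes unused.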

The advantages are the following:

- 2x fewer comparisons
- better handling of hash collisions (if the colliding entries occur within the same vector)
- no remainder handling, as everything is vector-aligned
- a simple and tight loop
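The shape of that loop can be sketched as a scalar Python model. It assumes that one vector compare of the hashes is turned into a bitmask (in .NET this could be done with `Vector256.Equals` followed by `ExtractMostSignificantBits`, but the PR does not say which calls it uses). `equals_mask` emulates that step, and `try_find`, `count` and the `key_matches` callback are hypothetical names. Each set bit is a candidate; clearing it moves to the next collision in the same vector, and an empty mask jumps to the next chunk.

```python
from typing import Callable, List, Optional, Tuple

Chunk = Tuple[List[int], List[int]]  # (hashes vector, raws vector), same length


def equals_mask(lanes: List[int], value: int) -> int:
    """Scalar stand-in for a vector compare turned into a bitmask:
    bit i is set when lanes[i] == value."""
    mask = 0
    for i, lane in enumerate(lanes):
        if lane == value:
            mask |= 1 << i
    return mask


def try_find(chunks: List[Chunk], count: int, hash_: int,
             key_matches: Callable[[int], bool]) -> Optional[int]:
    """Return the index of the slot whose hash and key match, or None.

    count is the number of used slots; lanes past it in the last chunk are unused.
    key_matches(raw) is a hypothetical callback comparing the stored key.
    """
    for c, (hashes, raws) in enumerate(chunks):
        mask = equals_mask(hashes, hash_)  # one compare per chunk, hashes only
        while mask:
            lane = (mask & -mask).bit_length() - 1  # position of the lowest set bit
            index = c * len(hashes) + lane
            if index >= count:
                return None  # every later lane and chunk is past count as well
            if key_matches(raws[lane]):
                return index
            mask &= mask - 1  # clear that bit: try the next collision in this vector
    return None


chunks = [([5, 7, 5, 9], [10, 11, 12, 13]), ([7, 5, 0, 0], [20, 21, 0, 0])]
print(try_find(chunks, 6, 5, lambda raw: raw == 21))
```

The 4-lane chunks only keep the example short; under the assumptions of this model, a `Vector256` of 16-bit hashes would have 16 lanes.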

Vector alignment was tested on x64, but it showed no positive impact in the benchmarks.

TODO List

Benchmarks

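The new benchmarks, based on aligned memory to make them truly comparable, show clear improvements, especially when the searched key is not among the initial keys and the search requires more iterations. For example, `TryGet` at index 95 drops from 12.936 ns to 9.197 ns, while at index 1 it is slightly slower (7.180 ns before vs. 7.512 ns after). The generated code for `TryGet` is also smaller (about 4.0–4.1 KB before vs. 3.2–3.3 KB after).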

TryGet

Before

| Method | sliceFrom | length | index | odd | Mean | Error | StdDev | Code Size |
|--------|-----------|--------|------:|-----|-----:|------:|-------:|----------:|
| TryGet | ? | ? | 1 | False | 7.180 ns | 0.1448 ns | 0.1355 ns | 4,068 B |
| TryGet | ? | ? | 15 | False | 7.644 ns | 0.1158 ns | 0.0967 ns | 4,033 B |
| TryGet | ? | ? | 16 | False | 8.136 ns | 0.1283 ns | 0.1200 ns | 4,033 B |
| TryGet | ? | ? | 31 | False | 8.841 ns | 0.1750 ns | 0.1637 ns | 4,098 B |
| TryGet | ? | ? | 32 | False | 9.113 ns | 0.0999 ns | 0.0934 ns | 4,079 B |
| TryGet | ? | ? | 47 | False | 9.284 ns | 0.0931 ns | 0.0871 ns | 4,057 B |
| TryGet | ? | ? | 48 | False | 10.058 ns | 0.0464 ns | 0.0412 ns | 4,100 B |
| TryGet | ? | ? | 63 | False | 10.383 ns | 0.1024 ns | 0.0957 ns | 4,035 B |
| TryGet | ? | ? | 64 | False | 10.979 ns | 0.0954 ns | 0.0893 ns | 4,036 B |
| TryGet | ? | ? | 95 | False | 12.936 ns | 0.1134 ns | 0.1006 ns | 4,083 B |
| TryGet | ? | ? | 96 | False | 11.370 ns | 0.1728 ns | 0.1616 ns | 4,096 B |

After

| Method | sliceFrom | length | index | odd | Mean | Error | StdDev | Code Size |
|--------|-----------|--------|------:|-----|-----:|------:|-------:|----------:|
| TryGet | ? | ? | 1 | False | 7.512 ns | 0.1016 ns | 0.0951 ns | 3,310 B |
| TryGet | ? | ? | 15 | False | 7.455 ns | 0.0670 ns | 0.0627 ns | 3,266 B |
| TryGet | ? | ? | 16 | False | 7.903 ns | 0.0643 ns | 0.0601 ns | 3,253 B |
| TryGet | ? | ? | 31 | False | 7.816 ns | 0.0808 ns | 0.0756 ns | 3,253 B |
| TryGet | ? | ? | 32 | False | 8.174 ns | 0.0827 ns | 0.0774 ns | 3,240 B |
| TryGet | ? | ? | 47 | False | 8.147 ns | 0.0778 ns | 0.0650 ns | 3,220 B |
| TryGet | ? | ? | 48 | False | 8.464 ns | 0.1305 ns | 0.1157 ns | 3,229 B |
| TryGet | ? | ? | 63 | False | 8.487 ns | 0.1186 ns | 0.1110 ns | 3,186 B |
| TryGet | ? | ? | 64 | False | 8.922 ns | 0.1604 ns | 0.1501 ns | 3,217 B |
| TryGet | ? | ? | 95 | False | 9.197 ns | 0.1391 ns | 0.1233 ns | 3,171 B |
| TryGet | ? | ? | 96 | False | 10.232 ns | 0.1698 ns | 0.1588 ns | 3,175 B |

TryGet_With_Hash_Collisions

Before

| Method | index | Mean | Error | StdDev | Code Size |
|--------|------:|-----:|------:|-------:|----------:|
| TryGet_With_Hash_Collisions | 1 | 23.79 ns | 0.168 ns | 0.157 ns | 4,458 B |
| TryGet_With_Hash_Collisions | 2 | 14.18 ns | 0.172 ns | 0.161 ns | 4,338 B |
| TryGet_With_Hash_Collisions | 3 | 23.95 ns | 0.096 ns | 0.090 ns | 4,497 B |
| TryGet_With_Hash_Collisions | 4 | 14.22 ns | 0.120 ns | 0.112 ns | 4,345 B |
| TryGet_With_Hash_Collisions | 30 | 15.72 ns | 0.237 ns | 0.222 ns | 4,316 B |
| TryGet_With_Hash_Collisions | 31 | 23.88 ns | 0.193 ns | 0.180 ns | 4,496 B |

After

| Method | index | Mean | Error | StdDev | Code Size |
|--------|------:|-----:|------:|-------:|----------:|
| TryGet_With_Hash_Collisions | 1 | 21.75 ns | 0.171 ns | 0.151 ns | 5,598 B |
| TryGet_With_Hash_Collisions | 2 | 14.82 ns | 0.297 ns | 0.278 ns | 5,551 B |
| TryGet_With_Hash_Collisions | 3 | 21.62 ns | 0.232 ns | 0.205 ns | 5,649 B |
| TryGet_With_Hash_Collisions | 4 | 14.73 ns | 0.094 ns | 0.088 ns | 5,553 B |
| TryGet_With_Hash_Collisions | 30 | 15.31 ns | 0.204 ns | 0.191 ns | 5,528 B |
| TryGet_With_Hash_Collisions | 31 | 22.44 ns | 0.212 ns | 0.188 ns | 5,606 B |

`SlottedArray` upgraded design

                                                                   
 ┌─────────────────┬────────────────────┬─────────────────────────┐  
 │ HEADER          │ VECTOR of Hashes   │ VECTOR of Slots         │  
 │                 │                    │                         │  
 │                 │                    │                         │  
 │                 │                    │                         │  
 ├─────────────────┴──────┬─────────────┴───────┬─────────────────┤  
 │ VECTOR of Hashes       │ VECTOR of Slots     │                 │  
 │                        │                     │                 │  
 │                        │                     │                 │  
 │                        │                     │                 │  
 ├────────────────────────┴─────────────────────┘                 │  
 │                                                                │  
 │                                                                │  
 │                                                                │  
 │                                                                │  
 │                                                                │  
 │                                                                │  
 │                                                                │  
 │                            ┌─────────┬─────────────────────────┤  
 │                            │         │                         │  
 │                            │         │                         │  
 │                            │  DATA   │                DATA     │  
 │                            │for slot1│              for slot 0 │  
 │                            │         │                         │  
 └────────────────────────────┴─────────┴─────────────────────────┘
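The layout in the diagram can be expressed as offset arithmetic. This is a Python sketch under stated assumptions: an 8-byte header (hypothetical, for illustration only), 16 lanes per vector as on x64 with 16-bit hashes and raw parts (assumed widths), and a 4096-byte page in the usage line (also an assumption). Each chunk places its hashes vector first and its slots vector right after it, while slot data is allocated from the end of the page backwards, so data for slot 0 sits at the very end. `slot_offsets` and `data_offsets` are hypothetical helpers, not functions from the PR.

```python
HEADER_BYTES = 8  # hypothetical header size, for illustration only
LANES = 16        # assumption: Vector256 holding 16-bit hashes (x64)
HASH_BYTES = 2    # assumption
RAW_BYTES = 2     # assumption
CHUNK_BYTES = LANES * (HASH_BYTES + RAW_BYTES)


def slot_offsets(index: int) -> tuple:
    """Byte offsets of slot `index`'s hash and raw value, from the page start."""
    chunk, lane = divmod(index, LANES)
    chunk_start = HEADER_BYTES + chunk * CHUNK_BYTES
    hash_offset = chunk_start + lane * HASH_BYTES                     # in the hashes vector
    raw_offset = chunk_start + LANES * HASH_BYTES + lane * RAW_BYTES  # in the slots vector
    return hash_offset, raw_offset


def data_offsets(lengths: list, page_bytes: int) -> list:
    """Start offsets of each slot's data, allocated from the end of the page backwards."""
    offsets = []
    top = page_bytes
    for length in lengths:
        top -= length
        offsets.append(top)
    return offsets


print(slot_offsets(16))                # first slot of the second chunk
print(data_offsets([10, 20], 4096))    # slot 0 at the end, slot 1 just before it
```

Free space is whatever remains between the end of the last chunk and the lowest data offset, which matches the empty middle area of the diagram.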

github-actions · 2024-07-26T10:48:31Z

| Package | Line Rate | Branch Rate | Health |
|---------|-----------|-------------|--------|
| Paprika | 84% | 79% | ➖ |
| Summary | 84% (4224 / 4999) | 79% (1336 / 1701) | ➖ |

Minimum allowed line rate is 75%

Scooletz added 6 commits July 23, 2024 13:52

more vectorization

a818450

TryFind loop shortened

dfe3f40

Search fixed

902f3d1

aligned search

a16c1c6

single vector at a time

2e8823f

SlottedArray made more vector-aware

b3e3d98

Scooletz added the 🐌 performance (Performance related issue) and 💥Breaking (The change introduces a storage breaking change.) labels Jul 24, 2024

Scooletz added 9 commits July 24, 2024 17:36

benchmarks reworked

c22ebee

unneeded removed

ef7f860

assert removed

0026366

slightly simplified payload retrieval

0d95544

Vector128 added

dae49e9

fix of the search

29025f1

dummy assert removed

0ef13aa

bug fixed

90c99b1

undo dummy testing

abfb04b

Scooletz force-pushed the slotted-array-vectorized branch from 39ad9a8 to abfb04b July 25, 2024 16:13

Scooletz added 4 commits July 26, 2024 10:28

benchmarks updated and GetPayload simplified

a69bd6c

hash collision benchmark

fe2ccd9

design updated

650045f

format

109b2db

Scooletz marked this pull request as ready for review July 26, 2024 10:50

Scooletz merged commit d91122a into main Jul 26, 2024
2 checks passed

Scooletz deleted the slotted-array-vectorized branch July 26, 2024 10:51

Scooletz mentioned this pull request Nov 20, 2024

Vectorized Defragmentation #439

Draft
