This repository has been archived by the owner on May 10, 2018. It is now read-only.

Superseek #120

Closed
wants to merge 6 commits into from

@benaadams benaadams commented Oct 5, 2016

Updated with results (benchmark included in Tests.Performance)

/cc @davidfowl

@benaadams benaadams changed the title from "Superseek" to "Improved Vector Seek" Oct 5, 2016
@benaadams benaadams changed the title from "Improved Vector Seek" to "Superseek" Oct 5, 2016
@benaadams benaadams force-pushed the superseek branch 3 times, most recently from 944a1a4 to 4e9654d October 7, 2016 23:21
@benaadams benaadams force-pushed the superseek branch 2 times, most recently from d954b74 to 58f6816 October 16, 2016 03:00

benaadams commented Oct 16, 2016

When Vector<byte>.Count == 16

BenchmarkDotNet.Core=v0.9.9.0
Processor=Intel(R) Core(TM) i5-3470S CPU 2.90GHz, ProcessorCount=4
CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]

Type=TrySliceToBenchmark  Mode=Throughput  Platform=X64
Jit=RyuJit  Toolchain=Clr  Runtime=Clr
| Method | Size | Median | StdDev | Scaled |
|---|---:|---:|---:|---:|
| NaiveTrySliceTo | 1 | 13.2926 ns | 0.2921 ns | 1.00 |
| VectorTernaryTrySliceTo | 1 | 16.4280 ns | 0.3826 ns | 1.24 |
| VectorMagicTrySliceTo | 1 | 15.7509 ns | 0.2967 ns | 1.18 |
| VectorLongMagicTrySliceTo | 1 | 16.0006 ns | 0.2944 ns | 1.20 |
| NaiveTrySliceTo | 2 | 14.9989 ns | 0.4907 ns | 1.00 |
| VectorTernaryTrySliceTo | 2 | 18.3470 ns | 0.4577 ns | 1.22 |
| VectorMagicTrySliceTo | 2 | 16.7865 ns | 0.3586 ns | 1.11 |
| VectorLongMagicTrySliceTo | 2 | 17.6176 ns | 0.1983 ns | 1.16 |
| NaiveTrySliceTo | 4 | 17.8322 ns | 0.1951 ns | 1.00 |
| VectorTernaryTrySliceTo | 4 | 20.4808 ns | 0.2275 ns | 1.15 |
| VectorMagicTrySliceTo | 4 | 19.5583 ns | 0.2834 ns | 1.10 |
| VectorLongMagicTrySliceTo | 4 | 20.1689 ns | 0.2960 ns | 1.13 |
| NaiveTrySliceTo | 7 | 21.8616 ns | 0.3208 ns | 1.00 |
| VectorTernaryTrySliceTo | 7 | 24.5644 ns | 0.3488 ns | 1.12 |
| VectorMagicTrySliceTo | 7 | 23.6222 ns | 0.3692 ns | 1.07 |
| VectorLongMagicTrySliceTo | 7 | 24.7915 ns | 0.4079 ns | 1.13 |
| NaiveTrySliceTo | 8 | 23.6697 ns | 0.4263 ns | 1.00 |
| VectorTernaryTrySliceTo | 8 | 26.4905 ns | 0.3170 ns | 1.12 |
| VectorMagicTrySliceTo | 8 | 24.9771 ns | 0.4185 ns | 1.06 |
| VectorLongMagicTrySliceTo | 8 | 17.4393 ns | 0.4846 ns | 0.74 |
| NaiveTrySliceTo | 15 | 33.9569 ns | 0.5298 ns | 1.00 |
| VectorTernaryTrySliceTo | 15 | 36.7626 ns | 0.4356 ns | 1.09 |
| VectorMagicTrySliceTo | 15 | 35.2061 ns | 0.4761 ns | 1.04 |
| VectorLongMagicTrySliceTo | 15 | 28.7885 ns | 0.3972 ns | 0.85 |
| NaiveTrySliceTo | 16 | 41.7755 ns | 0.9026 ns | 1.00 |
| VectorTernaryTrySliceTo | 16 | 25.0963 ns | 0.3550 ns | 0.60 |
| VectorMagicTrySliceTo | 16 | 19.7682 ns | 0.3651 ns | 0.47 |
| VectorLongMagicTrySliceTo | 16 | 20.6818 ns | 0.3460 ns | 0.49 |
| NaiveTrySliceTo | 31 | 55.3319 ns | 0.8148 ns | 1.00 |
| VectorTernaryTrySliceTo | 31 | 39.7991 ns | 0.5552 ns | 0.72 |
| VectorMagicTrySliceTo | 31 | 39.6184 ns | 0.3946 ns | 0.72 |
| VectorLongMagicTrySliceTo | 31 | 32.5610 ns | 0.3455 ns | 0.59 |
| NaiveTrySliceTo | 32 | 56.4384 ns | 0.5997 ns | 1.00 |
| VectorTernaryTrySliceTo | 32 | 28.3936 ns | 0.5015 ns | 0.50 |
| VectorMagicTrySliceTo | 32 | 23.4840 ns | 0.3352 ns | 0.42 |
| VectorLongMagicTrySliceTo | 32 | 24.9616 ns | 0.3948 ns | 0.44 |
| NaiveTrySliceTo | 40 | 63.5394 ns | 0.8637 ns | 1.00 |
| VectorTernaryTrySliceTo | 40 | 33.8772 ns | 0.8033 ns | 0.53 |
| VectorMagicTrySliceTo | 40 | 33.9677 ns | 0.3951 ns | 0.53 |
| VectorLongMagicTrySliceTo | 40 | 25.1011 ns | 0.5623 ns | 0.40 |
| NaiveTrySliceTo | 63 | 82.2659 ns | 1.1792 ns | 1.00 |
| VectorTernaryTrySliceTo | 63 | 46.3311 ns | 0.6486 ns | 0.56 |
| VectorMagicTrySliceTo | 63 | 47.8367 ns | 0.7080 ns | 0.58 |
| VectorLongMagicTrySliceTo | 63 | 40.5958 ns | 0.5921 ns | 0.49 |
| NaiveTrySliceTo | 64 | 84.5968 ns | 0.9552 ns | 1.00 |
| VectorTernaryTrySliceTo | 64 | 35.3680 ns | 0.4865 ns | 0.42 |
| VectorMagicTrySliceTo | 64 | 31.6917 ns | 0.3769 ns | 0.37 |
| VectorLongMagicTrySliceTo | 64 | 32.5373 ns | 0.4650 ns | 0.38 |
| NaiveTrySliceTo | 120 | 132.6752 ns | 1.7712 ns | 1.00 |
| VectorTernaryTrySliceTo | 120 | 49.5945 ns | 0.7436 ns | 0.37 |
| VectorMagicTrySliceTo | 120 | 52.9125 ns | 0.8374 ns | 0.40 |
| VectorLongMagicTrySliceTo | 120 | 44.3256 ns | 0.6461 ns | 0.33 |
| NaiveTrySliceTo | 127 | 139.2961 ns | 1.8908 ns | 1.00 |
| VectorTernaryTrySliceTo | 127 | 59.5576 ns | 0.6872 ns | 0.43 |
| VectorMagicTrySliceTo | 127 | 62.8391 ns | 0.7938 ns | 0.45 |
| VectorLongMagicTrySliceTo | 127 | 54.7734 ns | 0.5165 ns | 0.39 |
| NaiveTrySliceTo | 128 | 140.4608 ns | 1.5607 ns | 1.00 |
| VectorTernaryTrySliceTo | 128 | 48.3670 ns | 1.4507 ns | 0.35 |
| VectorMagicTrySliceTo | 128 | 46.7192 ns | 0.5257 ns | 0.33 |
| VectorLongMagicTrySliceTo | 128 | 48.0396 ns | 3.8066 ns | 0.36 |
| NaiveTrySliceTo | 1023 | 925.8524 ns | 10.8674 ns | 1.00 |
| VectorTernaryTrySliceTo | 1023 | 252.0601 ns | 5.1844 ns | 0.27 |
| VectorMagicTrySliceTo | 1023 | 236.0024 ns | 3.4854 ns | 0.26 |
| VectorLongMagicTrySliceTo | 1023 | 226.1428 ns | 4.2042 ns | 0.25 |
| NaiveTrySliceTo | 1024 | 917.5681 ns | 12.3548 ns | 1.00 |
| VectorTernaryTrySliceTo | 1024 | 236.1087 ns | 6.6068 ns | 0.26 |
| VectorMagicTrySliceTo | 1024 | 232.7630 ns | 3.9273 ns | 0.25 |
| VectorLongMagicTrySliceTo | 1024 | 238.8622 ns | 8.6876 ns | 0.26 |
| NaiveTrySliceTo | 1032 | 940.5847 ns | 11.5850 ns | 1.00 |
| VectorTernaryTrySliceTo | 1032 | 238.3433 ns | 5.6128 ns | 0.26 |
| VectorMagicTrySliceTo | 1032 | 230.8612 ns | 4.3100 ns | 0.25 |
| VectorLongMagicTrySliceTo | 1032 | 220.9751 ns | 3.7844 ns | 0.24 |


benaadams commented Oct 16, 2016

When Vector<byte>.Count == 32

BenchmarkDotNet.Core=v0.9.9.0
Processor=Intel(R) Core(TM) i7-4720HQ CPU 2.60GHz, ProcessorCount=8
CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]

Type=TrySliceToBenchmark Mode=Throughput Platform=X64
Jit=RyuJit Toolchain=Clr Runtime=Clr

| Method | Size | Median | StdDev | Scaled |
|---|---:|---:|---:|---:|
| NaiveTrySliceTo | 1 | 13.1224 ns | 0.2708 ns | 1.00 |
| VectorTernaryTrySliceTo | 1 | 23.4971 ns | 0.5550 ns | 1.80 |
| VectorMagicTrySliceTo | 1 | 22.5685 ns | 0.3495 ns | 1.72 |
| VectorLongMagicTrySliceTo | 1 | 22.3895 ns | 0.4108 ns | 1.70 |
| NaiveTrySliceTo | 2 | 13.6805 ns | 0.9513 ns | 1.00 |
| VectorTernaryTrySliceTo | 2 | 24.2508 ns | 0.4448 ns | 1.71 |
| VectorMagicTrySliceTo | 2 | 23.6243 ns | 0.3516 ns | 1.66 |
| VectorLongMagicTrySliceTo | 2 | 23.5750 ns | 0.3723 ns | 1.66 |
| NaiveTrySliceTo | 4 | 15.3765 ns | 0.2881 ns | 1.00 |
| VectorTernaryTrySliceTo | 4 | 26.2891 ns | 0.4192 ns | 1.71 |
| VectorMagicTrySliceTo | 4 | 25.5430 ns | 0.4636 ns | 1.66 |
| VectorLongMagicTrySliceTo | 4 | 25.9806 ns | 0.3686 ns | 1.70 |
| NaiveTrySliceTo | 7 | 18.0889 ns | 0.2337 ns | 1.00 |
| VectorTernaryTrySliceTo | 7 | 29.7649 ns | 0.7047 ns | 1.64 |
| VectorMagicTrySliceTo | 7 | 29.5158 ns | 0.6820 ns | 1.63 |
| VectorLongMagicTrySliceTo | 7 | 28.8846 ns | 0.5705 ns | 1.58 |
| NaiveTrySliceTo | 8 | 18.9347 ns | 0.6359 ns | 1.00 |
| VectorTernaryTrySliceTo | 8 | 29.9819 ns | 0.5715 ns | 1.58 |
| VectorMagicTrySliceTo | 8 | 30.0269 ns | 0.3408 ns | 1.59 |
| VectorLongMagicTrySliceTo | 8 | 23.8496 ns | 0.3687 ns | 1.26 |
| NaiveTrySliceTo | 15 | 25.1381 ns | 0.3085 ns | 1.00 |
| VectorTernaryTrySliceTo | 15 | 36.1487 ns | 0.3825 ns | 1.43 |
| VectorMagicTrySliceTo | 15 | 37.1586 ns | 0.7013 ns | 1.48 |
| VectorLongMagicTrySliceTo | 15 | 32.2881 ns | 0.4374 ns | 1.28 |
| NaiveTrySliceTo | 16 | 34.6617 ns | 1.0395 ns | 1.00 |
| VectorTernaryTrySliceTo | 16 | 37.0704 ns | 0.2338 ns | 1.08 |
| VectorMagicTrySliceTo | 16 | 38.3068 ns | 0.2642 ns | 1.12 |
| VectorLongMagicTrySliceTo | 16 | 28.3099 ns | 0.4729 ns | 0.82 |
| NaiveTrySliceTo | 31 | 39.4123 ns | 0.4550 ns | 1.00 |
| VectorTernaryTrySliceTo | 31 | 50.4023 ns | 0.6387 ns | 1.28 |
| VectorMagicTrySliceTo | 31 | 62.9910 ns | 1.0847 ns | 1.59 |
| VectorLongMagicTrySliceTo | 31 | 38.9832 ns | 0.7505 ns | 0.98 |
| NaiveTrySliceTo | 32 | 40.3215 ns | 0.4165 ns | 1.00 |
| VectorTernaryTrySliceTo | 32 | 40.0326 ns | 0.9143 ns | 0.99 |
| VectorMagicTrySliceTo | 32 | 26.9822 ns | 0.4375 ns | 0.67 |
| VectorLongMagicTrySliceTo | 32 | 26.7081 ns | 0.4662 ns | 0.67 |
| NaiveTrySliceTo | 40 | 58.0555 ns | 1.0652 ns | 1.00 |
| VectorTernaryTrySliceTo | 40 | 32.7064 ns | 0.8279 ns | 0.56 |
| VectorMagicTrySliceTo | 40 | 34.1471 ns | 0.6374 ns | 0.58 |
| VectorLongMagicTrySliceTo | 40 | 27.7423 ns | 0.4539 ns | 0.47 |
| NaiveTrySliceTo | 63 | 74.6739 ns | 0.9507 ns | 1.00 |
| VectorTernaryTrySliceTo | 63 | 53.8320 ns | 2.2042 ns | 0.74 |
| VectorMagicTrySliceTo | 63 | 61.1816 ns | 2.3135 ns | 0.84 |
| VectorLongMagicTrySliceTo | 63 | 41.8265 ns | 0.6758 ns | 0.57 |
| NaiveTrySliceTo | 64 | 76.1424 ns | 1.0445 ns | 1.00 |
| VectorTernaryTrySliceTo | 64 | 43.0024 ns | 1.1439 ns | 0.57 |
| VectorMagicTrySliceTo | 64 | 31.4096 ns | 0.9771 ns | 0.41 |
| VectorLongMagicTrySliceTo | 64 | 31.9771 ns | 0.6103 ns | 0.42 |
| NaiveTrySliceTo | 120 | 117.0855 ns | 0.9939 ns | 1.00 |
| VectorTernaryTrySliceTo | 120 | 54.5789 ns | 1.0977 ns | 0.46 |
| VectorMagicTrySliceTo | 120 | 57.2428 ns | 0.4922 ns | 0.49 |
| VectorLongMagicTrySliceTo | 120 | 42.7898 ns | 0.3898 ns | 0.37 |
| NaiveTrySliceTo | 127 | 130.6791 ns | 0.9210 ns | 1.00 |
| VectorTernaryTrySliceTo | 127 | 59.1220 ns | 1.2679 ns | 0.46 |
| VectorMagicTrySliceTo | 127 | 64.6583 ns | 0.3360 ns | 0.49 |
| VectorLongMagicTrySliceTo | 127 | 50.0398 ns | 1.0123 ns | 0.38 |
| NaiveTrySliceTo | 128 | 124.4485 ns | 0.9432 ns | 1.00 |
| VectorTernaryTrySliceTo | 128 | 49.0869 ns | 2.2919 ns | 0.40 |
| VectorMagicTrySliceTo | 128 | 38.3122 ns | 2.7265 ns | 0.31 |
| VectorLongMagicTrySliceTo | 128 | 39.0482 ns | 2.2181 ns | 0.31 |
| NaiveTrySliceTo | 1023 | 916.2417 ns | 15.6441 ns | 1.00 |
| VectorTernaryTrySliceTo | 1023 | 170.3470 ns | 5.8421 ns | 0.19 |
| VectorMagicTrySliceTo | 1023 | 166.7208 ns | 4.0725 ns | 0.18 |
| VectorLongMagicTrySliceTo | 1023 | 142.9391 ns | 2.5784 ns | 0.16 |
| NaiveTrySliceTo | 1024 | 921.8478 ns | 12.3803 ns | 1.00 |
| VectorTernaryTrySliceTo | 1024 | 157.4972 ns | 5.3816 ns | 0.17 |
| VectorMagicTrySliceTo | 1024 | 142.3913 ns | 3.8491 ns | 0.16 |
| VectorLongMagicTrySliceTo | 1024 | 142.3249 ns | 4.4722 ns | 0.16 |
| NaiveTrySliceTo | 1032 | 927.0191 ns | 12.8284 ns | 1.00 |
| VectorTernaryTrySliceTo | 1032 | 143.7824 ns | 5.3965 ns | 0.15 |
| VectorMagicTrySliceTo | 1032 | 138.4441 ns | 3.2600 ns | 0.15 |
| VectorLongMagicTrySliceTo | 1032 | 134.0852 ns | 3.3459 ns | 0.15 |

@benaadams

The Vector and long paths slice on every loop iteration, whereas the naive path just iterates an array, so the slice doesn't seem to add too much overhead.

KodrAus commented Oct 16, 2016

@benaadams Is there a different path taken for *MagicTrySliceTo when the length is not a multiple of 8 that's killing its runtime? Is that actually something that comes up in practice since the input to this function is a Vector?

@benaadams

@KodrAus it's the byte-by-byte iteration that happens at the end to pick up the remaining bytes when the input is not a multiple of the vector size and the search term has not been found (the match is the very last byte in all these tests, so it has to get to the end).

E.g. on AVX that can be up to 31 bytes.
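
For readers following along, here is a minimal sketch of the general shape being described: a vector-width loop followed by a byte-by-byte tail. It is not the code from this PR. The `SeekSketch` class and `IndexOf` method are hypothetical names, and the in-block scan is deliberately naive rather than the Ternary or Magic techniques benchmarked above. It assumes only the `System.Numerics.Vector<T>` API.

```csharp
using System;
using System.Numerics;

// Hypothetical illustration only; not the PR's implementation.
static class SeekSketch
{
    // Returns the index of the first occurrence of `value` in `data`, or -1 if absent.
    public static int IndexOf(byte[] data, byte value)
    {
        int width = Vector<byte>.Count;          // e.g. 16 or 32, depending on hardware
        var needle = new Vector<byte>(value);    // `value` broadcast to every lane
        int i = 0;

        // Vector loop: compare `width` bytes per iteration.
        for (; i <= data.Length - width; i += width)
        {
            var block = new Vector<byte>(data, i);
            if (Vector.EqualsAny(block, needle))
            {
                // A lane matched; find it with a short scalar scan of this block.
                for (int j = i; j < i + width; j++)
                {
                    if (data[j] == value) return j;
                }
            }
        }

        // Scalar tail: up to width - 1 leftover bytes, checked one at a time.
        for (; i < data.Length; i++)
        {
            if (data[i] == value) return i;
        }
        return -1;
    }
}
```

With a 63-byte input whose only match is the last byte, the tail loop in this sketch walks 15 bytes when `Vector<byte>.Count` is 16 and 31 bytes when it is 32, which is the worst case for sizes just under a multiple of the vector width.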

@benaadams

(i.e. this is worst-case searching)

KodrAus commented Oct 16, 2016

Gotcha 👍
