zstd: Branchless getBits for amd64 w/o BMI2 #640
This produces the same number of instructions, while requiring less generating code. Benchmarks on the Intel Core i7-3770K show a tiny speedup:

name  old speed  new speed  delta
Decoder_DecoderSmall/kppkn.gtb.zst-8  430MB/s ± 1%  437MB/s ± 1%  +1.60%  (p=0.000 n=10+9)
Decoder_DecoderSmall/geo.protodata.zst-8  1.11GB/s ± 1%  1.13GB/s ± 0%  +1.37%  (p=0.000 n=9+9)
Decoder_DecoderSmall/plrabn12.txt.zst-8  334MB/s ± 1%  339MB/s ± 1%  +1.41%  (p=0.000 n=9+10)
Decoder_DecoderSmall/lcet10.txt.zst-8  392MB/s ± 2%  404MB/s ± 1%  +3.05%  (p=0.000 n=10+10)
Decoder_DecoderSmall/asyoulik.txt.zst-8  355MB/s ± 2%  357MB/s ± 1%  ~  (p=0.315 n=10+9)
Decoder_DecoderSmall/alice29.txt.zst-8  344MB/s ± 1%  350MB/s ± 1%  +1.69%  (p=0.000 n=10+10)
Decoder_DecoderSmall/html_x_4.zst-8  2.34GB/s ± 1%  2.37GB/s ± 1%  +1.10%  (p=0.000 n=10+10)
Decoder_DecoderSmall/paper-100k.pdf.zst-8  3.75GB/s ± 0%  3.76GB/s ± 1%  ~  (p=0.182 n=9+10)
Decoder_DecoderSmall/fireworks.jpeg.zst-8  8.59GB/s ± 1%  8.58GB/s ± 1%  ~  (p=0.842 n=10+9)
Decoder_DecoderSmall/urls.10K.zst-8  561MB/s ± 1%  556MB/s ± 1%  -0.82%  (p=0.019 n=10+10)
Decoder_DecoderSmall/html.zst-8  900MB/s ± 1%  913MB/s ± 1%  +1.42%  (p=0.000 n=10+9)
Decoder_DecoderSmall/comp-data.bin.zst-8  399MB/s ± 1%  395MB/s ± 1%  -0.99%  (p=0.000 n=10+10)
Decoder_DecodeAll/kppkn.gtb.zst-8  518MB/s ± 0%  526MB/s ± 0%  +1.52%  (p=0.000 n=10+9)
Decoder_DecodeAll/geo.protodata.zst-8  1.28GB/s ± 0%  1.27GB/s ± 2%  ~  (p=0.739 n=10+10)
Decoder_DecodeAll/plrabn12.txt.zst-8  427MB/s ± 1%  433MB/s ± 1%  +1.24%  (p=0.000 n=10+10)
Decoder_DecodeAll/lcet10.txt.zst-8  480MB/s ± 1%  490MB/s ± 1%  +2.06%  (p=0.000 n=10+10)
Decoder_DecodeAll/asyoulik.txt.zst-8  435MB/s ± 0%  447MB/s ± 0%  +2.70%  (p=0.000 n=7+9)
Decoder_DecodeAll/alice29.txt.zst-8  422MB/s ± 0%  438MB/s ± 1%  +3.96%  (p=0.000 n=8+9)
Decoder_DecodeAll/html_x_4.zst-8  1.60GB/s ± 0%  1.61GB/s ± 0%  +0.99%  (p=0.000 n=9+10)
Decoder_DecodeAll/paper-100k.pdf.zst-8  4.55GB/s ± 1%  4.44GB/s ± 1%  -2.42%  (p=0.000 n=10+10)
Decoder_DecodeAll/fireworks.jpeg.zst-8  9.52GB/s ± 1%  9.47GB/s ± 2%  ~  (p=0.143 n=10+10)
Decoder_DecodeAll/urls.10K.zst-8  678MB/s ± 1%  684MB/s ± 0%  +0.83%  (p=0.000 n=10+10)
Decoder_DecodeAll/html.zst-8  1.05GB/s ± 0%  1.07GB/s ± 1%  +2.11%  (p=0.000 n=10+10)
Decoder_DecodeAll/comp-data.bin.zst-8  397MB/s ± 1%  391MB/s ± 1%  -1.37%  (p=0.000 n=10+10)
Decoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-8  437MB/s ± 0%  436MB/s ± 1%  -0.21%  (p=0.025 n=9+9)
Decoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-8  448MB/s ± 0%  451MB/s ± 0%  +0.70%  (p=0.000 n=9+9)
Decoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-8  478MB/s ± 0%  475MB/s ± 0%  -0.53%  (p=0.000 n=10+10)
Decoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-8  461MB/s ± 0%  470MB/s ± 0%  +2.07%  (p=0.000 n=8+9)
Decoder_DecodeAllFiles/e.txt/fastest-8  9.62GB/s ± 3%  9.62GB/s ± 2%  ~  (p=1.000 n=10+10)
Decoder_DecodeAllFiles/e.txt/default-8  391MB/s ± 0%  406MB/s ± 0%  +3.81%  (p=0.000 n=10+8)
Decoder_DecodeAllFiles/e.txt/better-8  438MB/s ± 0%  448MB/s ± 0%  +2.39%  (p=0.000 n=8+10)
Decoder_DecodeAllFiles/e.txt/best-8  500MB/s ± 0%  500MB/s ± 0%  ~  (p=0.119 n=9+9)
Decoder_DecodeAllFiles/fse-artifact3.bin/fastest-8  1.07GB/s ± 1%  1.04GB/s ± 1%  -2.61%  (p=0.000 n=10+10)
Decoder_DecodeAllFiles/fse-artifact3.bin/default-8  1.21GB/s ± 1%  1.19GB/s ± 1%  -1.33%  (p=0.000 n=10+10)
Decoder_DecodeAllFiles/fse-artifact3.bin/better-8  994MB/s ± 0%  990MB/s ± 0%  -0.42%  (p=0.002 n=10+9)
Decoder_DecodeAllFiles/fse-artifact3.bin/best-8  389MB/s ± 0%  381MB/s ± 0%  -2.00%  (p=0.000 n=8+10)
Decoder_DecodeAllFiles/gettysburg.txt/fastest-8  274MB/s ± 1%  274MB/s ± 1%  ~  (p=1.000 n=10+10)
Decoder_DecodeAllFiles/gettysburg.txt/default-8  224MB/s ± 1%  223MB/s ± 1%  -0.64%  (p=0.015 n=10+10)
Decoder_DecodeAllFiles/gettysburg.txt/better-8  228MB/s ± 1%  227MB/s ± 1%  -0.40%  (p=0.041 n=10+10)
Decoder_DecodeAllFiles/gettysburg.txt/best-8  225MB/s ± 1%  223MB/s ± 0%  -0.52%  (p=0.008 n=10+6)
Decoder_DecodeAllFiles/html.txt/fastest-8  599MB/s ± 1%  614MB/s ± 1%  +2.41%  (p=0.000 n=10+10)
Decoder_DecodeAllFiles/html.txt/default-8  601MB/s ± 0%  613MB/s ± 0%  +2.01%  (p=0.000 n=8+9)
Decoder_DecodeAllFiles/html.txt/better-8  626MB/s ± 1%  638MB/s ± 0%  +1.99%  (p=0.000 n=10+10)
Decoder_DecodeAllFiles/html.txt/best-8  601MB/s ± 0%  612MB/s ± 0%  +1.87%  (p=0.000 n=10+10)
Decoder_DecodeAllFiles/pi.txt/fastest-8  9.64GB/s ± 2%  9.66GB/s ± 1%  ~  (p=0.529 n=10+10)
Decoder_DecodeAllFiles/pi.txt/default-8  390MB/s ± 0%  403MB/s ± 0%  +3.48%  (p=0.000 n=10+10)
Decoder_DecodeAllFiles/pi.txt/better-8  439MB/s ± 0%  451MB/s ± 0%  +2.65%  (p=0.000 n=10+10)
Decoder_DecodeAllFiles/pi.txt/best-8  500MB/s ± 0%  499MB/s ± 0%  -0.27%  (p=0.009 n=7+10)
Decoder_DecodeAllFiles/pngdata.bin/fastest-8  1.70GB/s ± 1%  1.69GB/s ± 1%  -0.63%  (p=0.013 n=10+9)
Decoder_DecodeAllFiles/pngdata.bin/default-8  1.52GB/s ± 1%  1.51GB/s ± 0%  -0.75%  (p=0.000 n=10+9)
Decoder_DecodeAllFiles/pngdata.bin/better-8  1.92GB/s ± 0%  1.90GB/s ± 0%  -1.02%  (p=0.000 n=10+10)
Decoder_DecodeAllFiles/pngdata.bin/best-8  1.47GB/s ± 0%  1.46GB/s ± 0%  -0.88%  (p=0.000 n=10+9)
Decoder_DecodeAllFiles/sharnd.out/fastest-8  9.60GB/s ± 1%  9.67GB/s ± 1%  +0.67%  (p=0.029 n=10+10)
Decoder_DecodeAllFiles/sharnd.out/default-8  9.65GB/s ± 2%  9.71GB/s ± 1%  ~  (p=0.353 n=10+10)
Decoder_DecodeAllFiles/sharnd.out/better-8  9.67GB/s ± 1%  9.66GB/s ± 0%  ~  (p=0.549 n=10+9)
Decoder_DecodeAllFiles/sharnd.out/best-8  9.70GB/s ± 1%  9.61GB/s ± 0%  -0.91%  (p=0.010 n=10+9)
[Geo mean]  935MB/s  940MB/s  +0.57%
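For readers who have not looked at the bit reader, here is a rough Go sketch of what "branchless getBits" can mean. It is an illustration only, not the assembly from this patch: the function names, the parameter layout, and the rotate-and-mask trick are assumptions chosen for the example. The `& 63` on shift counts emulates how amd64 shift instructions use only the low 6 bits of the count, which is why a plain shift needs special handling when n is 0.

```go
package main

import (
	"fmt"
	"math/bits"
)

// getBitsBranchy (hypothetical name) returns the n bits that follow the
// first bitsRead bits of value, counting from the most significant bit.
// The explicit n == 0 check is needed because a shift by (64-0)&63 == 0
// would otherwise return the whole word instead of zero.
func getBitsBranchy(value uint64, bitsRead, n uint8) uint64 {
	if n == 0 {
		return 0
	}
	return (value << (bitsRead & 63)) >> ((64 - n) & 63)
}

// getBitsBranchless (hypothetical name) computes the same result without
// a conditional: rotating left by n moves the top n bits to the bottom,
// and the mask (1<<n)-1 keeps only those bits. For n == 0 the mask is 0,
// so the result is 0 with no special case.
func getBitsBranchless(value uint64, bitsRead, n uint8) uint64 {
	v := value << (bitsRead & 63)
	return bits.RotateLeft64(v, int(n)) & (uint64(1)<<n - 1)
}

func main() {
	const value = 0xABCD0123456789EF
	// Compare both variants for every width from 0 to 32 bits at a few
	// read offsets; the ranges are arbitrary choices for the demo.
	for _, bitsRead := range []uint8{0, 8, 31} {
		for n := uint8(0); n <= 32; n++ {
			a := getBitsBranchy(value, bitsRead, n)
			b := getBitsBranchless(value, bitsRead, n)
			if a != b {
				fmt.Printf("mismatch: bitsRead=%d n=%d %#x != %#x\n", bitsRead, n, a, b)
				return
			}
		}
	}
	fmt.Println("both variants agree")
}
```

Whether a form like this beats a compare-and-branch or conditional move depends on the CPU and on how predictable n == 0 is in the input, which is consistent with the mixed benchmark deltas above.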
Nice! I will do a few tests and merge if no problems show up.
I remember adding this, but abandoning it, since it wasn't a clear result. I reran the test with this patch, and the results are up and down:
With BMI it is a win, but without it, it seems worse for quite a few cases. Maybe using this for BMI and keeping the old method for plain x64 would be best?
This patch doesn't change the BMI2 path. So using this for BMI2 is just dropping the patch :) I see similar results on my machine: better on some benchmarks, worse on others. I find it hard to tell how these benchmarks relate to the more end-to-end Decode* benchmarks. Also, code size still goes down.
I wonder if something else is affecting the benchmarks, since all of the ones you posted are with BMI. Only
I should maybe have said this earlier, but my CPU does not have the BMI2 instructions.
@greatroar Ah, ok :) Actually that is great, since a speedup on your CPU is much more relevant than one on mine.