zstd: translate fseDecoder.buildDtable into asm #598

Merged (2 commits) on Jun 20, 2022


@WojciechMula WojciechMula commented May 20, 2022

In our tests, buildDtable takes 6-7% of the total time, so it seems worth trying to make it a bit faster.

Below are the results from a Skylake machine (my IceLake AWS instance is temporarily unavailable). Not significantly faster, but faster overall, with just a few regressions.

# go test -run XYZ -bench BenchmarkDecoder
benchmark                                                                old ns/op     new ns/op     delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-4                            4153807       4115429       -0.92%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-4                        1053788       1040451       -1.27%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-4                         13650117      13593589      -0.41%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-4                           10812922      10667411      -1.35%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-4                         3447687       3429851       -0.52%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-4                          4262521       4225354       -0.87%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-4                             1785442       1771480       -0.78%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-4                       249628        237774        -4.75%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-4                       142184        142433        +0.18%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-4                             13100333      13135150      +0.27%
BenchmarkDecoder_DecoderSmall/html.zst-4                                 1209631       1176038       -2.78%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-4                        75525         74250         -1.69%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-4                               345085        339763        -1.54%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-4                           93202         89823         -3.63%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-4                            1060139       1048077       -1.14%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-4                              949172        940920        -0.87%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-4                            281470        281612        +0.05%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-4                             334139        328436        -1.71%
BenchmarkDecoder_DecodeAll/html_x_4.zst-4                                344507        340544        -1.15%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-4                          22584         21855         -3.23%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-4                          11244         11228         -0.14%
BenchmarkDecoder_DecodeAll/urls.10K.zst-4                                1039905       1021403       -1.78%
BenchmarkDecoder_DecodeAll/html.zst-4                                    97440         93920         -3.61%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-4                           9481          9291          -2.00%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-4      897903        891388        -0.73%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-4      833709        830127        -0.43%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-4       809105        809356        +0.03%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-4         812710        808218        -0.55%
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-4                          9172          9187          +0.16%
BenchmarkDecoder_DecodeAllFiles/e.txt/default-4                          241586        240073        -0.63%
BenchmarkDecoder_DecodeAllFiles/e.txt/better-4                           226332        227843        +0.67%
BenchmarkDecoder_DecodeAllFiles/e.txt/best-4                             206827        203462        -1.63%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-4              3994          3993          -0.03%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-4              3523          3504          -0.54%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-4               4314          4328          +0.32%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-4                 8749          8754          +0.06%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-4                 4974          4983          +0.18%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-4                 6246          6289          +0.69%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-4                  6278          6256          -0.35%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-4                    6397          6391          -0.09%
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-4                       72596         70633         -2.70%
BenchmarkDecoder_DecodeAllFiles/html.txt/default-4                       73196         70761         -3.33%
BenchmarkDecoder_DecodeAllFiles/html.txt/better-4                        70028         68187         -2.63%
BenchmarkDecoder_DecodeAllFiles/html.txt/best-4                          73537         72204         -1.81%
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-4                         9225          9177          -0.52%
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-4                         242342        241565        -0.32%
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-4                          225453        224727        -0.32%
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-4                            206138        205345        -0.38%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-4                    35372         35066         -0.87%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-4                    36792         36745         -0.13%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-4                     27590         27533         -0.21%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-4                       40185         40050         -0.34%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-4                     9210          9169          -0.45%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-4                     9122          9163          +0.45%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-4                      9190          9117          -0.79%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-4                        9121          9130          +0.10%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-4     405244        399090        -1.52%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-4     386821        388027        +0.31%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-4      374320        366639        -2.05%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-4        374411        372045        -0.63%
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-4                         4456          4452          -0.09%
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-4                         109223        110546        +1.21%
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-4                          100505        98860         -1.64%
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-4                            86812         87460         +0.75%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-4             1765          1777          +0.68%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-4             1625          1619          -0.37%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-4              1997          2003          +0.30%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-4                3087          3205          +3.82%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-4                2340          2382          +1.79%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-4                2523          2512          -0.44%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-4                 2485          2511          +1.05%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-4                   2649          2694          +1.70%
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-4                      34176         33173         -2.93%
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-4                      34407         33186         -3.55%
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-4                       33023         32581         -1.34%
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-4                         34486         33471         -2.94%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-4                        4517          4443          -1.64%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-4                        110769        109713        -0.95%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-4                         99117         98769         -0.35%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-4                           86987         86157         -0.95%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-4                   15665         16068         +2.57%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-4                   15691         16235         +3.47%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-4                    11975         12118         +1.19%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-4                      17022         17635         +3.60%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-4                    4468          4517          +1.10%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-4                    4444          4459          +0.34%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-4                     4479          4474          -0.11%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-4                       4443          4462          +0.43%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-4                       160737        158103        -1.64%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-4                   43385         41730         -3.81%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-4                    494467        489708        -0.96%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-4                      418782        416916        -0.45%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-4                    128830        127791        -0.81%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-4                     159944        157321        -1.64%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-4                        159625        154610        -3.14%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-4                  10789         9791          -9.25%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-4                  5440          5517          +1.42%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-4                        462831        445809        -3.68%
BenchmarkDecoder_DecodeAllParallel/html.zst-4                            45181         44277         -2.00%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-4                   4463          4450          -0.29%
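
For readers checking the table: the delta column is the relative change in ns/op from old to new, so negative values mean the new code is faster. A minimal sketch of that arithmetic follows; the helper name `deltaPercent` is made up for illustration and is not part of any benchmark tool.

```go
package main

import "fmt"

// deltaPercent returns the relative change from oldNs to newNs in percent:
// (new - old) / old * 100. A negative result means the new code is faster.
// The function name is hypothetical, chosen only for this sketch.
func deltaPercent(oldNs, newNs float64) float64 {
	return (newNs - oldNs) / oldNs * 100
}

func main() {
	// ns/op values copied from the DecoderSmall/paper-100k.pdf.zst row above.
	fmt.Printf("%+.2f%%\n", deltaPercent(249628, 237774)) // prints -4.75%
}
```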

@WojciechMula WojciechMula force-pushed the zstd-fse-build-dtable branch from 07789a8 to f8322cd Compare June 17, 2022 12:47
@WojciechMula

I paused development, but I've finally gotten back to this translation. (Some tests still don't pass, and for now I can't figure out why.)

@klauspost

@WojciechMula Great to have you back! What is the failure?

@WojciechMula

@klauspost A funny thing: when reviewing the diff on GitHub a few minutes ago, I immediately spotted a mistake. :) I had put the wrong condition in the loop (JLE rather than JL).
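
To illustrate why swapping these two conditions matters: a loop back-edge taken on "less" (JL) corresponds to `i < n`, while one taken on "less or equal" (JLE) corresponds to `i <= n`, which runs one extra iteration for non-negative `n`. The sketch below is not the loop from this PR (that code is not quoted here); the function names are invented for the example.

```go
package main

import "fmt"

// countLess mirrors a loop whose back-edge is taken while i < n (JL).
func countLess(n int) int {
	iterations := 0
	for i := 0; i < n; i++ {
		iterations++
	}
	return iterations
}

// countLessOrEqual mirrors a loop whose back-edge is taken while i <= n (JLE).
// For n >= 0 it performs exactly one more iteration than countLess.
func countLessOrEqual(n int) int {
	iterations := 0
	for i := 0; i <= n; i++ {
		iterations++
	}
	return iterations
}

func main() {
	fmt.Println(countLess(4), countLessOrEqual(4)) // prints 4 5
}
```

When the loop indexes a table, that extra iteration either touches one element too many or, with the conditions swapped the other way, skips the last one.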

@klauspost

@WojciechMula Having some time away from code will often do that :)

@WojciechMula WojciechMula force-pushed the zstd-fse-build-dtable branch from f8322cd to c2f7314 Compare June 17, 2022 14:35
@WojciechMula WojciechMula marked this pull request as ready for review June 17, 2022 15:01
@WojciechMula WojciechMula requested a review from klauspost June 17, 2022 15:01
@WojciechMula

> @WojciechMula Having some time away from code will often do that :)

I have many years of experience, but this property of our brains still amazes me. :)

MOVWQZX(ptr, nextState)

// symbolNext[symbol] = nextState + 1
{
@WojciechMula

Hmm, maybe a simple INCQ(ptr) wouldn't be bad?

@WojciechMula

Quick benchmarks showed more regressions, but maybe it's just noise.

}

// newState := (nextState << nBits) - tableSize
newState := GP64()
@WojciechMula

use Copy64

SHLQ(reg.CL, newState)
SUBQ(b.tableSize, newState)

{
@WojciechMula

Remove; these are remnants from debugging.
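
For context, the SHLQ/SUBQ fragment quoted in this review corresponds to the scalar step `newState := (nextState << nBits) - tableSize`. Below is a minimal Go sketch of that step for a single table cell. It assumes the usual FSE construction, in which `nBits` is the table log minus the index of the highest set bit of `nextState`; the names `entry` and `highBit` are hypothetical and are not taken from the library.

```go
package main

import (
	"fmt"
	"math/bits"
)

// highBit returns the index of the highest set bit of v.
// v must be non-zero.
func highBit(v uint32) uint8 {
	return uint8(bits.Len32(v) - 1)
}

// entry computes the number of bits to read and the base of the next state
// for one decoding-table cell. Assumption: nBits = tableLog - highBit(nextState),
// as in the standard FSE table build; nextState must be non-zero.
func entry(tableLog uint8, nextState uint16) (nBits uint8, newState uint16) {
	tableSize := uint16(1) << tableLog
	nBits = tableLog - highBit(uint32(nextState))
	// newState := (nextState << nBits) - tableSize
	newState = (nextState << nBits) - tableSize
	return nBits, newState
}

func main() {
	// tableLog = 5 gives tableSize = 32; take a symbol whose next state is 3.
	nBits, newState := entry(5, 3)
	fmt.Println(nBits, newState) // prints 4 16
}
```

In the full loop, `symbolNext[symbol]` is also bumped to `nextState + 1` before moving on to the next cell, as the comment in the reviewed hunk says.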


@klauspost klauspost left a comment

LGTM. Speedup is what you can expect, since overall usage is so low.

@klauspost klauspost merged commit 9bbb415 into klauspost:master Jun 20, 2022
@klauspost klauspost deleted the zstd-fse-build-dtable branch June 20, 2022 08:42
klauspost added a commit that referenced this pull request Sep 24, 2022
Regression from #598 causing excessive heap allocations.
klauspost added a commit that referenced this pull request Sep 25, 2022
Regression from #598 causing excessive heap allocations.