Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving Qute Escaper to be as branch-free as possible #45546

Merged
merged 1 commit into from
Feb 10, 2025

Conversation

franz1981
Copy link
Contributor

@franz1981 franz1981 commented Jan 13, 2025

This is a show case of the approach at lemire/Code-used-on-Daniel-Lemire-s-blog#116 but biased for the one and two replacement latin chars case.
It could be expanded to cover for non-latin and the 6 bytes replacement too, with more painfull and complex (in term of logic) changes - but it looks already complex as it is IMO.

Feedbacks are wellcome as questions.
I didn't yet benchmarked it (is sadly low in my prio list - so IDK when I'll have time to contribute a proper bench which stress the branch-predictor in qute-benchmark) - and there's a good chance my "bet" to use StringBuilder to simplify it, making use of the horrible setLength in the hot path, won't pay off.
In such unfortunate case, I will move to using char[] despite it requires latinness checks due to compact string, on String construction :"(

@quarkus-bot quarkus-bot bot added the area/qute The template engine label Jan 13, 2025
@franz1981 franz1981 marked this pull request as ready for review January 15, 2025 13:55
@franz1981
Copy link
Contributor Author

franz1981 commented Jan 15, 2025

Using the benchmark at mkouba/qute-benchmarks#1

shows these differences in perf - now:

Benchmark                   (ctrlProbabibility)  (latinCharsProbability)  (replacementProbability)  (samples)  (size)  Mode  Cnt   Score   Error  Units
JsonEscaping.escape                           0                      100                         0        100      32  avgt   10  10.129 ± 0.009  ns/op
JsonEscaping.escape                           0                      100                         0      10000      32  avgt   10  10.129 ± 0.038  ns/op
JsonEscaping.escape                           0                      100                        10        100      32  avgt   10  45.312 ± 0.193  ns/op
JsonEscaping.escape                           0                      100                        10      10000      32  avgt   10  55.989 ± 1.826  ns/op
JsonEscaping.escape                          10                      100                        10        100      32  avgt   10  44.131 ± 2.367  ns/op
JsonEscaping.escape                          10                      100                        10      10000      32  avgt   10  77.187 ± 2.496  ns/op

vs before

Benchmark            (ctrlProbabibility)  (latinCharsProbability)  (replacementProbability)  (samples)  (size)  Mode  Cnt    Score    Error  Units
JsonEscaping.escape                    0                      100                         0        100      32  avgt   10   16.864 ±  0.514  ns/op
JsonEscaping.escape                    0                      100                         0      10000      32  avgt   10   66.942 ±  0.524  ns/op
JsonEscaping.escape                    0                      100                        10        100      32  avgt   10   97.493 ±  5.357  ns/op
JsonEscaping.escape                    0                      100                        10      10000      32  avgt   10  220.471 ±  7.429  ns/op
JsonEscaping.escape                   10                      100                        10        100      32  avgt   10  178.098 ± 15.693  ns/op
JsonEscaping.escape                   10                      100                        10      10000      32  avgt   10  319.793 ± 40.863  ns/op

which is a good improvement - which tends to pay-off more as the number of chars increases

Fyi @mkouba the relevant ones are using 10000 samples since with just 100 the data are very predictable for my CPU model and you cannot really see the benefits of reducing the number of branches

This comment has been minimized.

@franz1981
Copy link
Contributor Author

franz1981 commented Jan 15, 2025

@galderz on mkouba/qute-benchmarks#1

this implementation should really shine with native image, since it doesn't use StringBuilder :)

This comment has been minimized.

@mkouba
Copy link
Contributor

mkouba commented Jan 17, 2025

Using the benchmark at mkouba/qute-benchmarks#1

I had to rewrite the JsonEscaping benchmark so that it's using a template instead of JsonEscaper because this class was only introduced in 3.18 and we need to compile/run benchmarks for older versions as well...

shows these differences in perf with before:

Benchmark                   (ctrlProbabibility)  (latinCharsProbability)  (replacementProbability)  (samples)  (size)  Mode  Cnt   Score   Error  Units
JsonEscaping.escape                           0                      100                         0        100      32  avgt   10  10.129 ± 0.009  ns/op
JsonEscaping.escape                           0                      100                         0      10000      32  avgt   10  10.129 ± 0.038  ns/op
JsonEscaping.escape                           0                      100                        10        100      32  avgt   10  45.312 ± 0.193  ns/op
JsonEscaping.escape                           0                      100                        10      10000      32  avgt   10  55.989 ± 1.826  ns/op
JsonEscaping.escape                          10                      100                        10        100      32  avgt   10  44.131 ± 2.367  ns/op
JsonEscaping.escape                          10                      100                        10      10000      32  avgt   10  77.187 ± 2.496  ns/op

vs 7ec85cc

Benchmark            (ctrlProbabibility)  (latinCharsProbability)  (replacementProbability)  (samples)  (size)  Mode  Cnt    Score    Error  Units
JsonEscaping.escape                    0                      100                         0        100      32  avgt   10   16.864 ±  0.514  ns/op
JsonEscaping.escape                    0                      100                         0      10000      32  avgt   10   66.942 ±  0.524  ns/op
JsonEscaping.escape                    0                      100                        10        100      32  avgt   10   97.493 ±  5.357  ns/op
JsonEscaping.escape                    0                      100                        10      10000      32  avgt   10  220.471 ±  7.429  ns/op
JsonEscaping.escape                   10                      100                        10        100      32  avgt   10  178.098 ± 15.693  ns/op
JsonEscaping.escape                   10                      100                        10      10000      32  avgt   10  319.793 ± 40.863  ns/op

which is a good improvement - which tends to pay-off more as the number of chars increases

Aren't those numbers swapped given the fact that avgt mode is used (average time per per operation - lower is better)?

Fyi @mkouba the relevant ones are using 10000 samples since with just 100 the data are very predictable for my CPU model and you cannot really see the benefits of reducing the number of branches

Ok 👍

@franz1981
Copy link
Contributor Author

franz1981 commented Jan 17, 2025

using a template instead of JsonEscaper

Numbers still look the same? Did you checked 🙏?

Aren't those numbers swapped given the fact that avgt mode is used (average time per per operation - lower is better)?

Ops 🤣 I copied in the wrong order , let me fix it

@franz1981 franz1981 marked this pull request as draft January 20, 2025 15:02
@franz1981
Copy link
Contributor Author

The HTML one can still made faster, working on it

@quarkus-bot quarkus-bot bot added area/devtools Issues/PR related to maven, gradle, platform and cli tooling/plugins area/documentation labels Jan 20, 2025
@franz1981 franz1981 marked this pull request as ready for review January 20, 2025 15:34
Copy link

quarkus-bot bot commented Jan 20, 2025

Status for workflow Quarkus Documentation CI

This is the status report for running Quarkus Documentation CI on commit d4d3d81.

✅ The latest workflow run for the pull request has completed successfully.

It should be safe to merge provided you have a look at the other checks in the summary.

Warning

There are other workflow runs running, you probably need to wait for their status before merging.

Copy link

🎊 PR Preview 04deb60 has been successfully built and deployed to https://quarkus-pr-main-45546-preview.surge.sh/version/main/guides/

  • Images of blog posts older than 3 months are not available.
  • Newsletters older than 3 months are not available.

This comment has been minimized.

This comment has been minimized.

@franz1981
Copy link
Contributor Author

I will provide soon some up-to-date numbers of the html version too

@franz1981
Copy link
Contributor Author

franz1981 commented Jan 27, 2025

These are the numbers while using the HtmlEscaping benchmark at mkouba/qute-benchmarks#2:

Benchmark            (branchfull)  (latinCharsProbability)  (replacementProbability)  (samples)  (size)  Mode  Cnt    Score    Error  Units
HtmlEscaping.escape          true                      100                         0        100      32  avgt   10   59.623 ±  7.042  ns/op
HtmlEscaping.escape          true                      100                         0      10000      32  avgt   10   83.220 ±  0.278  ns/op
HtmlEscaping.escape          true                      100                        10        100      32  avgt   10  144.831 ±  7.340  ns/op
HtmlEscaping.escape          true                      100                        10      10000      32  avgt   10  253.461 ±  1.642  ns/op
HtmlEscaping.escape         false                      100                         0        100      32  avgt   10   68.238 ±  0.359  ns/op
HtmlEscaping.escape         false                      100                         0      10000      32  avgt   10   69.189 ±  3.947  ns/op
HtmlEscaping.escape         false                      100                        10        100      32  avgt   10  167.388 ± 10.995  ns/op
HtmlEscaping.escape         false                      100                        10      10000      32  avgt   10  204.317 ±  2.408  ns/op

Which, as anticipated, are not "wow" - with a small regression in case the pattern of escaping is fully predictable (see 59.6 ns/op vs 68 ns/op), which makes the JIT able to replace the switch with a branch guard, which basically perform a single comparison in case no escaping is needed, falling back to the switch only if it's not matched.
In addition to this, since the number of samples is pretty low (just 100) it enables the super good branchpredictor of Ryzen to shine - and make such branch well predicted, turning the numbers in favour of the branchfull case, but if we use a more realistic (considered still that it's a synthetic test scenario) number of samples to mimic unpredictability of the world (i.e. 10K samples) this is what it looks like:

Benchmark            (branchfull)  (latinCharsProbability)  (replacementProbability)  (samples)  (size)  Mode  Cnt    Score    Error  Units
HtmlEscaping.escape          true                      100                         0      10000      32  avgt   10   83.220 ±  0.278  ns/op
HtmlEscaping.escape          true                      100                        10      10000      32  avgt   10  253.461 ±  1.642  ns/op
HtmlEscaping.escape         false                      100                         0      10000      32  avgt   10   69.189 ±  3.947  ns/op
HtmlEscaping.escape         false                      100                        10      10000      32  avgt   10  204.317 ±  2.408  ns/op

Still not a huge win - since I didn't modified the original algorithm at CharReplacementResultMapper (which sadly uses StringBuilders - which doesn't perform great in native mode and leave the JVM mode to have a much worse unrolling factor) - but decent i.e. ~20% speedup.
I could make it much faster if, similar to the Json version, I'll fully move into using char[], but here the other limiting factor is that Html escaping have much bigger replacements (i.e. 6 chars) for "common" escaping - and it means that, in order to reduce the footprint, I had to use 2 look-up tables to first search the index of the replacement and later replace it, which is

256 bytes byte[] for the index + String[8] with the replacements

This "double lookup" form a data-dependency chain, since we have to load the index before accessing the replacement which is the most relevant factor compared to the original code which just need to correctly speculate the result of the branch instruction (the switch) - making it almost free, when correctly predicted.

A better approach would use a long[256] (similar to Json), but would requires going "all in" and not use Strings for the replacements (since these would be reconstructured from the long) and pack in each long the length of the replacement (if any) with the replacement encoded into.
And clearly this would prevent me to reuse the existing CharReplacementResultMapper since the API is just not "right".

Let me know if you have further questions.

@franz1981
Copy link
Contributor Author

franz1981 commented Jan 27, 2025

I've further pushed 1cd3ae3 which is using some knowledge of how branch prediction work (which can use some history of the previous taken branch address not just the current one) to improve it another bit, although without modifying the base class.
This improvement will be more important as the number of chars grows. since the most of the test cost, because we use the whole template, is not just in the replacement itself (which is ~12% of the total cost, for the 32 bytes case)

This comment has been minimized.

Copy link
Contributor

@mkouba mkouba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@franz1981 Could you pls squash the commits before we merge?

@franz1981
Copy link
Contributor Author

Done @mkouba !

This comment has been minimized.

Copy link

quarkus-bot bot commented Feb 10, 2025

Status for workflow Quarkus CI

This is the status report for running Quarkus CI on commit b57c128.

Failing Jobs

Status Name Step Failures Logs Raw logs Build scan
JVM Tests - JDK 17 Download previously uploaded .m2 content ⚠️ Check → Logs Raw logs 🚧
✔️ JVM Tests - JDK 21 Logs Raw logs 🚧

@mkouba mkouba merged commit f0a5533 into quarkusio:main Feb 10, 2025
52 of 53 checks passed
@quarkus-bot quarkus-bot bot added this to the 3.19 - main milestone Feb 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/devtools Issues/PR related to maven, gradle, platform and cli tooling/plugins area/documentation area/qute The template engine triage/flaky-test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants