High CPU usage of method handle invocations in Jetty 10 #6328

Closed
SerCeMan opened this issue May 27, 2021 · 24 comments

Labels: Stale (for auto-closed stale issues and pull requests)

Comments

@SerCeMan
Contributor

Jetty version

10.0.3

Java version/vendor (use: java -version)

openjdk version "13.0.4" 2020-07-14
OpenJDK Runtime Environment Zulu13.33+25-CA (build 13.0.4+8-MTS)
OpenJDK 64-Bit Server VM Zulu13.33+25-CA (build 13.0.4+8-MTS, mixed mode, sharing)

OS type/version

Ubuntu 18.04.5 LTS

Description

Hi, Jetty maintainers!

We've recently attempted a migration from Jetty 9 to Jetty 10 and noticed a regression related to WebSockets. According to our metrics, there appears to be a memory leak, which I'm still investigating and hope to provide more information on soon. It also seems that Jetty 10 spends a large amount of CPU resolving lambda forms to perform method handle invocations. On the flame graphs, we saw a large number of lambda forms: every tiny green tower on the flame graph is a separate instance of LambdaForm. My working assumption, which I'm currently investigating and hope to back with more data soon, is that the large number of lambda forms filled up the Java heap.

CPU flame graph:
Screen Shot 2021-05-27 at 2 23 30 pm

Allocation flame graph:
Screen Shot 2021-05-27 at 2 24 17 pm

I was wondering if you've observed this behaviour before or might know what could be causing it. I'm continuing to investigate and will provide more info once I have it. Thanks!

@sbordet
Contributor

sbordet commented May 31, 2021

@SerCeMan we do see the tiny green towers in a quick load test that we have written to try to reproduce this issue.

We would like to know:

  • Are you using annotated WebSocket endpoints, or do you implement WebSocketListener?
  • I'm assuming your messages all have different sizes, or is your flame graph the result of some load test with fixed-size messages?
  • While we do see the tiny green towers, they account for less than 1% of CPU time. What is your figure of CPU time?

From our point of view, we create one MethodHandle to call your WebSocket endpoint -- nothing fancy.
However, at runtime invoking this MethodHandle apparently creates a different lambda form for every invocation (not sure about this though), which causes the many tiny green towers and possibly fills up the Metaspace with lambda forms.
If that is the case, it seems like a problem in the implementation of MethodHandles, and as such an OpenJDK bug.

Let us know the result of your findings. Meanwhile we will investigate as well.

@lachlan-roberts
Contributor

@SerCeMan @sbordet I have done some testing and understand a bit more why this is happening.

For each WebSocket endpoint we get MethodHandles for the relevant methods (onOpen, onMessage, etc.). Then for each new WebSocket connection we bind these MethodHandles to things like the endpoint and session instances. This creates new MethodHandles for each connection, which show up as different method calls on the flame graph. I don't think this is a performance issue, just an issue with how it is displayed on the flame graph.
See https://github.com/lachlan-roberts/MethodHandlesExample/blob/master/src/test/java/test/MethodHandleExample.java
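
For readers following the thread, here is a minimal, self-contained sketch of the binding pattern described above. It is not Jetty's actual code; the MyListener class and its onMessage method are placeholders. The handle is resolved once for the endpoint class and then bound per connection, so every connection gets its own MethodHandle instance.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class BindPerConnectionSketch
{
    // Placeholder endpoint with an onMessage(String) callback.
    public static class MyListener
    {
        public void onMessage(String message)
        {
            System.out.println("received: " + message);
        }
    }

    public static void main(String[] args) throws Throwable
    {
        // Resolved once per endpoint class.
        MethodHandle onMessage = MethodHandles.lookup().findVirtual(
            MyListener.class, "onMessage", MethodType.methodType(void.class, String.class));

        // Done per connection: bind the resolved handle to that connection's endpoint instance.
        // Each bindTo() call returns a distinct MethodHandle object, which is why the profiler
        // shows many separate LambdaForm frames instead of one.
        MethodHandle boundToConnection = onMessage.bindTo(new MyListener());
        boundToConnection.invoke("hello");
    }
}
```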

However, I did not see much time spent in the Invokers.checkCustomized() branch.
@SerCeMan what is the signature of the onMessage() method you are using for your WebSocket endpoint?

@sbordet
Contributor

sbordet commented Jun 2, 2021

Then for each new WebSocket connection we bind these MethodHandles to things like the endpoint and session instances.

Do you need to do this? I am referring to the binding, because if we have 1M connections we would have 1M different MethodHandles, while I'm assuming that if you don't bind, we would only have 1 MethodHandle, no?

@lachlan-roberts
Contributor

Do you need to do this? I am referring to the binding, because if we have 1M connections we would have 1M different MethodHandles, while I'm assuming that if you don't bind, we would only have 1 MethodHandle, no?

@sbordet We don't need to bind the MethodHandle; you can invoke the original MethodHandle, providing all the arguments every time. We would just need to remember the endpoint and session everywhere we invoke the MethodHandle.

In the project I linked there is a benchmark comparing the bound approach to using only one MethodHandle and providing all the arguments each time. Interestingly, the benchmark showed the case where we use methodHandle.bindTo(endpoint) as about 10 times faster. So we might not want to change the code to use only one MethodHandle if it is going to impact performance.
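
As a rough illustration of the two strategies being compared (again a sketch with a placeholder Endpoint type, not the benchmark code itself): the bound variant carries the receiver inside the handle, while the unbound variant shares one handle and passes the endpoint as an ordinary first argument on every call.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class BoundVsUnboundSketch
{
    public static class Endpoint
    {
        public void onMessage(String message) { /* handle the message */ }
    }

    public static void main(String[] args) throws Throwable
    {
        MethodHandle onMessage = MethodHandles.lookup().findVirtual(
            Endpoint.class, "onMessage", MethodType.methodType(void.class, String.class));
        Endpoint endpoint = new Endpoint();

        // Bound strategy: a per-connection handle with the receiver already attached.
        MethodHandle bound = onMessage.bindTo(endpoint);
        bound.invokeExact("hello");                 // handle type is (String)void

        // Unbound strategy: one shared handle; the receiver is passed on every invocation.
        onMessage.invokeExact(endpoint, "hello");   // handle type is (Endpoint,String)void
    }
}
```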

@gregw
Contributor

gregw commented Jun 3, 2021

Perhaps we should raise an issue on whatever software is producing the flame graph. Binding a method handle is a normal thing to do and, as @lachlan-roberts' benchmark shows, it is the right thing to do.
The tools should be smart enough to know that bound MethodHandles are the same class, differing only by data.

@gregw
Contributor

gregw commented Jun 3, 2021

@lachlan-roberts can you paste the benchmark report into a comment on this issue?

@SerCeMan
Contributor Author

SerCeMan commented Jun 3, 2021

Hey, folks! Sorry for the delayed response. I'll try to prepare answers later today. I'm still trying to reproduce the issue in a test environment; no luck yet, but I suspect that it might be related to the Shenandoah GC that we're using. I'm still working on a test case that can reproduce it.

@lachlan-roberts
Contributor

Benchmark Results:

Benchmark                        (STRATEGY)   Mode  Cnt     Score      Error  Units
MethodHandlesBenchmark.test    BOUND_INVOKE  thrpt    3  8055.479 ± 3022.922  ops/s
MethodHandlesBenchmark.test  UNBOUND_INVOKE  thrpt    3   667.597 ± 3597.695  ops/s

@SerCeMan
Contributor Author

SerCeMan commented Jun 7, 2021

Sorry for the delay. To answer the questions above,

Are you using annotated WebSocket endpoints, or do you implement WebSocketListener?

We implement WebSocketListener.

I'm assuming your messages all have different sizes, or is your flame graph the result of some load test with fixed-size messages?

Yes, messages have a wide range of sizes, from a few bytes up to 5 MB.

While we do see the tiny green towers, they account for less than 1% of CPU time. What is your figure of CPU time?

Because most of the towers are different lambda forms, their CPU consumption varies. However, the unexpected part was that Invokers.checkCustomized is responsible for 30% of the total CPU time spent in message processing.
Screen Shot 2021-06-07 at 6 42 33 pm

I'll get back to you once I'm able to reproduce the issue in some test environment.
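
For context on why Invokers.checkCustomized can appear on a profile at all, here is a hypothetical, simplified reproduction of the usage pattern discussed in this thread (placeholder Endpoint class, arbitrary counts): many distinct bound MethodHandle instances, each invoked repeatedly through the generic, non-constant invocation path. This is only a sketch of the suspected mechanism, not a confirmed reproducer of the regression.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.ArrayList;
import java.util.List;

public class ManyBoundHandlesSketch
{
    public static class Endpoint
    {
        long total;

        public void onMessage(String message)
        {
            total += message.length();
        }
    }

    public static void main(String[] args) throws Throwable
    {
        MethodHandle onMessage = MethodHandles.lookup().findVirtual(
            Endpoint.class, "onMessage", MethodType.methodType(void.class, String.class));

        // Simulate many connections, each holding its own bound MethodHandle instance.
        List<MethodHandle> perConnection = new ArrayList<>();
        for (int i = 0; i < 10_000; i++)
            perConnection.add(onMessage.bindTo(new Endpoint()));

        // None of these handles is a compile-time constant, so each invocation goes through
        // the generic invoker path (where Invokers.checkCustomized shows up in profiles), and
        // handles invoked often enough may each be given their own customized LambdaForm.
        for (MethodHandle handle : perConnection)
            for (int i = 0; i < 200; i++)
                handle.invoke("ping");
    }
}
```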

@SerCeMan
Contributor Author

SerCeMan commented Jun 7, 2021

Hey, @lachlan-roberts! Regarding the benchmarks, please correct me if I'm wrong, but it seems that the MethodHandle can be stored in a final variable, which makes the results equal within the margin of error; see lachlan-roberts/MethodHandlesExample#1.

Benchmark                        (STRATEGY)   Mode  Cnt     Score      Error  Units
MethodHandlesBenchmark.test    BOUND_INVOKE  thrpt    3  5909.462 ± 2128.965  ops/s
MethodHandlesBenchmark.test  UNBOUND_INVOKE  thrpt    3  6540.706 ± 2714.425  ops/s
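
For completeness, here is a sketch of the shape the equalized variant suggests (placeholder names, not the Jetty or benchmark code): a single shared, unbound MethodHandle held in a static final field, with the endpoint passed as an argument, so the JIT can treat the handle as a constant.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class FinalHandleSketch
{
    public static class Endpoint
    {
        public void onMessage(String message) { /* handle the message */ }
    }

    // A static final MethodHandle is a true constant for the JIT, so calls through it can be
    // inlined rather than going through the generic invoker path.
    private static final MethodHandle ON_MESSAGE;

    static
    {
        try
        {
            ON_MESSAGE = MethodHandles.lookup().findVirtual(
                Endpoint.class, "onMessage", MethodType.methodType(void.class, String.class));
        }
        catch (ReflectiveOperationException e)
        {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void dispatch(Endpoint endpoint, String message) throws Throwable
    {
        // One shared, unbound handle: the receiver is an ordinary argument.
        ON_MESSAGE.invokeExact(endpoint, message);
    }

    public static void main(String[] args) throws Throwable
    {
        dispatch(new Endpoint(), "hello");
    }
}
```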

@lachlan-roberts
Contributor

I can confirm that making the MethodHandle a final variable eliminates the performance difference I was seeing in the benchmarks.

@joakime
Contributor

joakime commented Nov 2, 2021

@lachlan-roberts is this still reproducible on current Jetty 10.0.x and/or 11.0.x HEAD?

@SerCeMan
Contributor Author

SerCeMan commented Nov 2, 2021

I haven't been able to reproduce it in a synthetic environment; I stumbled across #6696, and now that 10.0.7 is out, I'm planning to attempt the upgrade again. Because I can't reproduce it in a synthetic environment, I can close the issue and re-open it with additional information if it manifests again.

@lachlan-roberts
Contributor

The separate spikes on the flame graph are reproducible, but I think it is really just an interaction between the profiler and MethodHandles, and it is unlikely to be causing any performance degradation.

I am still planning to do a PR to use only final unbound MethodHandles to see if it is any better, but I haven't gotten around to it yet, so I would leave this issue open for that reason.

There have been a number of PRs to improve performance in Jetty 10 since 10.0.3, so if you update, you may no longer experience this performance regression. For example, PR #6635 was designed to reduce the allocation of buffers for whole-message aggregation and also to reduce the number of data copies.

@SerCeMan
Contributor Author

SerCeMan commented Mar 8, 2022

Hey, @lachlan-roberts and the team! Would you accept a PR that replaces the large number of method handles with a single one, considering that the benchmarks above show no negative performance impact? After attempting an upgrade to 10.0.8, we still see a large amount of CPU time spent resolving method handles.

Screen Shot 2022-03-08 at 12 52 09 pm

@lachlan-roberts
Contributor

@SerCeMan I think this could be difficult to implement, even more so if you are not already familiar with the Jetty WebSocket implementation. I will not have time to attempt this for a few weeks.

Can you attach the full flame graph file instead of just the screenshot? Also, if you have some reproducer code that can trigger this checkCustomized branch, it would be good to see that as well.

lachlan-roberts added a commit that referenced this issue Jun 3, 2022
Signed-off-by: Lachlan Roberts <lachlan@webtide.com>
lachlan-roberts added a commit that referenced this issue Jun 6, 2022
Signed-off-by: Lachlan Roberts <lachlan@webtide.com>
lachlan-roberts added a commit that referenced this issue Jun 6, 2022
… algorithm

Signed-off-by: Lachlan Roberts <lachlan@webtide.com>
lachlan-roberts added a commit that referenced this issue Jun 8, 2022
Signed-off-by: Lachlan Roberts <lachlan@webtide.com>
@github-actions

github-actions bot commented Mar 9, 2023

This issue has been automatically marked as stale because it has been a
full year without activity. It will be closed if no further activity occurs.
Thank you for your contributions.

github-actions bot added the Stale label Mar 9, 2023
sbordet removed the Stale label Mar 13, 2023
@gregw
Contributor

gregw commented Jul 5, 2023

@SerCeMan What is the current status of this issue for you? Are you still seeing high CPU? Our PR to address this had a significant performance impact, so it was never merged.

@SerCeMan
Contributor Author

SerCeMan commented Jul 5, 2023

Thanks, @gregw! Apologies for not providing an update sooner. The high CPU usage from resolving method handles (the yellow part of the flame graph) seemed to be caused by a specific combination of GC settings (Shenandoah GC) and the JVM version we were using at the time (13). A series of JDK upgrades resolved it, although I'm unsure which version exactly; most likely it was the transition to 17.

There is still an issue with not being able to use async-profiler, but it's more of a nice-to-have observability feature, considering that it is theoretically possible to employ async-profiler with some extra pre/post stack processing.

@gregw
Contributor

gregw commented Jul 5, 2023

@lachlan-roberts So it seems this is not needed so much now...
However, it is still non-optimal that the AsyncProfiler produces many towers instead of one. Did you get any joy raising this with the flame graph folks? Is there some post-processing that can be applied to merge those peaks?
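
One possible form such post-processing could take, sketched below under the assumption that the profiler emits collapsed stacks (semicolon-separated frames followed by a sample count, as async-profiler's collapsed output does) and that the generated frames contain java.lang.invoke.LambdaForm$ in their names; the exact frame names vary by profiler and JDK, so the match would need adjusting. The idea is simply to rewrite every generated LambdaForm frame to one stable name so the flame graph aggregates the tiny towers.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Rewrites a collapsed-stacks file so that generated LambdaForm/MethodHandle frames are folded
// into a single synthetic frame name, letting the flame graph merge the many tiny towers.
public class MergeLambdaFormFrames
{
    public static void main(String[] args) throws IOException
    {
        Path in = Path.of(args[0]);
        Path out = Path.of(args[1]);

        List<String> merged = Files.readAllLines(in).stream()
            .map(MergeLambdaFormFrames::normalize)
            .collect(Collectors.toList());
        Files.write(out, merged);
    }

    private static String normalize(String line)
    {
        int lastSpace = line.lastIndexOf(' ');
        if (lastSpace < 0)
            return line;
        String stack = line.substring(0, lastSpace);
        String count = line.substring(lastSpace + 1);

        // Collapse every generated LambdaForm frame into one stable name so that otherwise
        // identical stacks aggregate together instead of forming separate towers.
        String normalized = Arrays.stream(stack.split(";"))
            .map(f -> f.contains("java.lang.invoke.LambdaForm$") ? "java.lang.invoke.LambdaForm$*" : f)
            .collect(Collectors.joining(";"));
        return normalized + " " + count;
    }
}
```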

@lachlan-roberts
Contributor

@gregw we did open an issue with them at some stage, but they said it was the expected behaviour with method handles and flagged it as not an issue.

@github-actions

This issue has been automatically marked as stale because it has been a
full year without activity. It will be closed if no further activity occurs.
Thank you for your contributions.

@github-actions github-actions bot added the Stale For auto-closed stale issues and pull requests label Jul 14, 2024
@github-actions

This issue has been closed due to it having no activity.

lachlan-roberts added a commit that referenced this issue Aug 21, 2024
Signed-off-by: Lachlan Roberts <lachlan.p.roberts@gmail.com>
lachlan-roberts added a commit that referenced this issue Aug 26, 2024
Issue #6328 - avoid binding WebSocket MethodHandles
