-
Notifications
You must be signed in to change notification settings - Fork 6.2k
8361716 : GCTraceCPUTime may report incorrect times during high load from JNI code #28824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
👋 Welcome back jnorlinder! A progress list of the required criteria for merging this PR into |
|
/label add hotspot-gc hotspot-runtime |
|
❗ This change is not yet ready to be integrated. |
|
@JonasNorlinder The |
|
@JonasNorlinder The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
|
/label remove hotspot |
|
@JonasNorlinder |
Webrevs
|
|
Mailing list message from Kirk Pepperdine on hotspot-gc-dev: Hi Jonas, I believe the assumption of ?close to 0? is based on no other processes running on that machine, not just native JNI code. In a ?clean? system I have seen that these numbers are correct enough. The interesting thing about these numbers is the tell that they give you. For example, it appears as if you are running on a 32 core machine. Additionally, there are system level disturbances that are seen with an analysis of these measures. I?m sure enabling this analysis wasn?t intentional but it can be quite useful. Kind regards, -------------- next part -------------- |
| @@ -1,5 +1,5 @@ | |||
| /* | |||
| * Copyright (c) 1999, 2024, Oracle and/or its affiliates. All rights reserved. | |||
| * Copyright (c) 1999, 2025, Oracle and/or its affiliates. All rights reserved. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The copyright change seems to be the only change in this file, and should be reverted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
File was recently edited but didn't update copyright year. I can file a separate JBS and PR for that line if you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's okay.
src/hotspot/share/runtime/os.hpp
Outdated
| struct CPUTime { | ||
| jlong user; | ||
| jlong system; | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fwiw, jlong is a "Java" long, so maybe some C type should be used here. I am aware that the OS interface traditionally uses it, but changes need to start somewhere. However the runtime team may have additional comments about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think such change should start at the source i.e. os::thread_cpu_time, but will await what runtime team says...
|
Mailing list message from Jonas Norlinder on hotspot-gc-dev: Hi Kirk, That's an interesting point you bring up. While I agree that the current implementation may leak information that is useful for detecting system saturation, I'm not sure that means we can keep this implementation. The semantic contract of the [gc, cpu] log tag is to report GC CPU time. By including process wide noise we violate that contract. We will still report system and user time just more accurately so. For diagnosing system-level distances we have JFR (jdk.CPUload) and JMX (OperatingSystemMXBean). We should probably rely on these tools rather than using GC logs to carry a noisy signal. Best, On 2025-12-15, 22:51, "Kirk Pepperdine" <kirk.pepperdine at gmail.com> wrote: Hi Jonas, I believe the assumption of ?close to 0? is based on no other processes running on that machine, not just native JNI code. In a ?clean? system I have seen that these numbers are correct enough. The interesting thing about these numbers is the tell that they give you. For example, it appears as if you are running on a 32 core machine. Additionally, there are system level disturbances that are seen with an analysis of these measures. I?m sure enabling this analysis wasn?t intentional but it can be quite useful. Kind regards, -------------- next part -------------- |
|
Thanks for the review Thomas! Also fixed some comments from @kstefanj. |
kstefanj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good in general, some more comments.
src/hotspot/share/runtime/os.hpp
Outdated
| struct cpu_time { | ||
| jlong user; | ||
| jlong system; | ||
| }; | ||
| typedef struct cpu_time cpu_time_t; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked a bit closer at this now, I could make sense to use a more common Hotspot type here since this is nothing that we expect the JDK will use. I think that is the main reason we use jlong in os::thread_cpu_time().
The new detailed_thread_cpu_time() is a Hotspot API and could then use Hotspot types, like uint64_t. But then we likely want to change this in the users of this class as well, for example DetailedCPUTimeThreadClosure, and I'm not sure how big the fan-out would be. Maybe we can change the type as a follow-up if more people agree with this.
|
Thanks for the review Stefan! |
kstefanj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of more comments.
| CPUTime CPUTimeUsage::GC::detailed_stringdedup() { | ||
| if (UseStringDeduplication) { | ||
| return detailed_thread_cpu_time_or_zero((Thread*)StringDedup::_processor->_thread); | ||
| } | ||
| return { 0, 0 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have turned this around, but that's just taste, so skip if you prefer as is:
| CPUTime CPUTimeUsage::GC::detailed_stringdedup() { | |
| if (UseStringDeduplication) { | |
| return detailed_thread_cpu_time_or_zero((Thread*)StringDedup::_processor->_thread); | |
| } | |
| return { 0, 0 }; | |
| CPUTime CPUTimeUsage::GC::detailed_stringdedup() { | |
| if (!UseStringDeduplication) { | |
| return {0, 0}; | |
| } | |
| return detailed_thread_cpu_time_or_zero((Thread*)StringDedup::_processor->_thread); |
| @@ -1,5 +1,5 @@ | |||
| /* | |||
| * Copyright (c) 1999, 2024, Oracle and/or its affiliates. All rights reserved. | |||
| * Copyright (c) 1999, 2025, Oracle and/or its affiliates. All rights reserved. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's okay.
src/hotspot/share/runtime/os.hpp
Outdated
| class CPUTime { | ||
| public: | ||
| int64_t user; | ||
| int64_t system; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If everything is public in this class, consider using a struct instead.
Also per style guidelines, even public class/struct members should use the underscore prefix.
|
Huge thanks to @kstefanj that found that values could appear as <0 where GC runs very frequent. I tracked the error down due to the technical limitations of how we can get system time on Linux for a thread. Since it requires two calls and taking the difference (total - user), there is a risk that these are out of sync. This can be resolved by using getrusage which values are in sync. |
GCTraceCPUTime is used by Serial, Parallel and G1 to estimate GC CPU time spent during safepoints. It estimates CPU time by sampling the CPU time for entire process by using
os::getTimesSecs.This will not be correct if native code from e.g. JNI is used and is incurring a high load on the machine as native code is not paused during safepoints.
A minimal example:
where addLoad() is a native call to a C++ program that create 128 pthreads that busy-spin forever with the largest nice value allowed.
In such scenario we can observe using G1
Since the Java code does not do anything the time should be close to 0.
Implementation Comment
Total CPU time for GC operations on the VM thread is obtained by cumulatively adding to a counter. Since system time is calculated (on Linux) with
total - useris value contains a small error delta. If we continuously increment a value this error would grow to an unacceptable level. Therefore, the implementation will rely on the fact that no other VM operation may run during GC safepoint, making it safe to query the VM thread for it's detailed CPU time at the start and end of GC pause.Progress
Issue
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28824/head:pull/28824$ git checkout pull/28824Update a local copy of the PR:
$ git checkout pull/28824$ git pull https://git.openjdk.org/jdk.git pull/28824/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 28824View PR using the GUI difftool:
$ git pr show -t 28824Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28824.diff
Using Webrev
Link to Webrev Comment