Skip to content

Conversation

@JonasNorlinder
Copy link
Member

@JonasNorlinder JonasNorlinder commented Dec 15, 2025

GCTraceCPUTime is used by Serial, Parallel and G1 to estimate GC CPU time spent during safepoints. It estimates CPU time by sampling the CPU time for entire process by using os::getTimesSecs.

This will not be correct if native code from e.g. JNI is used and is incurring a high load on the machine as native code is not paused during safepoints.

A minimal example:

public static void main(String[] args) {
    // Initial GC that clean up start-up garbage etc.
    System.gc();

    for (int i = 1; i <= 128; i++) {
        addLoad(); // Creates 128 threads, that spins in a loop
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {}
        System.gc();
    }
    System.exit(0);
}

where addLoad() is a native call to a C++ program that create 128 pthreads that busy-spin forever with the largest nice value allowed.

In such scenario we can observe using G1

[0.026s][info][gc,cpu] GC(0) User=0.00s Sys=0.00s Real=0.01s
[2.103s][info][gc,cpu] GC(1) User=0.89s Sys=0.00s Real=0.03s
[4.129s][info][gc,cpu] GC(2) User=0.51s Sys=0.00s Real=0.01s
[6.195s][info][gc,cpu] GC(3) User=1.72s Sys=0.00s Real=0.06s
[8.244s][info][gc,cpu] GC(4) User=1.46s Sys=0.00s Real=0.05s
...
[201.191s][info][gc,cpu] GC(76) User=37.91s Sys=0.00s Real=1.19s
[205.064s][info][gc,cpu] GC(77) User=52.44s Sys=0.00s Real=1.64s
[208.068s][info][gc,cpu] GC(78) User=23.49s Sys=0.00s Real=0.74s
[211.795s][info][gc,cpu] GC(79) User=46.96s Sys=0.00s Real=1.47s

Since the Java code does not do anything the time should be close to 0.

Implementation Comment

Total CPU time for GC operations on the VM thread is obtained by cumulatively adding to a counter. Since system time is calculated (on Linux) with total - user is value contains a small error delta. If we continuously increment a value this error would grow to an unacceptable level. Therefore, the implementation will rely on the fact that no other VM operation may run during GC safepoint, making it safe to query the VM thread for it's detailed CPU time at the start and end of GC pause.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8361716: GCTraceCPUTime may report incorrect times during high load from JNI code (Bug - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28824/head:pull/28824
$ git checkout pull/28824

Update a local copy of the PR:
$ git checkout pull/28824
$ git pull https://git.openjdk.org/jdk.git pull/28824/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 28824

View PR using the GUI difftool:
$ git pr show -t 28824

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28824.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Dec 15, 2025

👋 Welcome back jnorlinder! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@JonasNorlinder
Copy link
Member Author

/label add hotspot-gc hotspot-runtime

@openjdk
Copy link

openjdk bot commented Dec 15, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot added hotspot-gc hotspot-gc-dev@openjdk.org hotspot-runtime hotspot-runtime-dev@openjdk.org labels Dec 15, 2025
@openjdk
Copy link

openjdk bot commented Dec 15, 2025

@JonasNorlinder
The hotspot-gc label was successfully added.

The hotspot-runtime label was successfully added.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Dec 15, 2025
@openjdk
Copy link

openjdk bot commented Dec 15, 2025

@JonasNorlinder The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@JonasNorlinder
Copy link
Member Author

/label remove hotspot

@openjdk openjdk bot removed the hotspot hotspot-dev@openjdk.org label Dec 15, 2025
@openjdk
Copy link

openjdk bot commented Dec 15, 2025

@JonasNorlinder
The hotspot label was successfully removed.

@JonasNorlinder JonasNorlinder marked this pull request as ready for review December 15, 2025 14:48
@openjdk openjdk bot added the rfr Pull request is ready for review label Dec 15, 2025
@mlbridge
Copy link

mlbridge bot commented Dec 15, 2025

@mlbridge
Copy link

mlbridge bot commented Dec 15, 2025

Mailing list message from Kirk Pepperdine on hotspot-gc-dev:

Hi Jonas,

I believe the assumption of ?close to 0? is based on no other processes running on that machine, not just native JNI code. In a ?clean? system I have seen that these numbers are correct enough. The interesting thing about these numbers is the tell that they give you. For example, it appears as if you are running on a 32 core machine. Additionally, there are system level disturbances that are seen with an analysis of these measures. I?m sure enabling this analysis wasn?t intentional but it can be quite useful.

Kind regards,
Kirk

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 1515 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20251215/3c29c5ce/smime.p7s>

@@ -1,5 +1,5 @@
/*
* Copyright (c) 1999, 2024, Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 1999, 2025, Oracle and/or its affiliates. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copyright change seems to be the only change in this file, and should be reverted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File was recently edited but didn't update copyright year. I can file a separate JBS and PR for that line if you prefer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's okay.

Comment on lines 170 to 173
struct CPUTime {
jlong user;
jlong system;
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fwiw, jlong is a "Java" long, so maybe some C type should be used here. I am aware that the OS interface traditionally uses it, but changes need to start somewhere. However the runtime team may have additional comments about this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think such change should start at the source i.e. os::thread_cpu_time, but will await what runtime team says...

@mlbridge
Copy link

mlbridge bot commented Dec 16, 2025

Mailing list message from Jonas Norlinder on hotspot-gc-dev:

Hi Kirk,

That's an interesting point you bring up. While I agree that the current implementation may leak information that is useful for detecting system saturation, I'm not sure that means we can keep this implementation. The semantic contract of the [gc, cpu] log tag is to report GC CPU time. By including process wide noise we violate that contract. We will still report system and user time just more accurately so.

For diagnosing system-level distances we have JFR (jdk.CPUload) and JMX (OperatingSystemMXBean). We should probably rely on these tools rather than using GC logs to carry a noisy signal.

Best,
Jonas

On 2025-12-15, 22:51, "Kirk Pepperdine" <kirk.pepperdine at gmail.com> wrote:

Hi Jonas,

I believe the assumption of ?close to 0? is based on no other processes running on that machine, not just native JNI code. In a ?clean? system I have seen that these numbers are correct enough. The interesting thing about these numbers is the tell that they give you. For example, it appears as if you are running on a 32 core machine. Additionally, there are system level disturbances that are seen with an analysis of these measures. I?m sure enabling this analysis wasn?t intentional but it can be quite useful.

Kind regards,
Kirk

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20251216/c860cd5e/attachment.htm>

@JonasNorlinder
Copy link
Member Author

Thanks for the review Thomas! Also fixed some comments from @kstefanj.

Copy link
Contributor

@kstefanj kstefanj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in general, some more comments.

Comment on lines 170 to 175
struct cpu_time {
jlong user;
jlong system;
};
typedef struct cpu_time cpu_time_t;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked a bit closer at this now, I could make sense to use a more common Hotspot type here since this is nothing that we expect the JDK will use. I think that is the main reason we use jlong in os::thread_cpu_time().

The new detailed_thread_cpu_time() is a Hotspot API and could then use Hotspot types, like uint64_t. But then we likely want to change this in the users of this class as well, for example DetailedCPUTimeThreadClosure, and I'm not sure how big the fan-out would be. Maybe we can change the type as a follow-up if more people agree with this.

@JonasNorlinder
Copy link
Member Author

Thanks for the review Stefan!

Copy link
Contributor

@kstefanj kstefanj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of more comments.

Comment on lines 112 to 116
CPUTime CPUTimeUsage::GC::detailed_stringdedup() {
if (UseStringDeduplication) {
return detailed_thread_cpu_time_or_zero((Thread*)StringDedup::_processor->_thread);
}
return { 0, 0 };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have turned this around, but that's just taste, so skip if you prefer as is:

Suggested change
CPUTime CPUTimeUsage::GC::detailed_stringdedup() {
if (UseStringDeduplication) {
return detailed_thread_cpu_time_or_zero((Thread*)StringDedup::_processor->_thread);
}
return { 0, 0 };
CPUTime CPUTimeUsage::GC::detailed_stringdedup() {
if (!UseStringDeduplication) {
return {0, 0};
}
return detailed_thread_cpu_time_or_zero((Thread*)StringDedup::_processor->_thread);

@@ -1,5 +1,5 @@
/*
* Copyright (c) 1999, 2024, Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 1999, 2025, Oracle and/or its affiliates. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's okay.

Comment on lines 170 to 173
class CPUTime {
public:
int64_t user;
int64_t system;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If everything is public in this class, consider using a struct instead.

Also per style guidelines, even public class/struct members should use the underscore prefix.

@openjdk openjdk bot removed the rfr Pull request is ready for review label Dec 18, 2025
@openjdk openjdk bot added the rfr Pull request is ready for review label Dec 18, 2025
@JonasNorlinder
Copy link
Member Author

JonasNorlinder commented Dec 18, 2025

Huge thanks to @kstefanj that found that values could appear as <0 where GC runs very frequent. I tracked the error down due to the technical limitations of how we can get system time on Linux for a thread. Since it requires two calls and taking the difference (total - user), there is a risk that these are out of sync. This can be resolved by using getrusage which values are in sync.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot-gc hotspot-gc-dev@openjdk.org hotspot-runtime hotspot-runtime-dev@openjdk.org rfr Pull request is ready for review

Development

Successfully merging this pull request may close these issues.

3 participants