Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8267926: AsyncLogGtest.java fails on assert with: decorator was not part of the decorator set specified at creation. #4257

Closed
wants to merge 10 commits into from

Conversation

navyxliu
Copy link
Member

@navyxliu navyxliu commented May 28, 2021

The root cause of the intermittent failure is that _decorators in LogDecorations
may be inconsistent with LogOutput::_decorators. It could happen when gtest disables a
Log output via set_log_config(TestLogFileName, "all=off").

Since we copy the entire logDecorations, it's reasonable to copy _decorator as well.
LogDecorator is a bitmask of uint. It's even smaller than a reference on LP64 platforms.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8267926: AsyncLogGtest.java fails on assert with: decorator was not part of the decorator set specified at creation.

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/4257/head:pull/4257
$ git checkout pull/4257

Update a local copy of the PR:
$ git checkout pull/4257
$ git pull https://git.openjdk.java.net/jdk pull/4257/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 4257

View PR using the GUI difftool:
$ git pr show -t 4257

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/4257.diff

…art of the decorator set specified at creation.
@bridgekeeper
Copy link

bridgekeeper bot commented May 28, 2021

👋 Welcome back xliu! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label May 28, 2021
@openjdk
Copy link

openjdk bot commented May 28, 2021

@navyxliu The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label May 28, 2021
@mlbridge
Copy link

mlbridge bot commented May 28, 2021

@navyxliu
Copy link
Member Author

Test:
I simulate the race condition in the JBS. it should work.

I personally inspect all gtests, only this line make _decorators bigger than before.
We haven't added async logging support for stdout/stderr yet. Therefore, it's safe, so far.

./logging/test_logConfiguration.cpp:222:  set_log_config("stderr", "all=off", _all_decorators);

I will consider this case when I enable stdout/stderr of async logging.

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is debug only this seems a reasonable workaround.

Thanks,
David

@openjdk
Copy link

openjdk bot commented May 30, 2021

@navyxliu This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8267926: AsyncLogGtest.java fails on assert with: decorator was not part of the decorator set specified at creation.

Reviewed-by: dholmes, phh

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 86 new commits pushed to the master branch:

  • 9fc914b: 8204686: Dynamic parallel reference processing support for Parallel GC
  • 908aca2: 8262891: Compiler implementation for Pattern Matching for switch (Preview)
  • 3e48244: 8268301: Closed test: compiler/c2/6371167/Test.java fails after JDK-8267904
  • 204b492: 8267703: runtime/cds/appcds/cacheObject/HeapFragmentationTest.java crashed with OutOfMemory
  • 2aeeeb4: 8268279: gc/shenandoah/compiler/TestLinkToNativeRBP.java fails after LibraryLookup is gone
  • b05fa02: 8267904: C2 crash when compile negative Arrays.copyOf length after loop
  • 95ddf7d: 8267839: trivial mem leak in numa
  • 52d88ee: 8268292: compiler/intrinsics/VectorizedMismatchTest.java fails with release VMs
  • 042f0bd: 8256465: [macos] Java frame and dialog presented full screen freeze application
  • 8abf36c: 8268289: build failure due to missing signed flag in x86 evcmpb instruction
  • ... and 76 more: https://git.openjdk.java.net/jdk/compare/7530c00b33aac8918841dbae4d928956b60c261f...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@dholmes-ora, @phohensee) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label May 30, 2021
…art of the decorator set specified at creation.
@navyxliu
Copy link
Member Author

Hi, @dholmes-ora,

I recently realize how flexible unified logging is(JDK-8267952). By design, LogConfiguration::configure_output is MT-safe. Clearly, authors of unified logging put a lot of thoughts on it. I try to write down my understanding in the form of comments. Async logging support can be a natural extension of current synchronous logging. All we need to do is to eliminate two hazards I mentioned.

I also add a concurrent test to prove it works. By reasoning how it works and the test, I discover two places which could cause race condition. they are also fixed in this patch.

  1. the creation of LogDecorations needs to increase reader counter first.
  2. need a storestore barrier to guarantee it works on weak consistency machines.

Could you verify this patch in Tier4 tests?
Could you also review this patch again?

thanks,
--lx

@navyxliu
Copy link
Member Author

Now the failure is irrelevant. It can't find proper toolchains on windows aarch64.

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments below - mostly typos - but I think the main race still exists.

I need more time to try and digest the other changes and understand the existing MT-safety aspects.

Meanwhile I've submitted for tiers 1-4 testing.

Thanks,
David

@@ -102,7 +102,7 @@ class AsyncLogMapIterator {
using none = LogTagSetMapping<LogTag::__NO_TAG>;

if (*counter > 0) {
LogDecorations decorations(LogLevel::Warning, none::tagset(), output->decorators());
LogDecorations decorations(LogLevel::Warning, none::tagset(), LogDecorators::All);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks odd to use LogDecorators::All - what exactly will that produce?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the out-of-band message, it just shows that "xyz messages dropped due to async logging."

Unlike normal log messages, this ad-hoc message is enqueued when buffer is flushing. The synchronization I described in LogConfiguration::configure_output only protects enqueuing log messages. This message is not under protection.Without LogDecorators::All, it will fail 'gtest:LogConfigurationTest.reconfigure_decorators_MT*' in async log mode.

// if setting has changed. It guarantees that all logs either synchronous writing or enqueuing to the async buffer
// see the new tags and decorators. It's worth noting that the synchronization happens even level doesn't change.
//
// LogDecorator is a set of decorators represented in a uint. sizeof(uinit) is not greater than a machine word,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: uinit

// After updating output's decorators, it's still safe to shrink all decorators of tagsets.
//
// There are 2 hazards in async logging. A flush operation guarantees to all pending messages in buffer are written
// before returning. Therefore, the hardards won't appear. It's a nop if async logging is not set.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: hardards

Comment on lines 278 to 279
AsyncLogWriter::flush();

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As soon as this completes, new logging requests could have been enqueued, thus restoring the hazard.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not true. Yes, new logsites could have been enqueued, but they must see the new tags and the new decorations. please note ts->set_output_level(output, level) implies a wait_until_no_readers() if anything does change.

One trick is I treat asynclog enqueuing as same as synclog writing. wait_until_no_readers also takes effect for async logging. It guarantees that all enqueuing async logs are done before the reader counter reset back to 0. If there're ongoing log requests with old tags or decorations, they all have enqueued before wait_until_no_readers() returns. that's why one flush() is enough for the two hazards.

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label May 31, 2021
@dholmes-ora
Copy link
Member

I put the latest patch through our tier 1-4 testing and it crashed in tier 3 on Linux-Aarch64:

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x0000ffff40820720, pid=3972473, tid=3972481

JRE version: Java(TM) SE Runtime Environment (17.0) (fastdebug build 17-internal+0-LTS-2021-05-31-2211099.david.holmes.jdk-dev4.git)

Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 17-internal+0-LTS-2021-05-31-2211099.david.holmes.jdk-dev4.git, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)

Problematic frame:

C [libc.so.6+0x60720] flockfile+0x0

Core dump will be written. Default location: Core dumps may be processed with "/opt/core.sh %p" (or dumping to /opt/mach5/mesos/work_dir/slaves/a4f8fba9-f017-4328-b286-c66b6a97143d-S808/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/42993029-a18a-48c4-9a4d-108aee7b3811/runs/a38508fe-5d43-48ad-b70a-61f3e712fcd8/testoutput/test-support/jtreg_open_test_hotspot_jtreg_hotspot_misc/scratch/0/core.3972473)

If you would like to submit a bug report, please visit:

https://bugreport.java.com/bugreport/crash.jsp

--------------- S U M M A R Y ------------

Command Line: -XX:+ExecutingUnitTests -Xlog:async

Host: AArch64, 6 cores, 46G, Oracle Linux Server release 8.3
Time: Mon May 31 22:37:04 2021 UTC elapsed time: 3.758723 seconds (0d 0h 0m 3s)

--------------- T H R E A D ---------------

Current thread (0x0000aaadf32b1e10): Thread "AsyncLog Thread" [stack: 0x0000fffefdaa0000,0x0000fffefdca0000] [id=3972481]

Stack: [0x0000fffefdaa0000,0x0000fffefdca0000], sp=0x0000fffefdc9e510, free space=2041k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libc.so.6+0x60720] flockfile+0x0
V [libjvm.so+0x1508058] LogFileOutput::write_blocking(LogDecorations const&, char const*)+0x38
V [libjvm.so+0x1502764] AsyncLogWriter::run()+0x2c0
V [libjvm.so+0x1a49d08] Thread::call_run()+0xf8
V [libjvm.so+0x1766f18] thread_native_entry(Thread*)+0x108
C [libpthread.so.0+0x76e0] start_thread+0x198

@dcubed-ojdk
Copy link
Member

It looks like this failure is taking a bit of work to resolve. I've gone ahead and
ProblemListed the test. At this point we have 20 sightings in Tier[3-7]. Since
this failure is intermittent, I would want to see Tier[5-7] testing done before
the fix is integrated.

@navyxliu navyxliu marked this pull request as draft June 1, 2021 22:41
@openjdk openjdk bot removed the rfr Pull request is ready for review label Jun 1, 2021
@mlbridge
Copy link

mlbridge bot commented Jun 2, 2021

Mailing list message from David Holmes on hotspot-runtime-dev:

On 1/06/2021 10:26 am, Xin Liu wrote:

On Mon, 31 May 2021 22:14:46 GMT, David Holmes <dholmes at openjdk.org> wrote:

Xin Liu has updated the pull request incrementally with one additional commit since the last revision:

Increase reader's counter before creating LogDecoration.

src/hotspot/share/logging/logAsyncWriter.cpp line 105:

103:
104: if (*counter > 0) {
105: LogDecorations decorations(LogLevel::Warning, none::tagset(), LogDecorators::All);

This looks odd to use LogDecorators::All - what exactly will that produce?

This is the out-of-band message, it just shows that "xyz messages dropped due to async logging."

Unlike normal log messages, this ad-hoc message is enqueued when buffer is flushing. The synchronization I described in `LogConfiguration::configure_output` only protects enqueuing log messages. This message is not under protection.Without LogDecorators::All, it will fail 'gtest:LogConfigurationTest.reconfigure_decorators_MT*' in async log mode.

I understand the need to not use the true decorators, but using All
sounds like it will produce a log line with an excessive number of
decorators applied:

Available log decorators:
time (t), utctime (utc), uptime (u), timemillis (tm), uptimemillis
(um), timenanos (tn), uptimenanos (un), hostname (hn), pid (p), tid
(ti), level (l), tags (tg)

will all of those appear on this log line?

Thanks,
David

@mlbridge
Copy link

mlbridge bot commented Jun 2, 2021

Mailing list message from Liu, Xin on hotspot-runtime-dev:

Hi, David,

No, logoutput won't show all decorators. It just materialize
logDecorations for All. This only happens when lost logs happen.

It's very tricky here, but it's essential to reason how the lockless
synchronization works for LogDecorations in
LogConfiguration::configure_output.

A LogDecorators a SET of decorators in bitmask. A LogDecorations
instance represents a materialized LogDecorators.

There are 2 different LogDecorations instances coexist.?
Set 1: LogOutput::_decorators, which is used to select which decorations
are output. LogFileStreamOutput::write_decorations()
Set 2: LogTagSet::_decorators, which is used to materialize LogDecorations.

In any give time, set 1 is a 'subset' of set 2. This guarantees that
write_decorations() always outputs defined decorations.

Why configure_output() is MT-safe if users allow to change decorations
dynamically? Let's say the new decorations user provided is set N.

There are 4 steps in configure_output().
1) ts->update_decorators(decorators); // set 2 = union(set 2, set N)
2) ts->set_output_level(output, level); // wait_until_no_readers()
After it, ts->log() must see the new set 2 in synchronous logging.
Some async log message with old decorations may be in the buffer. This
is the Hazard #1.
3) output->set_decorators(decorators); // set 1 = set N in atomic.
4) ts->update_decorators(LogDecorators::None); // set 2 = intersect(set
2, set N), essentially shrink the set 2 to set N.

This mechanism is intrigue. That's why I try to comment it out. I also
discover a race condition which only happens in preemptive context
switch. Please help me to review the change in LogTagSet::log().

To make current mechanism support async logging, I insert a
AsyncLogWriter::flush() before step 3). Yes, after flush() returns,
logsites can still enqueues messages in parallel, but they must seen new
set 2! It's not hazard.

thanks,
--lx

On 5/31/21 6:27 PM, David Holmes wrote:

CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.

On 1/06/2021 10:26 am, Xin Liu wrote:

On Mon, 31 May 2021 22:14:46 GMT, David Holmes <dholmes at openjdk.org> wrote:

Xin Liu has updated the pull request incrementally with one additional commit since the last revision:

Increase reader's counter before creating LogDecoration.

src/hotspot/share/logging/logAsyncWriter.cpp line 105:

103:
104: if (*counter > 0) {
105: LogDecorations decorations(LogLevel::Warning, none::tagset(), LogDecorators::All);

This looks odd to use LogDecorators::All - what exactly will that produce?

This is the out-of-band message, it just shows that "xyz messages dropped due to async logging."

Unlike normal log messages, this ad-hoc message is enqueued when buffer is flushing. The synchronization I described in `LogConfiguration::configure_output` only protects enqueuing log messages. This message is not under protection.Without LogDecorators::All, it will fail 'gtest:LogConfigurationTest.reconfigure_decorators_MT*' in async log mode.

I understand the need to not use the true decorators, but using All
sounds like it will produce a log line with an excessive number of
decorators applied:

Available log decorators:
time (t), utctime (utc), uptime (u), timemillis (tm), uptimemillis
(um), timenanos (tn), uptimenanos (un), hostname (hn), pid (p), tid
(ti), level (l), tags (tg)

will all of those appear on this log line?

Thanks,
David

@mlbridge
Copy link

mlbridge bot commented Jun 2, 2021

Mailing list message from Liu Xin on hotspot-runtime-dev:

hi, David,

This is awkward. I think I changed too many in a patch.

Do you think it is a good idea that I revert to
https://openjdk.github.io/cr/?repo=jdk&pr=4257&range=00
It is a trivial bugfix. Does it solve tier1~4 on aarch64?

If it does, we can know that this new crash is caused by the change in my
new patch.
I would recall it and rework it on aarch64.

It seems that it crashed while executing gtest in async mode. Could you
tell me which gtest it was executing?
"Command Line: -XX:+ExecutingUnitTests -Xlog:async"

thanks,
--lx

On Mon, May 31, 2021 at 10:13 PM David Holmes <dholmes at openjdk.java.net>
wrote:

@mlbridge
Copy link

mlbridge bot commented Jun 2, 2021

Mailing list message from David Holmes on hotspot-runtime-dev:

Hi Xin,

On 1/06/2021 3:48 pm, Liu Xin wrote:

hi, David,

This is awkward. I think I changed too many in a patch.

Do you think it is a good idea that I revert to
https://openjdk.github.io/cr/?repo=jdk&pr=4257&range=00
<https://openjdk.github.io/cr/?repo=jdk&pr=4257&range=00>
It is a trivial bugfix.? Does it solve tier1~4 on aarch64?

I didn't test just that patch, but will try to do so.

It might be best to revert to the simple workaround for now to fix the
test - otherwise I need to get the test problem-listed.

If it does, we can know that this new crash is caused by the change in
my new patch.
I would recall it and rework it on aarch64.

It seems that it crashed while executing gtest in async mode.?? Could
you tell me which gtest it was executing?
"Command Line: -XX:+ExecutingUnitTests -Xlog:async"

[ RUN ] LogConfigurationTest.parse_log_arguments_vm
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000ffff40820720, pid=3972473, tid=3972481

HTH,
David

@mlbridge
Copy link

mlbridge bot commented Jun 2, 2021

Mailing list message from David Holmes on hotspot-runtime-dev:

On 1/06/2021 4:00 pm, David Holmes wrote:

Hi Xin,

On 1/06/2021 3:48 pm, Liu Xin wrote:

hi, David,

This is awkward. I think I changed too many in a patch.

Do you think it is a good idea that I revert to
https://openjdk.github.io/cr/?repo=jdk&pr=4257&range=00
<https://openjdk.github.io/cr/?repo=jdk&pr=4257&range=00>
It is a trivial bugfix.? Does it solve tier1~4 on aarch64?

I didn't test just that patch, but will try to do so.

The basic change passed tiers 1-4 testing.

Thanks,
David
-----

It might be best to revert to the simple workaround for now to fix the
test - otherwise I need to get the test problem-listed.

If it does, we can know that this new crash is caused by the change in
my new patch.
I would recall it and rework it on aarch64.

It seems that it crashed while executing gtest in async mode.?? Could
you tell me which gtest it was executing?
"Command Line: -XX:+ExecutingUnitTests -Xlog:async"

[ RUN????? ] LogConfigurationTest.parse_log_arguments_vm
#
# A fatal error has been detected by the Java Runtime Environment:
#
#? SIGSEGV (0xb) at pc=0x0000ffff40820720, pid=3972473, tid=3972481

HTH,
David

1 similar comment
@mlbridge
Copy link

mlbridge bot commented Jun 2, 2021

Mailing list message from David Holmes on hotspot-runtime-dev:

On 1/06/2021 4:00 pm, David Holmes wrote:

Hi Xin,

On 1/06/2021 3:48 pm, Liu Xin wrote:

hi, David,

This is awkward. I think I changed too many in a patch.

Do you think it is a good idea that I revert to
https://openjdk.github.io/cr/?repo=jdk&pr=4257&range=00
<https://openjdk.github.io/cr/?repo=jdk&pr=4257&range=00>
It is a trivial bugfix.? Does it solve tier1~4 on aarch64?

I didn't test just that patch, but will try to do so.

The basic change passed tiers 1-4 testing.

Thanks,
David
-----

It might be best to revert to the simple workaround for now to fix the
test - otherwise I need to get the test problem-listed.

If it does, we can know that this new crash is caused by the change in
my new patch.
I would recall it and rework it on aarch64.

It seems that it crashed while executing gtest in async mode.?? Could
you tell me which gtest it was executing?
"Command Line: -XX:+ExecutingUnitTests -Xlog:async"

[ RUN????? ] LogConfigurationTest.parse_log_arguments_vm
#
# A fatal error has been detected by the Java Runtime Environment:
#
#? SIGSEGV (0xb) at pc=0x0000ffff40820720, pid=3972473, tid=3972481

HTH,
David

@mlbridge
Copy link

mlbridge bot commented Jun 2, 2021

Mailing list message from David Holmes on hotspot-runtime-dev:

On 2/06/2021 2:01 am, Daniel D.Daugherty wrote:

On Mon, 31 May 2021 17:09:48 GMT, Xin Liu <xliu at openjdk.org> wrote:

The root cause of the intermittent failure is that _decorators in LogDecorations
may be inconsistent with LogOutput::_decorators. It could happen when gtest disables a
Log output via set_log_config(TestLogFileName, "all=off").

Since we copy the entire logDecorations, it's reasonable to copy _decorator as well.
LogDecorator is a bitmask of uint. It's even smaller than a reference on LP64 platforms.

This patch not only fixes intermittent crash in JDK-8267926, but also fixes JDK-8267952.
Two hazards of async logging in `LogConfiguration::configure_output` are eliminated.
This patch also adds a storestore barrier to make sure it's safe on weak consistency machines.

Xin Liu has updated the pull request incrementally with one additional commit since the last revision:

Increase reader's counter before creating LogDecoration.

It looks like this failure is taking a bit of work to resolve. I've gone ahead and
ProblemListed the test. At this point we have 20 sightings in Tier[3-7]. Since
this failure is intermittent, I would want to see Tier[5-7] testing done before
the fix is integrated.

I've submitted the basic fix for tiers 5-7 testing.

Xin: note you will need to merge with master and remove the test from
the Problem-list again as part of this PR.

Thanks,
David
-----

1 similar comment
@mlbridge
Copy link

mlbridge bot commented Jun 2, 2021

Mailing list message from David Holmes on hotspot-runtime-dev:

On 2/06/2021 2:01 am, Daniel D.Daugherty wrote:

On Mon, 31 May 2021 17:09:48 GMT, Xin Liu <xliu at openjdk.org> wrote:

The root cause of the intermittent failure is that _decorators in LogDecorations
may be inconsistent with LogOutput::_decorators. It could happen when gtest disables a
Log output via set_log_config(TestLogFileName, "all=off").

Since we copy the entire logDecorations, it's reasonable to copy _decorator as well.
LogDecorator is a bitmask of uint. It's even smaller than a reference on LP64 platforms.

This patch not only fixes intermittent crash in JDK-8267926, but also fixes JDK-8267952.
Two hazards of async logging in `LogConfiguration::configure_output` are eliminated.
This patch also adds a storestore barrier to make sure it's safe on weak consistency machines.

Xin Liu has updated the pull request incrementally with one additional commit since the last revision:

Increase reader's counter before creating LogDecoration.

It looks like this failure is taking a bit of work to resolve. I've gone ahead and
ProblemListed the test. At this point we have 20 sightings in Tier[3-7]. Since
this failure is intermittent, I would want to see Tier[5-7] testing done before
the fix is integrated.

I've submitted the basic fix for tiers 5-7 testing.

Xin: note you will need to merge with master and remove the test from
the Problem-list again as part of this PR.

Thanks,
David
-----

@mlbridge
Copy link

mlbridge bot commented Jun 2, 2021

Mailing list message from Liu Xin on hotspot-runtime-dev:

hi, David,

Thanks. Let's see the results of tier5~7 with the basic fix. If it can
pass all tests, it means that it's wrong to hoist AsyncLogWriter::flush().

Actually, I didn't change a lot in this PR. I hoisted the
AsyncLogWriter::flush() because in my understanding one flush() can remove
two hazards in async mode.
https://openjdk.github.io/cr/?repo=jdk&pr=4257&range=00-01#udiff-1-src/hotspot/share/logging/logConfiguration.cpp

Yesterday, when you ran the tier3~4 test, did all failures occur on aarch64?
If so, my only explanation is that this code may not be safe on aarch64.
That doesn't make sense because OrderAccess::storeload() is a full memory
barrier on aarch64.

void LogOutputList::wait_until_no_readers() const {
OrderAccess::storeload();
while (_active_readers != 0) {
// Busy wait
}
}

So far, I still can't reproduce the crash you sent me yesterday on aarch64.
I tried different configurations and even used cset to simulate different
numbers of cores.

On Tue, Jun 1, 2021 at 8:03 PM David Holmes <david.holmes at oracle.com> wrote:

On 2/06/2021 2:01 am, Daniel D.Daugherty wrote:

On Mon, 31 May 2021 17:09:48 GMT, Xin Liu <xliu at openjdk.org> wrote:

The root cause of the intermittent failure is that _decorators in
LogDecorations
may be inconsistent with LogOutput::_decorators. It could happen when
gtest disables a
Log output via set_log_config(TestLogFileName, "all=off").

Since we copy the entire logDecorations, it's reasonable to copy
_decorator as well.
LogDecorator is a bitmask of uint. It's even smaller than a reference
on LP64 platforms.

This patch not only fixes intermittent crash in JDK-8267926, but also
fixes JDK-8267952.
Two hazards of async logging in `LogConfiguration::configure_output`
are eliminated.
This patch also adds a storestore barrier to make sure it's safe on
weak consistency machines.

Xin Liu has updated the pull request incrementally with one additional
commit since the last revision:

Increase reader's counter before creating LogDecoration.

It looks like this failure is taking a bit of work to resolve. I've gone
ahead and
ProblemListed the test. At this point we have 20 sightings in Tier[3-7].
Since
this failure is intermittent, I would want to see Tier[5-7] testing done
before
the fix is integrated.

I've submitted the basic fix for tiers 5-7 testing.

Xin: note you will need to merge with master and remove the test from
the Problem-list again as part of this PR.

Thanks,
David
-----

@mlbridge
Copy link

mlbridge bot commented Jun 3, 2021

Mailing list message from David Holmes on hotspot-runtime-dev:

On 2/06/2021 1:02 pm, David Holmes wrote:

On 2/06/2021 2:01 am, Daniel D.Daugherty wrote:

On Mon, 31 May 2021 17:09:48 GMT, Xin Liu <xliu at openjdk.org> wrote:

The root cause of the intermittent failure is that _decorators in
LogDecorations
may be inconsistent with LogOutput::_decorators. It could happen
when gtest disables a
Log output via set_log_config(TestLogFileName, "all=off").

Since we copy the entire logDecorations, it's reasonable to copy
_decorator as well.
LogDecorator is a bitmask of uint. It's even smaller than a
reference on LP64 platforms.

This patch not only fixes intermittent crash in JDK-8267926, but
also fixes JDK-8267952.
Two hazards of async logging in `LogConfiguration::configure_output`
are eliminated.
This patch also adds a storestore barrier to make sure it's safe on
weak consistency machines.

Xin Liu has updated the pull request incrementally with one
additional commit since the last revision:

?? Increase reader's counter before creating LogDecoration.

It looks like this failure is taking a bit of work to resolve. I've
gone ahead and
ProblemListed the test. At this point we have 20 sightings in
Tier[3-7]. Since
this failure is intermittent, I would want to see Tier[5-7] testing
done before
the fix is integrated.

I've submitted the basic fix for tiers 5-7 testing.

That testing passed.

David
-----

Xin: note you will need to merge with master and remove the test from
the Problem-list again as part of this PR.

Thanks,
David
-----

1 similar comment
@mlbridge
Copy link

mlbridge bot commented Jun 3, 2021

Mailing list message from David Holmes on hotspot-runtime-dev:

On 2/06/2021 1:02 pm, David Holmes wrote:

On 2/06/2021 2:01 am, Daniel D.Daugherty wrote:

On Mon, 31 May 2021 17:09:48 GMT, Xin Liu <xliu at openjdk.org> wrote:

The root cause of the intermittent failure is that _decorators in
LogDecorations
may be inconsistent with LogOutput::_decorators. It could happen
when gtest disables a
Log output via set_log_config(TestLogFileName, "all=off").

Since we copy the entire logDecorations, it's reasonable to copy
_decorator as well.
LogDecorator is a bitmask of uint. It's even smaller than a
reference on LP64 platforms.

This patch not only fixes intermittent crash in JDK-8267926, but
also fixes JDK-8267952.
Two hazards of async logging in `LogConfiguration::configure_output`
are eliminated.
This patch also adds a storestore barrier to make sure it's safe on
weak consistency machines.

Xin Liu has updated the pull request incrementally with one
additional commit since the last revision:

?? Increase reader's counter before creating LogDecoration.

It looks like this failure is taking a bit of work to resolve. I've
gone ahead and
ProblemListed the test. At this point we have 20 sightings in
Tier[3-7]. Since
this failure is intermittent, I would want to see Tier[5-7] testing
done before
the fix is integrated.

I've submitted the basic fix for tiers 5-7 testing.

That testing passed.

David
-----

Xin: note you will need to merge with master and remove the test from
the Problem-list again as part of this PR.

Thanks,
David
-----

@navyxliu
Copy link
Member Author

navyxliu commented Jun 3, 2021

hi, @dholmes-ora ,

I have another guess. This crash sight could also be caused by log rotation!

Stack: [0x0000fffefdaa0000,0x0000fffefdca0000], sp=0x0000fffefdc9e510, free space=2041k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libc.so.6+0x60720] flockfile+0x0
V [libjvm.so+0x1508058] LogFileOutput::write_blocking(LogDecorations const&, char const*)+0x38
V [libjvm.so+0x1502764] AsyncLogWriter::run()+0x2c0
V [libjvm.so+0x1a49d08] Thread::call_run()+0xf8
V [libjvm.so+0x1766f18] thread_native_entry(Thread*)+0x108
C [libpthread.so.0+0x76e0] start_thread+0x198

It's possible that write_blocking is waken up from _rotation_semaphore.wait(), and proceed with writing.

int LogFileOutput::write_blocking(const LogDecorations& decorations, const char* msg) {
  _rotation_semaphore.wait();

  int written = LogFileStreamOutput::write(decorations, msg);
....
}

rotate() could fail at os::fopen() due to space or fnode limitation. It will leave _stream == NULL and make
LogFileStreamOutput::write(decorations, msg) fail at flockfile(NULL).

void LogFileOutput::rotate() {
...
  _stream = os::fopen(_file_name, FileOpenMode);
  if (_stream == NULL) {
    errstream->print_cr("Error opening log file '%s': %s",
                        _file_name, os::strerror(errno));
    return false;
  }

I don't know how to make it happen. The easiest way to verify my guess is to grep 'Error opening log file' if you still have those files. or you may try this revision. I remove AsyncLogGtest.java from problemlist.

a52019d

@navyxliu navyxliu marked this pull request as ready for review June 3, 2021 07:51
@openjdk openjdk bot added the rfr Pull request is ready for review label Jun 3, 2021
@navyxliu
Copy link
Member Author

navyxliu commented Jun 3, 2021

I double down that one flush() can eliminate two hazards. 2 concurrent tests are provided to demonstrate LogConfiguration::configure_output is MT-safe.

Previously, the test was in bad shape which cause unnecessary rotation and quite big(100M+) on /tmp. I guess that's the root cause of failure on Linux/aarch64. patched.

This patch not only fixes intermittent crash in JDK-8267926, but also fixes JDK-8267952.

@navyxliu
Copy link
Member Author

navyxliu commented Jun 3, 2021

Sorry. I just realize a big problem of mine. I try to fix multiple problems in one same PR.
That makes this PR messy and also confuse reviewers. Let me revert them back first and just focus on JDK-8267926.

This is actually the first revision.

Copy link
Member

@phohensee phohensee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. If _decorators is set to a reference to an ephemeral value, random memory stomps may ensue.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jun 3, 2021
@navyxliu
Copy link
Member Author

navyxliu commented Jun 3, 2021

David helped me verify the latest patch using tier1~7.
@dholmes-ora , could you take a look this PR and see if it's safe to bring AsyncLogGtest.java back?

I would like to ensure the regression test covers the following changes of unified logging.

@navyxliu
Copy link
Member Author

navyxliu commented Jun 7, 2021

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Jun 7, 2021
@openjdk
Copy link

openjdk bot commented Jun 7, 2021

@navyxliu
Your change (at version 199f253) is now ready to be sponsored by a Committer.

@navyxliu
Copy link
Member Author

navyxliu commented Jun 7, 2021

Thanks all reviewers!

@dholmes-ora
Copy link
Member

/sponsor

@openjdk openjdk bot closed this Jun 7, 2021
@openjdk openjdk bot added integrated Pull request has been integrated and removed sponsor Pull request is ready to be sponsored ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jun 7, 2021
@openjdk
Copy link

openjdk bot commented Jun 7, 2021

@dholmes-ora @navyxliu Since your change was applied there have been 89 commits pushed to the master branch:

  • 5ebd419: 8267972: Inline cache cleaning is not monotonic
  • 6d1f3ac: 8149138: [javadoc] Fix SerialFormBuilder eliminate String bashing
  • 58bdabc: 8268164: Adopt cast notation for WorkerThread conversions
  • 9fc914b: 8204686: Dynamic parallel reference processing support for Parallel GC
  • 908aca2: 8262891: Compiler implementation for Pattern Matching for switch (Preview)
  • 3e48244: 8268301: Closed test: compiler/c2/6371167/Test.java fails after JDK-8267904
  • 204b492: 8267703: runtime/cds/appcds/cacheObject/HeapFragmentationTest.java crashed with OutOfMemory
  • 2aeeeb4: 8268279: gc/shenandoah/compiler/TestLinkToNativeRBP.java fails after LibraryLookup is gone
  • b05fa02: 8267904: C2 crash when compile negative Arrays.copyOf length after loop
  • 95ddf7d: 8267839: trivial mem leak in numa
  • ... and 79 more: https://git.openjdk.java.net/jdk/compare/7530c00b33aac8918841dbae4d928956b60c261f...master

Your commit was automatically rebased without conflicts.

Pushed as commit b09d8b9.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

4 participants