Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky/Intermittent FT test timeout in native image: tests.integration.nativeimage.mp1.Mp1Main #4296

Closed
ljnelson opened this issue May 31, 2022 · 6 comments

Comments

@ljnelson
Copy link
Member

One example: https://builds.helidon.io/678781B6AE357E68F35B0CF7E3286FDA/logs/34

io.helidon.common.Errors$ErrorMessagesException: FATAL: FT Fallback failed. java.lang.RuntimeException: intentional failure at class io.helidon.tests.integration.nativeimage.mp1.Mp1Main
FATAL: FT Retry failed. java.lang.RuntimeException: Failed at class io.helidon.tests.integration.nativeimage.mp1.Mp1Main
FATAL: FT timeout, expected "timeout", actual: "This method should have timed out after 1 second" at class io.helidon.tests.integration.nativeimage.mp1.Mp1Main
	at io.helidon.common.Errors.checkValid(Errors.java:183)
	at io.helidon.tests.integration.nativeimage.mp1.Mp1Main.testBean(Mp1Main.java:258)
	at io.helidon.tests.integration.nativeimage.mp1.Mp1Main.main(Mp1Main.java:103)
Tests finished in 3200 millis
@tjquinno
Copy link
Member

I have seen this in pipeline jobs at least twice in the past few days also. Native only.

@spericas spericas self-assigned this Jun 2, 2022
@spericas
Copy link
Member

spericas commented Jun 2, 2022

Quick test on a heavily loaded Mac with GraalVM 21.1.0, no failures.

(~/Projects/helidon-spericas-master/tests/integration/native-image/mp-1) >> for i in 1 2 3 4 5 6 7 8 9 10; do target/helidon-tests
-native-image-mp-1; done 2>&1  > test.log
ABean.timeout() has been interrupted after 1004 millis
ABean.timeout() has been interrupted after 1003 millis
ABean.timeout() has been interrupted after 1009 millis
ABean.timeout() has been interrupted after 1003 millis
ABean.timeout() has been interrupted after 1001 millis
ABean.timeout() has been interrupted after 1004 millis
ABean.timeout() has been interrupted after 1000 millis
ABean.timeout() has been interrupted after 1004 millis
ABean.timeout() has been interrupted after 1002 millis
ABean.timeout() has been interrupted after 1000 millis

test.log

@spericas
Copy link
Member

spericas commented Jun 3, 2022

I've also tested it with GraalVM 21.3.0 + JDK 11 with same outcome.

@barchetta barchetta changed the title Flaky/Intermittent FT test timeout in native image Flaky/Intermittent FT test timeout in native image: tests.integration.nativeimage.mp1.Mp1Main Jun 3, 2022
@barchetta
Copy link
Member

barchetta commented Jun 3, 2022

We've made some progress and it is looking like this is a regression. The findings:

  1. The issue is intermittent
  2. Intermittently native-image builds a binary where MicroProfile FaultTolerance does not work. There is no obvious error or issue during the building of the native-image
  3. Once a "bad" binary is built, the FT tests fail every time the binary is run
  4. I have reproduced this on Oracle Linux 7.7 using graalvm-ce-java11-21.3.0. Two out of six rebuilds resulted in a bad binary
  5. Inspecting the good and bad binaries using strings and grep show differences like this:
126c126
< io.helidon.microprofile.faulttolerance.CommandFallback$1
---
> io.helidon.microprofile.faulttolerance.CommandFallback$1`=

So the "good" binary had entries with the trailing = (significance unknown)
6. After reverting helidon-2.x branch to commit 0c3530d the issue appears to go away. I was able to rebuild the binary 18 times without seeing the issue.
7. So that seems to point the finger at #4216 as introducing a regression. This also lines up with when we started to see this failure in the pipeline.
8. This does not show up in the 3.x pipelines because the test is currently disabled in 3.x

@spericas
Copy link
Member

spericas commented Jun 8, 2022

This seems to be an intermittent problem in Graal related to annotation type packages.

When it works:

2022.06.07 22:04:21 FINE io.helidon.microprofile.faulttolerance.AnnotationFinder Thread[main,5,main]: @@@ a1 is @org.eclipse.microprofile.faulttolerance.Timeout
2022.06.07 22:04:21 FINE io.helidon.microprofile.faulttolerance.AnnotationFinder Thread[main,5,main]: @@@ a1.annotationType() is interface org.eclipse.microprofile.faulttolerance.Timeout
2022.06.07 22:04:21 FINE io.helidon.microprofile.faulttolerance.AnnotationFinder Thread[main,5,main]: @@@ a1.annotationType().getPackage() package org.eclipse.microprofile.faulttolerance
2022.06.07 22:04:21 FINE io.helidon.microprofile.faulttolerance.AnnotationFinder Thread[main,5,main]: @@@ a1.annotationType().getPackage().getName() org.eclipse.microprofile.faulttolerance
2022.06.07 22:04:21 FINE io.helidon.microprofile.faulttolerance.AnnotationFinder Thread[main,5,main]: @@@ a1.getClass() class org.eclipse.microprofile.faulttolerance.Timeout$$ProxyImpl
2022.06.07 22:04:21 FINE io.helidon.microprofile.faulttolerance.AnnotationFinder Thread[main,5,main]: @@@ a1.getClass().getPackage() package org.eclipse.microprofile.faulttolerance

When it fails:

2022.06.07 22:09:43 FINE io.helidon.microprofile.faulttolerance.AnnotationFinder Thread[main,5,main]: @@@ a1 is @org.eclipse.microprofile.faulttolerance.Timeout
2022.06.07 22:09:43 FINE io.helidon.microprofile.faulttolerance.AnnotationFinder Thread[main,5,main]: @@@ a1.annotationType() is interface org.eclipse.microprofile.faulttolerance.Timeout
2022.06.07 22:09:43 FINE io.helidon.microprofile.faulttolerance.AnnotationFinder Thread[main,5,main]: @@@ a1.annotationType().getPackage() package com.sun.proxy
2022.06.07 22:09:43 FINE io.helidon.microprofile.faulttolerance.AnnotationFinder Thread[main,5,main]: @@@ a1.annotationType().getPackage().getName() com.sun.proxy
2022.06.07 22:09:43 FINE io.helidon.microprofile.faulttolerance.AnnotationFinder Thread[main,5,main]: @@@ a1.getClass() class org.eclipse.microprofile.faulttolerance.Timeout$$ProxyImpl
2022.06.07 22:09:43 FINE io.helidon.microprofile.faulttolerance.AnnotationFinder Thread[main,5,main]: @@@ a1.getClass().getPackage() package com.sun.proxy

The annotation type package should not return the proxy's package. For some reason, this decision is baked into the executable that native-image generates. A bad binary is created about 1 in 4 times on Linux with 21.3.0. Graal 21.3.2 seems better as our analysis required 11 tries to get a failure. The fix is simply to avoid calling getPackage in this case.

@spericas
Copy link
Member

spericas commented Jun 8, 2022

PR #4349
PR #4350

@spericas spericas closed this as completed Jun 8, 2022
@m0mus m0mus added this to Backlog Aug 12, 2024
@m0mus m0mus moved this to Closed in Backlog Aug 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

No branches or pull requests

4 participants