
Possible change to ci.jenkins.io's linux agents causes core test failure #3890

Closed
NotMyFault opened this issue Jan 6, 2024 · 29 comments · Fixed by jenkinsci/jenkins#9374

@NotMyFault
Member

Service(s)

ci.jenkins.io

Summary

Hey,

It appears that a change to the Linux agents with Java 17 and 21 caused a test failure in Jenkins core. Around January 5th, MavenTest#sensitiveParameters() started to take much longer than usual.

Starting with build 5730, this single test takes 229s (!!), far exceeding our test timeout of 180s and causing builds to fail.

Build 5728 notes ~39s for the entire MavenTest, built on January 3rd.

Locally, the entire MavenTest takes less than 30s to complete for me using mvn clean verify -Dtest=hudson.tasks.MavenTest.

Interestingly, this affects Linux agents with Java 17 and 21 only. The test passes in its usual time on Windows agents with Java 17 and Linux agents with Java 11: https://ci.jenkins.io/job/Core/job/jenkins/job/master/5730/pipeline-graph/

Reproduction steps

No response

@NotMyFault
Member Author

I have just increased the timeout from 300s to 500s (!!!) because build 5736 reports the test taking 350s to complete on a Linux agent with Java 21.

@MarkEWaite

Git plugin builds are failing due to the 60 minute test timeout limit on the Windows agents. Inside the failed build logs is this message:

00:18:16  [WARNING] Could not validate integrity of download from https://repo.azure.jenkins.io/public/org/apache/mina/mina-core/2.2.3/mina-core-2.2.3.jar
00:18:16  org.eclipse.aether.transfer.ChecksumFailureException: Checksum validation failed, no checksums available
00:18:16      at org.eclipse.aether.internal.impl.AbstractChecksumPolicy.onNoMoreChecksums (AbstractChecksumPolicy.java:63)
00:18:16      at org.eclipse.aether.connector.basic.ChecksumValidator.validate (ChecksumValidator.java:112)
00:18:16      at org.eclipse.aether.connector.basic.BasicRepositoryConnector$GetTaskRunner.runTask (BasicRepositoryConnector.java:484)
00:18:16      at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run (BasicRepositoryConnector.java:414)
00:18:16      at org.eclipse.aether.util.concurrency.RunnableErrorForwarder.lambda$wrap$0 (RunnableErrorForwarder.java:66)
00:18:16      at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1136)
00:18:16      at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:635)
00:18:16      at java.lang.Thread.run (Thread.java:833)

Immediately after that message is logged, there is a multi-minute delay in the log entries.

When I replay the same job with the added Pipeline parameter useArtifactCachingProxy: false, the build passes.

@lemeurherve
Member

Likely related: #3889 (comment)

@dduportal dduportal removed the triage label Jan 8, 2024
@dduportal dduportal added this to the infra-team-sync-2024-01-09 milestone Jan 8, 2024
@NotMyFault
Member Author

Likely related: #3889 (comment)

I've replayed a PR with the artifact caching proxy opted out and the test still failed for the same reason.

@dduportal
Contributor

Likely related: #3889 (comment)

I've replayed a PR with the artifact caching proxy opted out and the test still failed for the same reason.

I believe @lemeurherve's comment was aimed at @MarkEWaite (i.e. the message Could not validate integrity of download from https://repo.azure.jenkins.io/...).

About the problem mentioned in this issue on Jenkins core (test timeouts), there isn't anything obvious in the image changelog: https://github.com/jenkins-infra/packer-images/releases/tag/1.45.0. It might be related to updated packages in the Ubuntu distribution for these images, or maybe something else.

We are diagnosing, but the only thing we know for sure is that it appeared during the 5th of January; we still don't have any other evidence for now.

@dduportal
Contributor

We've deployed a new image version (1.46.0) in jenkins-infra/jenkins-infra#3246 with a bunch of upgraded Ubuntu packages (including kernel).

@NotMyFault Can you retry a build to see the behavior? (what changed since yesterday: ACP fully cleaned up + new template image).

@dduportal
Contributor

Additionally, could be linked with #3874 (but I see that JDK11 and JDK17 are also having timeouts).

What is the test doing exactly? Does it require network operations en masse?

@NotMyFault
Member Author

I'll give it a shot later today. But I can build it normally with the acp enabled?

@dduportal
Contributor

I'll give it a shot later today. But I can build it normally with the acp enabled?

I'm not sure I fully understand your question, could you clarify?

@NotMyFault
Member Author

NotMyFault commented Jan 8, 2024

Removing the timeout increase results in the same outcome as before: https://ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/1/

The test case still takes 230s.

@dduportal
Contributor

A first check of the resource usage centered on the JDK21 case:

[Screenshot: resource usage, 2024-01-08 16:27:53]
  • Memory usage: [screenshot, 2024-01-08 16:28:15]

vs.

[Screenshot: resource usage, 2024-01-08 16:26:41]
  • Memory usage: [screenshot, 2024-01-08 16:26:28]

@dduportal
Contributor

=> not much difference in the resource usage. It's really weird 🤔

@dduportal
Contributor

@NotMyFault I'm going to roll back to 1.44.0 (the image used before the 5th of January) to see if we can return to the previous behavior.

@dduportal
Contributor

@NotMyFault I'm going to roll back to 1.44.0 (the image used before the 5th of January) to see if we can return to the previous behavior.

Launched https://ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/2/ with the 1.44.0 version. Let's see

@NotMyFault
Member Author

@NotMyFault I'm going to roll back to 1.44.0 (the image used before the 5th of January) to see if we can return to the previous behavior.

Launched ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/2 with the 1.44.0 version. Let's see

https://ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/2/testReport/ :(

@dduportal
Contributor

@NotMyFault I'm going to roll back to 1.44.0 (the image used before the 5th of January) to see if we can return to the previous behavior.

Launched ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/2 with the 1.44.0 version. Let's see

https://ci.jenkins.io/job/Core/job/jenkins/job/PR-8842/2/testReport/ :(

Ok so it is not the image. Gotta check other things then

MarkEWaite pushed a commit to krisstern/jenkins that referenced this issue Jan 9, 2024

Infrastructure issue that we don't want to allow to delay the release
candidate of 2.426.3.

jenkins-infra/helpdesk#3890 is the issue

(cherry picked from commit 3dbbf26)
@MarkEWaite

I backported the WithTimeout change from @NotMyFault to the 2.426.3 backporting pull request so that the 2.426.3 release candidate on Wednesday 10 Jan 2024 would not risk being delayed. The change is purely in a test, so very low risk.
The change is jenkinsci/jenkins@914f0d4
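
For reference, the backported workaround is only a per-test timeout bump. A minimal sketch of what that kind of change looks like in a JenkinsRule-based test (the annotation value and the elided body below are illustrative, not the exact diff in jenkinsci/jenkins@914f0d4):

```java
import org.junit.Rule;
import org.junit.Test;
import org.jvnet.hudson.test.JenkinsRule;
import org.jvnet.hudson.test.recipes.WithTimeout;

public class MavenTest {

    @Rule
    public JenkinsRule j = new JenkinsRule();

    // Raise the per-test timeout (in seconds) well above the harness default
    // so that slow artifact resolution on CI agents does not fail the build.
    @Test
    @WithTimeout(500)
    public void sensitiveParameters() throws Exception {
        // ... existing test body unchanged ...
    }
}
```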

@NotMyFault
Member Author

@dduportal Following up on the chat in Matrix, I've replayed the PR with the reverted image and the test is no longer failing on any JDK.

@dduportal
Contributor

@NotMyFault can you confirm in the next 48h it is still ok by closing the issue with a comment?

@NotMyFault
Member Author

@NotMyFault can you confirm in the next 48h it is still ok by closing the issue with a comment?

Wouldn't bumping the template again in the future result in a failing test, as before?

@dduportal
Contributor

@NotMyFault can you confirm in the next 48h it is still ok by closing the issue with a comment?

Wouldn't bumping the template again in the future result in a failing test, as before?

I don't see the causality here. It happened on the same day (i.e. an apparent correlation), but using the previous 1.44.0 template also triggered the timeout issue, which shows it is unrelated. Then switching back to 1.45.0 showed the issue fixed, and 1.46.0 as well.

It means we don't really know what caused it, but it's clearly not the template.

However, we can keep searching for what changed. The network is a good clue, as we changed a lot of things these past few days.

@dduportal dduportal modified the milestones: infra-team-sync-2024-01-09, infra-team-sync-2024-01-16 Jan 10, 2024
@NotMyFault
Member Author

It means we don't really know what caused it, but it's clearly not the template.

I guess we can close the issue for now. The timeout is set to 500s and thanks to @MarkEWaite this change is present on the current LTS line too, preventing future issues.

@NotMyFault NotMyFault closed this as not planned Jan 10, 2024
@basil
Collaborator

basil commented Jan 10, 2024

There should be an issue tracking the removal of the workaround.

@basil
Collaborator

basil commented Jan 17, 2024

I have filed JENKINS-72577 to track the removal of the workaround added in jenkinsci/jenkins#8840. JENKINS-72577 in turn depends on the resolution of the operational issue that led to the workaround, tracked in this (now reopened) ticket.

@basil basil reopened this Jan 17, 2024
@dduportal
Contributor

The issue is marked as related to ACP, but I still haven't found any clue that it is (or that it is not). Did you see anything pointing in that direction?

@basil
Collaborator

basil commented Jan 17, 2024

The very slow test runtime is only happening on CI infrastructure, as the test takes only 5 seconds for me locally, even when running in a clean Docker container without a warm Maven cache. Whether the operational issue is due to ACP or some other area of infrastructure I cannot be sure, but I suspect ACP because the test is invoking a second Maven process, and we have seen various ACP complications in the past running Maven within Maven such as jenkinsci/maven-hpi-plugin#541. I have updated the wording of JENKINS-72577 to clarify the above.
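
To make the "second Maven process" point concrete: the test drives a real Maven build step, so the JVM under test forks a child mvn process that resolves artifacts over the network. A rough, hypothetical sketch of that shape (not the actual hudson.tasks.MavenTest source; the class name, goal, and parameter values are illustrative, and it assumes a Maven installation is reachable by the agent):

```java
import hudson.model.FreeStyleBuild;
import hudson.model.FreeStyleProject;
import hudson.model.ParametersDefinitionProperty;
import hudson.model.PasswordParameterDefinition;
import hudson.tasks.Maven;
import org.junit.Rule;
import org.junit.Test;
import org.jvnet.hudson.test.JenkinsRule;

public class SensitiveParametersSketch {

    @Rule
    public JenkinsRule j = new JenkinsRule();

    @Test
    public void passwordParameterIsMaskedInMavenBuild() throws Exception {
        FreeStyleProject p = j.createFreeStyleProject();
        // A sensitive parameter whose value must never appear in the build log.
        p.addProperty(new ParametersDefinitionProperty(
                new PasswordParameterDefinition("PASSWORD", "s3cr3t", "hypothetical secret")));
        // The Maven build step forks a second "mvn" process; on ci.jenkins.io that child
        // resolves dependencies through the configured mirror (the artifact caching proxy),
        // so slow or failing artifact resolution inflates this test's runtime.
        p.getBuildersList().add(new Maven("validate", null));
        FreeStyleBuild build = p.scheduleBuild2(0).get();
        j.assertLogNotContains("s3cr3t", build);
    }
}
```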

@dduportal
Contributor

The very slow test runtime is only happening on CI infrastructure, as the test takes only 5 seconds for me locally, even when running in a clean Docker container without a warm Maven cache. Whether the operational issue is due to ACP or some other area of infrastructure I cannot be sure, but I suspect ACP because the test is invoking a second Maven process, and we have seen various ACP complications in the past running Maven within Maven such as jenkinsci/maven-hpi-plugin#541. I have updated the wording of JENKINS-72577 to clarify the above.

No problem: I did not have enough time (nor did the rest of the team) to check in detail.

Reopening the issue is the correct move (thanks for doing it!) as the problem is present on ci.jenkins.io infrastructure: we acknowledge it.

My question was really in case you had already seen something that we missed. Thanks for the explanation.

@dduportal
Contributor

Opened jenkinsci/jenkins#9374 as the tests now take around 5s on Linux with JDK21, following 4 months of infrastructure changes (see the PR message).

@dduportal dduportal self-assigned this Jun 11, 2024
@dduportal dduportal added this to the infra-team-sync-2024-06-11 milestone Jun 11, 2024
@dduportal
Contributor

Closing the infra issue as we confirmed the problem is now gone.

Besides, jenkinsci/jenkins#9374 has been approved and will be merged, closing https://issues.jenkins.io/browse/JENKINS-72577.
