
Instability of artifact-caching-proxy #4442

Closed · darinpope opened this issue Dec 6, 2024 · 30 comments

@darinpope (Collaborator) commented Dec 6, 2024

Service(s)

Artifact-caching-proxy

Summary

Bruno had to run the weekly BOM release process five times today (2024-12-06) because of errors like the following:

  • Could not transfer artifact com.google.crypto.tink:tink:jar:1.10.0 from/to azure-aks-internal (http://artifact-caching-proxy.artifact-caching-proxy.svc.cluster.local:8080/): Premature end of Content-Length delimited message body (expected: 2,322,048; received: 1,572,251)

Here's the issue where he tracked the build numbers so you can see the specific failures:

jenkinsci/bom#4066

I also had similar issues doing a BOM weekly-test against a core RC that I'm working on:

Since I started working on the BOM a couple of months ago, this problem seems to have been getting worse (more unstable) as the weeks progress.

Reproduction steps

Unfortunately, it is not reproducible on demand.
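For context: the `azure-aks-internal` id and the URL in the error above come from the Maven mirror that routes ci.jenkins.io builds through the artifact caching proxy. A minimal sketch of such a `settings.xml` mirror, assuming ACP is wired in as a mirror of repo.jenkins-ci.org (the real settings are provided by the infrastructure and may differ):

```xml
<!-- Sketch only: Maven mirror sending repository traffic through ACP.
     The id and URL are copied from the error message above; the
     mirrorOf value is an assumption, not the actual ci.jenkins.io config. -->
<settings>
  <mirrors>
    <mirror>
      <id>azure-aks-internal</id>
      <name>Artifact caching proxy (internal)</name>
      <url>http://artifact-caching-proxy.artifact-caching-proxy.svc.cluster.local:8080/</url>
      <mirrorOf>repo.jenkins-ci.org</mirrorOf>
    </mirror>
  </mirrors>
</settings>
```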

@darinpope added the "triage" label (Incoming issues that need review) Dec 6, 2024
@darinpope changed the title from "High number of …" to "Instability of artifact-caching-proxy" Dec 6, 2024
@dduportal added this to the infra-team-sync-2024-12-10 milestone Dec 7, 2024
@dduportal removed the "triage" label Dec 9, 2024
@dduportal self-assigned this Dec 9, 2024
@dduportal (Contributor) commented:

Starting to analyse the logs on the ACP side.

@dduportal (Contributor) commented:

For each of the failing requests found in the past 15 days (including each one you folks logged), ACP reported an error caused by the upstream, in one of the following categories:

  • upstream prematurely closed connection while reading upstream
  • peer closed connection in SSL handshake (104: Connection reset by peer) while SSL handshaking to upstream
  • upstream timed out (110: Operation timed out) while SSL handshaking to upstream
  • Error HTTP/500 responded by Artifactory

We also had 1 occurrence of "repo.jenkins-ci.org could not be resolved (2: Server failure)", which indicates a local DNS resolution error.

@dduportal (Contributor) commented:

=> The errors are definitely not due to an ACP problem: by design, it only "reports" the upstream error.
Some timeouts could possibly be caused by the TCP tuning on the ACP instance: that needs checking.

=> We could check whether we can "retry" the upstream in case of error; I need to recall which cases can be caught.
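For reference, nginx can retry a request on some of these error categories via `proxy_next_upstream`; a sketch, assuming ACP keeps its nginx-based setup (directive values are illustrative, not the current ACP configuration):

```nginx
# Sketch only: conditions under which nginx retries against another upstream server.
# Retries only move to the *next* server, so this helps when the upstream
# resolves to several addresses (it does not re-try a single failing server).
location / {
    proxy_pass https://repo.jenkins-ci.org;

    # Retry on connection errors, timeouts and selected upstream HTTP codes.
    proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
    proxy_next_upstream_tries   3;    # cap the number of attempts
    proxy_next_upstream_timeout 30s;  # cap the total time spent retrying
}
```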

@dduportal (Contributor) commented:

@MarkEWaite opened a PR, based on a discussion we had during the previous infra meeting: jenkinsci/bom#4095

The goal is to "pre-heat" the cache to decrease the probability of facing these issues.
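One simple form of pre-heating (not necessarily what the PR implements) is to resolve the dependency tree once, before fanning out the parallel PCT runs, so the proxy has already cached most of the artifacts; a minimal Pipeline sketch:

```groovy
// Sketch only: warm the artifact caching proxy before the parallel PCT fan-out.
// dependency:go-offline resolves (and therefore downloads through ACP) the
// build's dependencies and plugins up front.
node('maven') {            // illustrative agent label
    checkout scm
    sh 'mvn -B -Dmaven.repo.local=$WORKSPACE/.m2/repository dependency:go-offline'
}
```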

@dduportal (Contributor) commented:

> @MarkEWaite opened a PR, based on a discussion we had during the previous infra meeting: jenkinsci/bom#4095
>
> The goal is to "pre-heat" the cache to decrease the probability of facing these issues.

I haven't heard about any ACP problem with the BOM since the "pre-heat" PR was merged. Of course there might have been some (I have not checked with due diligence).

Were there any issues in the past 3 weeks @basil @darinpope @Poddingue @MarkEWaite @alecharpentier?

For info, this issue is stale until we've finished migrating ci.jenkins.io to AWS (see #4313), which implies a new ACP instance (on a new infrastructure).

@dduportal (Contributor) commented:

Updating: @darinpope reported suffering from ACP cache issues last Friday (https://ci.jenkins.io/job/Tools/job/bom/job/master/3786). We can correlate these build errors with ACP error messages in Datadog:

[Screenshot: ACP error messages in Datadog]

All of these errors map to HTTP/503 responses from AWS S3:

10.0.149.154 - - [24/Jan/2025:16:20:39 +0000] "GET /org/jenkins-ci/main/jenkins-test-harness-htmlunit/187.v1e8425eb_77c5/jenkins-test-harness-htmlunit-187.v1e8425eb_77c5.jar HTTP/1.1" 503 0 0.097 "-" "Apache-Maven/3.9.9 (Java 17.0.13; Linux 6.1.119-129.201.amzn2023.x86_64)" "-" "52.22.177.50:443 : 52.217.117.1:443" "302 : 503" "0.064 : 0.032"

The "302 : 503" "0.064 : 0.032" indicates that the request from ACP to Artifactory was redirected ("302") to another URL which is always an AWS S3 (as per JFrog setup) bucket hosting the binary to download. The redirected URL has a JWT token valid for 1 hour in the query string: it used to be a configuration issue (stripping the token) but it would be a HTTP/403 or HTTP/401 error (depending on the config error).

The HTTP/503 on AWS could be a "Slow Down" error: https://repost.aws/questions/QU_F-UC6-fSdOYzp-gZSDTvQ/receiving-s3-503-slow-down-responses.

@dduportal (Contributor) commented:

Update: while working on #4317, I also faced the same issue when testing the BOM builds on the new controller aws.ci.jenkins.io (aimed at replacing ci.jenkins.io ASAP).

I had to enable SNI passing to the upstream (jenkins-infra/kubernetes-management#6155) to get rid of this error in my tests.

Checking the Azure ACP logs (used in production today), I can find the same kind of SSL handshake errors during the BOM build failures.

=> It looks sensible to enable this on the current ACP in Azure as well.
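For reference, on the nginx side "SNI passing to upstream" boils down to two directives; a sketch, assuming ACP proxies to Artifactory over TLS (the actual change is in jenkins-infra/kubernetes-management#6155):

```nginx
# Sketch only: send the upstream host name in the TLS handshake (SNI),
# so the upstream front end can pick the right certificate/virtual host.
proxy_ssl_server_name on;            # enable SNI towards the upstream
proxy_ssl_name repo.jenkins-ci.org;  # name to send (defaults to $proxy_host)
```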

@dduportal (Contributor) commented Jan 28, 2025

jenkins-infra/kubernetes-management#6159 has been applied. @darinpope, feel free to reopen the issue if you see new ACP problems again.

Note: at FOSDEM 2025 we'll discuss the opportunity to use another caching system, but on the client side: https://plugins.jenkins.io/jobcacher/ was mentioned by @basil as a great opportunity.
That discussion should lead to a new issue if it looks like a good idea to work on.

@dduportal reopened this Jan 30, 2025
@alecharp commented:

In the situation we have here, running PCT for all the plugins, shouldn't the local Maven repository be the same for all the plugins? If that is correct, could a "pre-fetch" of the artifacts, caching them in a tarball, and using that tarball for all the plugins be enough?

@timja (Member) commented Jan 30, 2025

Mark attempted something similar in jenkinsci/bom#4095, but yes, that makes sense if we can stash and unstash the workspace for each build (potentially using a cloud plugin so we aren't creating a large amount of traffic between the controller and each agent?).

@basil (Collaborator) commented Jan 30, 2025

Sure, I was simply bringing up the default behavior of GitHub Actions to illustrate that the practice of caching workspaces as tarballs has been demonstrated to be practical at scale, something that has not been demonstrated with the artifact caching proxy approach.

@dduportal (Contributor) commented:

> In the situation we have here, running PCT for all the plugins, shouldn't the local Maven repository be the same for all the plugins? If that is correct, could a "pre-fetch" of the artifacts, caching them in a tarball, and using that tarball for all the plugins be enough?

> Mark attempted something similar in jenkinsci/bom#4095, but yes, that makes sense if we can stash and unstash the workspace for each build (potentially using a cloud plugin so we aren't creating a large amount of traffic between the controller and each agent?).

The initial idea (a tiny step) was to pre-fetch. Since the Artifact Caching Proxy has 2 replicas, it is clearly not enough. Note that this assumption was made with a set of ~350 parallel PCT tests in the BOM builds; I don't know the exact number now, but it is way higher.

> Sure, I was simply bringing up the default behavior of GitHub Actions to illustrate that the practice of caching workspaces as tarballs has been demonstrated to be practical at scale, something that has not been demonstrated with the artifact caching proxy approach.

@basil I really like the GitHub cache approach because it has a set of restrictions.

A few notes after a first quick check of the issues:

* I want to avoid any risk of cache poisoning, where anyone opening a PR on a plugin could add a malicious dependency to the cache.
* Looking at https://plugins.jenkins.io/jobcacher/, I'm not sure how cache access is managed. It looks like the cache is segregated by job and by branch, and a new branch can have its cache seeded from the default branch.
* Since it is provided by a pipeline step, we can assume that a cache-poisoning attempt could only come from a jenkinsci/ GitHub repository maintainer (as per our security rules). That looks good enough (as long as the cache cannot be written across branches or across jobs).

=> Did I understand correctly? Can I have another pair of eyes on my hypothesis, please?

@basil (Collaborator) commented Jan 30, 2025

I haven’t used Job Cacher. Another thing to keep in mind is that caching the results of a build of jenkinsci/bom isn’t necessarily sufficient for running PCT against a particular plugin. Building and running a plugin’s tests pulls in test-scoped dependencies that wouldn’t have necessarily been cached during a build of the BOM itself.

@basil (Collaborator) commented Jan 30, 2025

For what it’s worth, I have zero caching problems with local BOM and PCT builds. My local Maven cache (which is shared between all my builds) grows without bound until it gets too big, at which point I simply delete it and start over. The same philosophy could be applied to CI builds: additively aggregate the fetched artifacts from all completed builds to a central place, use it as the starting point for all new builds (transferred via tarball), and periodically delete it and start over when it gets too big. The main difference between this and an artifact caching proxy approach is the transfer mechanism: I am advocating for transferring the cache in a single tarball and TCP connection rather than creating hundreds of thousands of connections per build.
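A rough Pipeline sketch of that tarball-based transfer, assuming the aggregated repository is published at a central URL between builds (URL, paths, and the upload step are illustrative, not an existing job):

```groovy
// Sketch only: seed the local Maven repository from one tarball (one TCP
// connection) instead of fetching hundreds of thousands of artifacts,
// then repackage the possibly-grown repository for the next builds.
node {
    def cacheUrl = 'https://example.org/bom-cache/m2-repository.tar.gz' // hypothetical location

    checkout scm
    sh """
      mkdir -p .m2/repository
      curl -fsSL ${cacheUrl} | tar -xz -C .m2/repository || true  # tolerate a missing/expired cache
    """

    sh 'mvn -B -Dmaven.repo.local=$WORKSPACE/.m2/repository verify'

    // Re-publish the updated repository for later builds (upload step omitted).
    sh 'tar -cz -C .m2/repository . > m2-repository.tar.gz'
}
```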

@timja (Member) commented Jan 31, 2025

PoC PR up at jenkinsci/bom#4351

If we're happy to try it, then we would need some storage; I assume Azure for now, switching to AWS as part of the ci.jenkins.io move?

@dduportal (Contributor) commented:

> PoC PR up at jenkinsci/bom#4351
>
> If we're happy to try it, then we would need some storage; I assume Azure for now, switching to AWS as part of the ci.jenkins.io move?

Let's start with the controller storage (the default) on the Azure VM: fewer things to set up (we can increase the size and performance of the ci.jenkins.io data disk where JENKINS_HOME is).

@timja (Member) commented Jan 31, 2025

> Let's start with the controller storage (the default) on the Azure VM: fewer things to set up (we can increase the size and performance of the ci.jenkins.io data disk where JENKINS_HOME is).

I think the default storage uses remoting, which is probably not ideal, but we can try it out.

@basil (Collaborator) commented Jan 31, 2025

Additively aggregating the fetched artifacts as I described above sounds tricky to implement in a CI job. Since all PCT runs are running in parallel, it sounds difficult to find a way to cache the superset of all the artifacts they fetched. Perhaps they could each stash their .m2 directory, then unstash all of the stashes to a single place to combine them all together, and then cache that? Conceptually we should think about how to aggregate the artifacts fetched by all PCT runs, not just a build of the BOM itself.
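A rough Pipeline sketch of that aggregation, assuming each PCT branch keeps its local repository under `.m2/repository` in its workspace (the plugin list and the PCT command are illustrative):

```groovy
// Sketch only: each parallel PCT run stashes the artifacts it fetched,
// then a final stage unstashes everything into one workspace to build
// the superset repository that would be cached for the next builds.
def plugins = ['plugin-a', 'plugin-b']     // illustrative plugin list
def branches = [:]
plugins.each { plugin ->
    branches[plugin] = {
        node {
            sh "run-pct-for ${plugin}"      // hypothetical PCT invocation
            stash name: "m2-${plugin}", includes: '.m2/repository/**', allowEmpty: true
        }
    }
}
parallel branches

node {
    // Unstashing into the same workspace overlays the per-plugin repositories.
    plugins.each { unstash "m2-${it}" }
    sh 'tar -cz -C .m2/repository . > m2-superset.tar.gz'   // cache/publish this tarball
}
```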

@timja (Member) commented Jan 31, 2025

I had a chat with @dduportal earlier; we're going to try caching just the master branch first.

@dduportal (Contributor) commented:

> Additively aggregating the fetched artifacts as I described above sounds tricky to implement in a CI job. Since all PCT runs are running in parallel, it sounds difficult to find a way to cache the superset of all the artifacts they fetched. Perhaps they could each stash their .m2 directory, then unstash all of the stashes to a single place to combine them all together, and then cache that? Conceptually we should think about how to aggregate the artifacts fetched by all PCT runs, not just a build of the BOM itself.

I understand that we would need to set up a cache key for each stage based on its name (so it can get its cache from the previous master branch build).
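A sketch of what that could look like with the Job Cacher `cache` step, with one cache per PCT stage seeded from master; the parameter names are taken from the Job Cacher documentation as I recall it and the values are illustrative, so the PoC PR may well do this differently:

```groovy
// Sketch only: per-stage cache of the local Maven repository, restored from
// the master branch when the current branch has no cache of its own yet.
def stageName = 'pct-some-plugin'            // illustrative stage/plugin name
cache(maxCacheSize: 2048,                    // MB, illustrative
      defaultBranch: 'master',               // seed new branches from master
      caches: [
          arbitraryFileCache(
              path: '.m2/repository',                         // what to cache, relative to the workspace
              cacheName: stageName,                           // key the cache on the stage name
              cacheValidityDecidingFile: 'bom-weekly/pom.xml' // illustrative invalidation trigger
          )
      ]) {
    sh "run-pct-for ${stageName}"            // hypothetical PCT invocation
}
```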

@jonesbusy commented Feb 2, 2025

Sadly, I missed those discussions at the summit since I was in other groups.

We are using jobcacher in our company and it works well with the Artifactory backend. Also, two collaborators (including myself) are maintaining the jobcacher plugin, so I'm more than happy to help with any feature or fix needed on it.

One of the limitations is that jobcacher seems to time out for very large caches when they need more than 5 minutes for upload or download. It seems to be related to the CPS timeout:

jenkinsci/jobcacher-plugin#334

@timja (Member) commented Feb 2, 2025

Have a look through the PoC if you haven’t already: jenkinsci/bom#4351

@dduportal (Contributor) commented:

I've created #4525 to specifically track actions and discussions regarding Job Cacher.

@jglick commented Feb 10, 2025

> the practice of caching workspaces as tarballs has been demonstrated to be practical at scale, something that has not been demonstrated with the artifact caching proxy approach

FWIW, CloudBees runs a job vaguely analogous to bom in (PCT) scope and scale, which uses an artifact caching proxy rather than workspace caching. The services and infrastructure are of course pretty different.

@dduportal (Contributor) commented:

Closing this issue as discussed in the team meeting (18 Feb 2025) for the following reasons:

* The AWS ACP does seem to handle the waves of requests better. The ~15 BOM builds, including the one warming it up from scratch, did not trigger errors.
* Let's focus our efforts on [[ci.jenkins.io] Enable Maven dependencies client-side caching for BOM with Job Cacher #4525](https://github.com/jenkins-infra/helpdesk/issues/4525), which would benefit all BOM builds.

@dduportal (Contributor) commented:

> Closing this issue as discussed in the team meeting (18 Feb 2025) for the following reasons:
>
> * The AWS ACP does seem to handle the waves of requests better. The ~15 BOM builds, including the one warming it up from scratch, did not trigger errors.
> * Let's focus our efforts on [[ci.jenkins.io] Enable Maven dependencies client-side caching for BOM with Job Cacher #4525](https://github.com/jenkins-infra/helpdesk/issues/4525), which would benefit all BOM builds.

Looks like I was wrong: #4545 (comment)
