
Instability of artifact-caching-proxy #4442

Closed · darinpope opened this issue Dec 6, 2024 · 30 comments

@darinpope (Collaborator) commented Dec 6, 2024

Service(s)

Artifact-caching-proxy

Summary

Bruno had to run the weekly BOM release process five times today (2024-12-06) because of errors like the following:

  • Could not transfer artifact com.google.crypto.tink:tink:jar:1.10.0 from/to azure-aks-internal (http://artifact-caching-proxy.artifact-caching-proxy.svc.cluster.local:8080/): Premature end of Content-Length delimited message body (expected: 2,322,048; received: 1,572,251)

Here's the issue where he tracked the build numbers so you can see the specific failures:

jenkinsci/bom#4066

I also had similar issues doing a BOM weekly-test against a core RC that I'm working on:

Since I started working on the BOM a couple of months ago, this problem seems to have been getting worse (more unstable) as the weeks progress.

Reproduction steps

Unfortunately, it is not reproducible on demand.
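For context: the `azure-aks-internal` id and the URL in the error above come from the Maven mirror that routes ci.jenkins.io builds through the artifact caching proxy. A minimal sketch of such a `settings.xml` mirror, assuming ACP is wired in as a mirror of repo.jenkins-ci.org (the real settings are provided by the infrastructure and may differ):

```xml
<!-- Sketch only: Maven mirror sending repository traffic through ACP.
     The id and URL are copied from the error message above; the
     mirrorOf value is an assumption, not the actual ci.jenkins.io config. -->
<settings>
  <mirrors>
    <mirror>
      <id>azure-aks-internal</id>
      <name>Artifact caching proxy (internal)</name>
      <url>http://artifact-caching-proxy.artifact-caching-proxy.svc.cluster.local:8080/</url>
      <mirrorOf>repo.jenkins-ci.org</mirrorOf>
    </mirror>
  </mirrors>
</settings>
```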

@darinpope added the "triage" label (Incoming issues that need review) Dec 6, 2024
@darinpope changed the title from "High number of …" to "Instability of artifact-caching-proxy" Dec 6, 2024
@dduportal added this to the infra-team-sync-2024-12-10 milestone Dec 7, 2024
@dduportal removed the "triage" label Dec 9, 2024
@dduportal self-assigned this Dec 9, 2024
@dduportal (Contributor) commented:

Starting to analyse the logs on the ACP side.

@dduportal (Contributor) commented:

For each of the failing requests found in the past 15 days (including each one you folks logged), ACP reported an error caused by the upstream, in one of the following categories:

  • upstream prematurely closed connection while reading upstream
  • peer closed connection in SSL handshake (104: Connection reset by peer) while SSL handshaking to upstream
  • upstream timed out (110: Operation timed out) while SSL handshaking to upstream
  • Error HTTP/500 responded by Artifactory

We also had 1 occurrence of "repo.jenkins-ci.org could not be resolved (2: Server failure)", which indicates a local DNS resolution error.

@dduportal (Contributor) commented:

=> The errors are definitely not due to an ACP problem: by design, it only "reports" the upstream error.
Some timeouts could possibly be caused by the TCP tuning on the ACP instance: that needs checking.

=> We could check whether we can "retry" the upstream in case of error; I need to recall which cases can be caught.
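For reference, nginx can retry a request on some of these error categories via `proxy_next_upstream`; a sketch, assuming ACP keeps its nginx-based setup (directive values are illustrative, not the current ACP configuration):

```nginx
# Sketch only: conditions under which nginx retries against another upstream server.
# Retries only move to the *next* server, so this helps when the upstream
# resolves to several addresses (it does not re-try a single failing server).
location / {
    proxy_pass https://repo.jenkins-ci.org;

    # Retry on connection errors, timeouts and selected upstream HTTP codes.
    proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
    proxy_next_upstream_tries   3;    # cap the number of attempts
    proxy_next_upstream_timeout 30s;  # cap the total time spent retrying
}
```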

@dduportal (Contributor) commented:

@MarkEWaite opened a PR, based on a discussion we had during the previous infra meeting: jenkinsci/bom#4095

The goal is to "pre-heat" the cache to decrease the probability of facing these issues.
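One simple form of pre-heating (not necessarily what the PR implements) is to resolve the dependency tree once, before fanning out the parallel PCT runs, so the proxy has already cached most of the artifacts; a minimal Pipeline sketch:

```groovy
// Sketch only: warm the artifact caching proxy before the parallel PCT fan-out.
// dependency:go-offline resolves (and therefore downloads through ACP) the
// build's dependencies and plugins up front.
node('maven') {            // illustrative agent label
    checkout scm
    sh 'mvn -B -Dmaven.repo.local=$WORKSPACE/.m2/repository dependency:go-offline'
}
```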

@dduportal (Contributor) commented:

> @MarkEWaite opened a PR, based on a discussion we had during the previous infra meeting: jenkinsci/bom#4095
>
> The goal is to "pre-heat" the cache to decrease the probability of facing these issues.

I haven't heard about any ACP problem with the BOM since the "pre-heat" PR was merged. Of course there might have been some (I have not checked with due diligence).

Were there any issues in the past 3 weeks @basil @darinpope @Poddingue @MarkEWaite @alecharpentier?

For info, this issue is stale until we've finished migrating ci.jenkins.io to AWS (see #4313), which implies a new ACP instance (on a new infrastructure).

@dduportal (Contributor) commented:

Updating: @darinpope reported suffering from ACP cache issues last Friday (https://ci.jenkins.io/job/Tools/job/bom/job/master/3786). We can correlate these build errors with ACP error messages in Datadog:

[Screenshot: ACP error messages in Datadog]

All of these errors map to HTTP/503 responses from AWS S3:

10.0.149.154 - - [24/Jan/2025:16:20:39 +0000] "GET /org/jenkins-ci/main/jenkins-test-harness-htmlunit/187.v1e8425eb_77c5/jenkins-test-harness-htmlunit-187.v1e8425eb_77c5.jar HTTP/1.1" 503 0 0.097 "-" "Apache-Maven/3.9.9 (Java 17.0.13; Linux 6.1.119-129.201.amzn2023.x86_64)" "-" "52.22.177.50:443 : 52.217.117.1:443" "302 : 503" "0.064 : 0.032"

The "302 : 503" "0.064 : 0.032" indicates that the request from ACP to Artifactory was redirected ("302") to another URL which is always an AWS S3 (as per JFrog setup) bucket hosting the binary to download. The redirected URL has a JWT token valid for 1 hour in the query string: it used to be a configuration issue (stripping the token) but it would be a HTTP/403 or HTTP/401 error (depending on the config error).

The HTTP/503 on AWS could be a "Slow Down" error: https://repost.aws/questions/QU_F-UC6-fSdOYzp-gZSDTvQ/receiving-s3-503-slow-down-responses.

@dduportal (Contributor) commented:

Update: while working on #4317, I also faced the same issue when testing the BOM builds on the new controller aws.ci.jenkins.io (aimed at replacing ci.jenkins.io ASAP).

I had to enable SNI passing to the upstream (jenkins-infra/kubernetes-management#6155) to get rid of this error in my tests.

Checking the Azure ACP logs (used in production today), I can find the same kind of SSL handshake errors during the BOM build failures.

=> It looks sensible to enable this on the current ACP in Azure as well.
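For reference, on the nginx side "SNI passing to upstream" boils down to two directives; a sketch, assuming ACP proxies to Artifactory over TLS (the actual change is in jenkins-infra/kubernetes-management#6155):

```nginx
# Sketch only: send the upstream host name in the TLS handshake (SNI),
# so the upstream front end can pick the right certificate/virtual host.
proxy_ssl_server_name on;            # enable SNI towards the upstream
proxy_ssl_name repo.jenkins-ci.org;  # name to send (defaults to $proxy_host)
```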

@dduportal (Contributor) commented Jan 28, 2025

jenkins-infra/kubernetes-management#6159 has been applied. @darinpope, feel free to reopen the issue if you see new ACP problems again.

Note: at FOSDEM 2025 we'll discuss the opportunity to use another caching system, but on the client side: https://plugins.jenkins.io/jobcacher/ was mentioned by @basil as a great opportunity.
That discussion should lead to a new issue if it looks like a good idea to work on.

@dduportal reopened this Jan 30, 2025
@alecharp commented:

In the situation we have here, running PCT for all the plugins, shouldn't the local Maven repository be the same for all the plugins? If that is correct, could a "pre-fetch" of the artifacts, caching them in a tarball, and using that tarball for all the plugins be enough?

@timja (Member) commented Jan 30, 2025

Mark attempted something similar in jenkinsci/bom#4095, but yes, that makes sense if we can stash and unstash the workspace for each build (potentially using a cloud plugin so we aren't creating a large amount of traffic between the controller and each agent?).

@basil (Collaborator) commented Jan 30, 2025

Sure, I was simply bringing up the default behavior of GitHub Actions to illustrate that the practice of caching workspaces as tarballs has been demonstrated to be practical at scale, something that has not been demonstrated with the artifact caching proxy approach.

@dduportal (Contributor) commented:

> In the situation we have here, running PCT for all the plugins, shouldn't the local Maven repository be the same for all the plugins? If that is correct, could a "pre-fetch" of the artifacts, caching them in a tarball, and using that tarball for all the plugins be enough?

> Mark attempted something similar in jenkinsci/bom#4095, but yes, that makes sense if we can stash and unstash the workspace for each build (potentially using a cloud plugin so we aren't creating a large amount of traffic between the controller and each agent?).

The initial idea (a tiny step) was to pre-fetch. Since the Artifact Caching Proxy has 2 replicas, it is clearly not enough. Note that this assumption was made with a set of ~350 parallel PCT tests in the BOM builds; I don't know the exact number now, but it is way higher.

> Sure, I was simply bringing up the default behavior of GitHub Actions to illustrate that the practice of caching workspaces as tarballs has been demonstrated to be practical at scale, something that has not been demonstrated with the artifact caching proxy approach.

@basil I really like the GitHub cache approach because it has a set of restrictions.

A few notes after a first quick check of the issues:

* I want to avoid any risk of cache poisoning, where anyone opening a PR on a plugin could add a malicious dependency to the cache.
* Looking at https://plugins.jenkins.io/jobcacher/, I'm not sure how cache access is managed. It looks like the cache is segregated by job and by branch, and a new branch can have its cache seeded from the default branch.
* Since it is provided by a pipeline step, we can assume that a cache-poisoning attempt could only come from a jenkinsci/ GitHub repository maintainer (as per our security rules). That looks good enough (as long as the cache cannot be written across branches or across jobs).

=> Did I understand correctly? Can I have another pair of eyes on my hypothesis, please?

@basil (Collaborator) commented Jan 30, 2025

I haven’t used Job Cacher. Another thing to keep in mind is that caching the results of a build of jenkinsci/bom isn’t necessarily sufficient for running PCT against a particular plugin. Building and running a plugin’s tests pulls in test-scoped dependencies that wouldn’t have necessarily been cached during a build of the BOM itself.

@basil (Collaborator) commented Jan 30, 2025

For what it’s worth, I have zero caching problems with local BOM and PCT builds. My local Maven cache (which is shared between all my builds) grows without bound until it gets too big, at which point I simply delete it and start over. The same philosophy could be applied to CI builds: additively aggregate the fetched artifacts from all completed builds to a central place, use it as the starting point for all new builds (transferred via tarball), and periodically delete it and start over when it gets too big. The main difference between this and an artifact caching proxy approach is the transfer mechanism: I am advocating for transferring the cache in a single tarball and TCP connection rather than creating hundreds of thousands of connections per build.
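A rough Pipeline sketch of that tarball-based transfer, assuming the aggregated repository is published at a central URL between builds (URL, paths, and the upload step are illustrative, not an existing job):

```groovy
// Sketch only: seed the local Maven repository from one tarball (one TCP
// connection) instead of fetching hundreds of thousands of artifacts,
// then repackage the possibly-grown repository for the next builds.
node {
    def cacheUrl = 'https://example.org/bom-cache/m2-repository.tar.gz' // hypothetical location

    checkout scm
    sh """
      mkdir -p .m2/repository
      curl -fsSL ${cacheUrl} | tar -xz -C .m2/repository || true  # tolerate a missing/expired cache
    """

    sh 'mvn -B -Dmaven.repo.local=$WORKSPACE/.m2/repository verify'

    // Re-publish the updated repository for later builds (upload step omitted).
    sh 'tar -cz -C .m2/repository . > m2-repository.tar.gz'
}
```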

@timja (Member) commented Jan 31, 2025

PoC PR up at jenkinsci/bom#4351

If we're happy to try it, then we would need some storage; I assume Azure for now, switching to AWS as part of the ci.jenkins.io move?

@dduportal (Contributor) commented:

> PoC PR up at jenkinsci/bom#4351
>
> If we're happy to try it, then we would need some storage; I assume Azure for now, switching to AWS as part of the ci.jenkins.io move?

Let's start with the controller storage (the default) on the Azure VM: fewer things to set up (we can increase the size and performance of the ci.jenkins.io data disk where JENKINS_HOME is).

@timja (Member) commented Jan 31, 2025

> Let's start with the controller storage (the default) on the Azure VM: fewer things to set up (we can increase the size and performance of the ci.jenkins.io data disk where JENKINS_HOME is).

I think the default storage uses remoting, which is probably not ideal, but we can try it out.

@basil (Collaborator) commented Jan 31, 2025

Additively aggregating the fetched artifacts as I described above sounds tricky to implement in a CI job. Since all PCT runs are running in parallel, it sounds difficult to find a way to cache the superset of all the artifacts they fetched. Perhaps they could each stash their .m2 directory, then unstash all of the stashes to a single place to combine them all together, and then cache that? Conceptually we should think about how to aggregate the artifacts fetched by all PCT runs, not just a build of the BOM itself.
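A rough Pipeline sketch of that aggregation, assuming each PCT branch keeps its local repository under `.m2/repository` in its workspace (the plugin list and the PCT command are illustrative):

```groovy
// Sketch only: each parallel PCT run stashes the artifacts it fetched,
// then a final stage unstashes everything into one workspace to build
// the superset repository that would be cached for the next builds.
def plugins = ['plugin-a', 'plugin-b']     // illustrative plugin list
def branches = [:]
plugins.each { plugin ->
    branches[plugin] = {
        node {
            sh "run-pct-for ${plugin}"      // hypothetical PCT invocation
            stash name: "m2-${plugin}", includes: '.m2/repository/**', allowEmpty: true
        }
    }
}
parallel branches

node {
    // Unstashing into the same workspace overlays the per-plugin repositories.
    plugins.each { unstash "m2-${it}" }
    sh 'tar -cz -C .m2/repository . > m2-superset.tar.gz'   // cache/publish this tarball
}
```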

@timja (Member) commented Jan 31, 2025

I had a chat with @dduportal earlier; we're going to try caching just the master branch first.

@dduportal (Contributor) commented:

> Additively aggregating the fetched artifacts as I described above sounds tricky to implement in a CI job. Since all PCT runs are running in parallel, it sounds difficult to find a way to cache the superset of all the artifacts they fetched. Perhaps they could each stash their .m2 directory, then unstash all of the stashes to a single place to combine them all together, and then cache that? Conceptually we should think about how to aggregate the artifacts fetched by all PCT runs, not just a build of the BOM itself.

I understand that we would need to set up a cache key for each stage based on its name (so it can get its cache from the previous master branch build).
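A sketch of what that could look like with the Job Cacher `cache` step, with one cache per PCT stage seeded from master; the parameter names are taken from the Job Cacher documentation as I recall it and the values are illustrative, so the PoC PR may well do this differently:

```groovy
// Sketch only: per-stage cache of the local Maven repository, restored from
// the master branch when the current branch has no cache of its own yet.
def stageName = 'pct-some-plugin'            // illustrative stage/plugin name
cache(maxCacheSize: 2048,                    // MB, illustrative
      defaultBranch: 'master',               // seed new branches from master
      caches: [
          arbitraryFileCache(
              path: '.m2/repository',                         // what to cache, relative to the workspace
              cacheName: stageName,                           // key the cache on the stage name
              cacheValidityDecidingFile: 'bom-weekly/pom.xml' // illustrative invalidation trigger
          )
      ]) {
    sh "run-pct-for ${stageName}"            // hypothetical PCT invocation
}
```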

@jonesbusy commented Feb 2, 2025

Sadly, I missed those discussions at the summit since I was in other groups.

We are using jobcacher in our company and it works well with the Artifactory backend. Also, two collaborators (including myself) are maintaining the jobcacher plugin, so I'm more than happy to help with any feature or fix needed on it.

One of the limitations is that jobcacher seems to time out for very large caches when they need more than 5 minutes for upload or download. It seems to be related to the CPS timeout:

jenkinsci/jobcacher-plugin#334

@timja (Member) commented Feb 2, 2025

Have a look through the PoC if you haven’t already: jenkinsci/bom#4351

@dduportal (Contributor) commented:

I've created #4525 to specifically track actions and discussions regarding Job Cacher.

@jglick commented Feb 10, 2025

> the practice of caching workspaces as tarballs has been demonstrated to be practical at scale, something that has not been demonstrated with the artifact caching proxy approach

FWIW, CloudBees runs a job vaguely analogous to bom in (PCT) scope and scale, which uses an artifact caching proxy rather than workspace caching. The services and infrastructure are of course pretty different.

@dduportal (Contributor) commented:

Closing this issue as discussed in the team meeting (18 Feb 2025) for the following reasons:

* The AWS ACP does seem to handle the waves of requests better. The ~15 BOM builds, including the one warming it up from scratch, did not trigger errors.
* Let's focus our efforts on [[ci.jenkins.io] Enable Maven dependencies client-side caching for BOM with Job Cacher #4525](https://github.com/jenkins-infra/helpdesk/issues/4525), which would benefit all BOM builds.

@dduportal (Contributor) commented:

> Closing this issue as discussed in the team meeting (18 Feb 2025) for the following reasons:
>
> * The AWS ACP does seem to handle the waves of requests better. The ~15 BOM builds, including the one warming it up from scratch, did not trigger errors.
> * Let's focus our efforts on [[ci.jenkins.io] Enable Maven dependencies client-side caching for BOM with Job Cacher #4525](https://github.com/jenkins-infra/helpdesk/issues/4525), which would benefit all BOM builds.

Looks like I was wrong: #4545 (comment)
