Instability of artifact-caching-proxy #4442
Starting to analyse logs on the ACP side.
For each of the failing requests found in the past 15 days (including each one you folks logged), ACP reported an error caused by the upstream, in the following categories:
We also had 1 occurrence.
=> The errors are definitely not due to an ACP problem. By design, it "reports" the error. => We could check whether we can "retry" the upstream in case of error; I need to recall which cases could be caught.
@MarkEWaite opened a PR, based on a discussion we had during the previous infra meeting: jenkinsci/bom#4095. The goal is to "pre-heat" the cache to decrease the probability of facing these issues.
I haven't heard about any ACP problem with the BOM since the "pre-heat" PR was merged. Of course there might have been some (I have not looked with due diligence). Were there any issues in the past 3 weeks @basil @darinpope @Poddingue @MarkEWaite @alecharpentier? For info, this issue is stale until we've finished migrating ci.jenkins.io to AWS (see #4313), which implies a new ACP instance (in a new infra).
Update: @darinpope reported having suffered from ACP cache issues last Friday (https://ci.jenkins.io/job/Tools/job/bom/job/master/3786). We can correlate the errors to ACP error messages in Datadog. All the errors are mapped to HTTP/503 responses from AWS S3.
The HTTP/503 on AWS could be a "Slow Down" error: https://repost.aws/questions/QU_F-UC6-fSdOYzp-gZSDTvQ/receiving-s3-503-slow-down-responses.
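If the 503s really are S3 throttling, one possible client-side mitigation (purely a sketch, not something currently in the BOM pipeline; agent label, stage name, and Maven goals are placeholders) would be to wrap the Maven invocation in Jenkins' built-in `retry` step so a transient "Slow Down" only fails the build after several attempts:

```groovy
node('maven') {
    stage('build-with-retry') {
        checkout scm
        // Retry the whole Maven run up to 3 times if a transient upstream
        // error (e.g. an S3 "Slow Down" surfacing as HTTP/503) fails it.
        retry(3) {
            sh 'mvn -B -ntp clean verify'
        }
    }
}
```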
Update: while working on #4317, I also faced the same issue when testing the BOM builds on the new controller aws.ci.jenkins.io (aimed at replacing ci.jenkins.io ASAP). I had to enable SNI passing to the upstream (jenkins-infra/kubernetes-management#6155) to get rid of this error in my tests. Checking the Azure ACP logs (used in production today), I can find the same kind of SSL handshake errors during the BOM build failures. => It looks sane to enable this on the current ACP in Azure.
jenkins-infra/kubernetes-management#6159 has been applied. @darinpope, I'll let you reopen the issue if you see new ACP problems again. Note: we'll discuss at FOSDEM 2025 the opportunity to use another caching system, but on the client side: https://plugins.jenkins.io/jobcacher/ has been mentioned by @basil as a great opportunity.
In the situation we have here, running PCT for all the plugins, shouldn't the local Maven repository be the same for all the plugins? If that is correct, could a "pre-fetch" of the artifacts, caching them in a tarball and using that tarball for all the plugins, be enough?
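A rough sketch of that idea, assuming a hypothetical pipeline layout (the paths, agent labels, and Maven goals below are illustrative, not the real BOM Jenkinsfile): resolve the shared dependencies once, tar the local Maven repository, and hand that tarball to every PCT run.

```groovy
node('maven') {
    stage('pre-fetch') {
        checkout scm
        // Populate an isolated local repository with the shared dependencies.
        sh 'mvn -B -ntp -Dmaven.repo.local=.m2-cache dependency:go-offline'
        sh 'tar -czf m2-cache.tgz .m2-cache'
        stash name: 'm2-cache', includes: 'm2-cache.tgz'
    }
}

// Each PCT run would then start from the warm cache instead of hitting ACP.
node('pct') {
    stage('pct-with-warm-cache') {
        checkout scm
        unstash 'm2-cache'
        sh 'tar -xzf m2-cache.tgz'
        sh 'mvn -B -ntp -Dmaven.repo.local=.m2-cache verify'
    }
}
```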
Mark attempted something similar in jenkinsci/bom#4095, but yes, that makes sense if we can stash and unstash the workspace for each build (potentially using a cloud plugin so we aren't creating a large amount of traffic between the controller and each agent?).
Sure, I was simply bringing up the default behavior of GitHub Actions to illustrate that the practice of caching workspaces as tarballs has been demonstrated to be practical at scale, something that has not been demonstrated with the artifact caching proxy approach.
The initial idea (a tiny step) was to pre-fetch. Since the Artifact Caching Proxy has 2 replicas, that is clearly not enough. Note this assumption was made with a set of ~350 parallel PCT tests in the BOM builds. I don't know the exact number now, but it is way higher.
@basil I really like the GitHub cache approach because it has a set of restrictions. I would want to avoid any risk of cache poisoning, where anyone opening a PR on a plugin could add a malicious dependency to the cache. A few notes after a first quick check of the issue:
=> Did I understand correctly? Can I have another pair of eyes on my hypothesis, please?
I haven't used Job Cacher. Another thing to keep in mind is that caching the results of a build of
For what it's worth, I have zero caching problems with local BOM and PCT builds. My local Maven cache (which is shared between all my builds) grows without bound until it gets too big, at which point I simply delete it and start over. The same philosophy could be applied to CI builds: additively aggregate the fetched artifacts from all completed builds to a central place, use it as the starting point for all new builds (transferred via tarball), and periodically delete it and start over when it gets too big. The main difference between this and an artifact caching proxy approach is the transfer mechanism: I am advocating for transferring the cache in a single tarball and TCP connection rather than creating hundreds of thousands of connections per build.
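One way to picture the "additively aggregate and periodically reset" lifecycle described above, as a purely illustrative sketch (the seed URL is a placeholder and archiveArtifacts stands in for whatever "central place" would actually be used):

```groovy
node('maven') {
    stage('build-with-aggregated-cache') {
        checkout scm
        // Seed the local repository from the previous aggregate, if any.
        // https://example.invalid/... is a placeholder for "a central place".
        sh 'curl -sfL https://example.invalid/m2-seed.tgz | tar -xzf - || true'
        // Maven adds any newly fetched artifacts to the same local repository.
        sh 'mvn -B -ntp -Dmaven.repo.local=.m2-seed verify'
        // Publish the grown repository as the new seed; a real setup would
        // push it to shared storage rather than the build's own artifacts.
        sh 'tar -czf m2-seed.tgz .m2-seed'
        archiveArtifacts artifacts: 'm2-seed.tgz'
        // A periodic cleanup job (not shown) would delete the seed once it
        // grows too big, as described for the local workflow above.
    }
}
```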
PoC PR up at jenkinsci/bom#4351. If we're happy to try it, then we would need some storage; I assume Azure for now, switching to AWS as part of the ci.jenkins.io move?
Let's start with controller storage (the default) on the Azure VM: fewer things to set up (we can increase the size and performance of the data disk of ci.jenkins.io where JENKINS_HOME is).
I think default storage uses remoting, which is probably not ideal? But we can try it out.
Additively aggregating the fetched artifacts as I described above sounds tricky to implement in a CI job. Since all PCT runs are running in parallel, it sounds difficult to find a way to cache the superset of all the artifacts they fetched. Perhaps they could each stash their
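If that turned out to be workable, each parallel PCT branch could stash its own local repository under a unique name and a follow-up stage could merge the stashes into one superset. A minimal sketch, with hypothetical plugin names, agent labels, and paths:

```groovy
def plugins = ['pluginA', 'pluginB']   // placeholders for the real PCT list
def branches = [:]
plugins.each { plugin ->
    branches[plugin] = {
        node('pct') {
            checkout scm
            // Each run keeps its artifacts in its own local repository.
            sh "mvn -B -ntp -Dmaven.repo.local=.m2-${plugin} verify"
            stash name: "m2-${plugin}", includes: ".m2-${plugin}/**", allowEmpty: true
        }
    }
}
parallel branches

node('maven') {
    stage('merge-caches') {
        // Unstash everything side by side and pack the superset for reuse.
        plugins.each { plugin -> unstash "m2-${plugin}" }
        sh 'tar -czf m2-superset.tgz .m2-*'
    }
}
```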
Had a chat with @dduportal before; we're going to try first with caching just the master branch.
I understand that we would need to set up a cache key for each stage based on its name (so it can get its cache from previous
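Something like the following jobcacher sketch (assuming the jobcacher plugin is available; the path, size limit, stage name, and Maven goals are illustrative, not values from the PoC PR) would give each stage its own cache simply by keying the cached path on the stage name:

```groovy
node('maven') {
    stage('pct-pluginA') {
        checkout scm
        // One cached directory per stage: the path embeds the stage name, so
        // each stage restores and saves its own cache independently.
        def repo = ".m2-${env.STAGE_NAME}"
        cache(maxCacheSize: 1024, caches: [arbitraryFileCache(path: repo)]) {
            sh "mvn -B -ntp -Dmaven.repo.local=${repo} verify"
        }
    }
}
```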
Sadly I missed those discussions at the summit since I was in other groups. We are using jobcacher in our company and it works well with an Artifactory backend. Also, 2 collaborators are maintaining the jobcacher plugin (including myself), so I'm more than happy to help with any feature or fix needed on the jobcacher plugin. One limitation is that jobcacher seems to time out for very large caches when they need more than 5 minutes for upload or download. It seems to be related to the CPS timeout.
Have a look through the PoC if you haven't already: jenkinsci/bom#4351
I've created #4525 to specifically track actions and discussions regarding the job cacher.
FWIW CloudBees runs a job vaguely analogous to
Closing this issue as discussed in the team meeting (18 Feb 2025) for the following reasons:
Looks like I was wrong: #4545 (comment)
Service(s)
Artifact-caching-proxy
Summary
Bruno had to run the weekly BOM release process five times today (2024-12-06) because of errors like the following:
```
Could not transfer artifact com.google.crypto.tink:tink:jar:1.10.0 from/to azure-aks-internal (http://artifact-caching-proxy.artifact-caching-proxy.svc.cluster.local:8080/): Premature end of Content-Length delimited message body (expected: 2,322,048; received: 1,572,251)
```
Here's the issue where he tracked the build numbers so you can see the specific failures:
jenkinsci/bom#4066
I also had similar issues doing a BOM `weekly-test` against a core RC that I'm working on:

Since I started working on BOM over the past couple of months, this problem seems to be getting worse/more unstable as the weeks progress.
Reproduction steps
Unfortunately, it is not reproducible on demand.