
Job cacher PoC #4351

Draft
wants to merge 23 commits into master
Conversation

@timja (Member) commented Jan 31, 2025

see jenkins-infra/helpdesk#4442 (comment)

Needs the jobcacher plugin installed and configured with either AWS or Azure Storage (https://plugins.jenkins.io/jobcacher-azure-storage/)

Testing done

Will test on this PR

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

@@ -0,0 +1,2 @@
Change this file if you need to invalidate the cache.
Member Author:

The cache will be invalidated either when it exceeds maxCacheSize or when someone changes this file, e.g. by incrementing the 1 to 2.

Jenkinsfile Outdated
cache(
// max cache size in MB, the cache will be reset after exceeding this size
maxCacheSize: 2048,
defaultBranch: 'master', caches: [
Member Author:

If the current branch has no cache, it will seed its cache from the specified branch. Leave empty to generate a fresh cache for each branch.
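
For readers unfamiliar with the step, a minimal sketch of how these parameters fit together in the jobcacher pipeline syntax; the path and validity file below are illustrative, not the exact values used in this PR:

cache(
    // cache is discarded once it grows past this size (in MB)
    maxCacheSize: 2048,
    // a branch with no cache yet seeds from this branch; leave empty for fully per-branch caches
    defaultBranch: 'master',
    caches: [
        arbitraryFileCache(
            // illustrative: cache the local Maven repository used by the build
            path: '.m2/repository',
            // illustrative: editing this file invalidates the cache
            cacheValidityDecidingFile: 'cache-version.txt'
        )
    ]
) {
    // build steps that read and refill the cached path run inside this block
}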

timja and others added 3 commits January 31, 2025 09:19
Co-authored-by: Damien Duportal <damien.duportal@gmail.com>
A comment from @timja was marked as resolved.

@timja (Member Author) commented Jan 31, 2025

Using mvn dependency:go-offline from the existing prep stage, it took:

  • 16 minutes to fill the cache initially
  • 101 seconds to create and upload the cache
  • 30 seconds to download the cache on a subsequent build

The cache for the prep stage is 2.4 GB.
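
For reference, roughly what that looks like in the prep stage; the stage name and paths are assumptions, not the exact Jenkinsfile:

stage('prep') {
    cache(maxCacheSize: 2048, defaultBranch: 'master', caches: [
        // assumed: cache the local Maven repository filled by dependency:go-offline
        arbitraryFileCache(path: '.m2/repository')
    ]) {
        // resolve every dependency up front so later stages can run without hitting Maven Central
        sh 'mvn -ntp -Dmaven.repo.local=.m2/repository dependency:go-offline'
    }
}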


When run on a per-plugin and per-line basis it takes up far more space:

On a build using only one line (with a few failures), 119 GB was cached: two tarballs were over 1 GB, and most were around 500 MB.

root@ci.jenkins.io:/var/lib/jenkins/job-cache$ du -h Tools/bom/PR-4351/cache/* | grep -v K
1.8G    Tools/bom/PR-4351/cache/5ec580390bdc7a61bca935a87a5335b0.tgz
2.4G    Tools/bom/PR-4351/cache/7c98e0830a2eec6c83499a675572912e.tgz

I don't think caching per line is practical, so I've removed that, although it's going to invalidate the cache.

@timja (Member Author) commented Feb 1, 2025

Updated results ^

@basil (Member) commented Feb 1, 2025

I don't think caching per line is practical

Sure. Even caching per repository, I would expect most of the tarballs to have similar contents. I'm not sure how practical it is to consolidate/deduplicate all the PCT tarballs into one at the end of the run, but that would be the most space-efficient.

@timja (Member Author) commented Feb 2, 2025

One branch failed with this on the latest run:

Found unhandled java.lang.InterruptedException exception:
java.base/java.lang.Object.wait(Native Method)
	hudson.remoting.Request.call(Request.java:179)
	hudson.remoting.Channel.call(Channel.java:1111)
	hudson.FilePath.act(FilePath.java:1228)
	hudson.FilePath.act(FilePath.java:1217)
	hudson.FilePath.exists(FilePath.java:1782)
	PluginClassLoader for jobcacher//jenkins.plugins.jobcacher.ArbitraryFileCache$SaverImpl.save(ArbitraryFileCache.java:376)
	PluginClassLoader for jobcacher//jenkins.plugins.jobcacher.CacheManager.save(CacheManager.java:98)
	PluginClassLoader for jobcacher//jenkins.plugins.jobcacher.pipeline.CacheStepExecution$ExecutionCallback.complete(CacheStepExecution.java:103)
	PluginClassLoader for

@timja (Member Author) commented Feb 2, 2025

The last one passed; it takes about 1 hr 40 min when everything is cached.

Is the naïve approach worth it given all the extra disk space?

Or do we try something else, such as stashing all repositories, aggregating them, and then caching that?

@basil (Member) commented Feb 3, 2025

I think it is definitely worth a shot unless it is prohibitively impractical.

@timja (Member Author) commented Feb 3, 2025

I think @dduportal you suggested creating another volume for the cache so it won’t fill up the main volume?

@dduportal (Contributor):
docker exec -ti jenkins df -h /var/jenkins_home
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 503G 320G 159G 67% /var/jenkins_home

I think @dduportal you suggested creating another volume for the cache so it won’t fill up the main volume?

After checking the impact on the controller metrics (see below), I believe we can get started "as is" in Azure. We'll use a dedicated disk in AWS though (I'll update the issue) and/or will switch to S3 buckets.

=> We see a really visible impact when writing the cache, but it's still within the usage boundaries of the current machine, so it's fine. Builds reading the cache have almost no impact.

(Screenshot: controller metrics, 2025-02-04 09:32)

Jenkinsfile Outdated
maxCacheSize: 3072,
defaultBranch: 'master',
// don't save pull requests, only cache on master branches
// skipSave: env.BRANCH_NAME != 'master',
Member Author:

Commit before merge

Suggested change
// skipSave: env.BRANCH_NAME != 'master',
skipSave: env.BRANCH_NAME != 'master',

Contributor:

This makes me think about reliability: does the build fail (on the master branch) if the cache fails to save, even though the whole BOM build works?

Member Author:

If there's an issue with the cache, no, it doesn't fail. If it fails to save, then yes; I did manage to trigger this exception in one build: #4351 (comment).

Member Author:

It looks like it fails if it takes 5 minutes or more to upload the cache.

Contributor:

🤔 Since the initial trigger of this effort (besides, of course, the technical efficiency and cost decrease) was to avoid slowing down or blocking the BOM team and releases, such a failure would force them to deal with re-triggering builds.

Do you think it could be feasible to have a separate build in charge of seeding the cache, keeping the BOM build (even on master) to only "read" the cache and decouple both?

Contributor:

My concern with that is the added costs for putting/retrieving from S3

If the S3 bucket is in the same region as the agents, then it costs no bandwidth. That is the case: we do not use multiple AWS regions.
=> We'll only pay for the storage cost, which is quite low.

The BOM is one of the rare builds on ci.jenkins.io that clearly benefits from caching: the combined savings of "EC2 machine minutes not needed thanks to the cache" (a gain of ~1 hour as per @timja's first tests) and "BOM maintainers not blocked and not requiring the infra team to restart builds" are clearly worth it!

Member Author:

Did you try compressionMethod: 'TAR_ZSTD'? It should provide much better performance for speed and compression

Trying.


Do you think it could be feasible to have a separate build in charge of seeding the cache, keeping the BOM build (even on master) to only "read" the cache and decouple both?

Hmm, possibly; maybe a parameterised build. It would need to skip the tests and just download the dependencies (but make sure it resolves them all?).

Likely doable, but perhaps we try without it and see how we go?

Contributor:

Likely do-able but potentially we try without and see how we go?

Fine by me if it is ok for @darinpope @alecharp @basil and @MarkEWaite

Member Author:

Did you try compressionMethod: 'TAR_ZSTD'? It should provide much better performance for speed and compression

Same problem: https://ci.jenkins.io/job/Tools/job/bom/job/PR-4351/17/pipeline-console/?start-byte=0&selected-node=4438#log-0

Member Author:

This didn't actually test it; I had compressionMethod in the wrong place. It's enabled now.
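
For the record (as far as I can tell from the plugin's pipeline syntax), compressionMethod is a parameter of the individual arbitraryFileCache entry rather than of the outer cache step, which makes it easy to put in the wrong place; a minimal sketch with an illustrative path:

cache(maxCacheSize: 3072, defaultBranch: 'master', caches: [
    arbitraryFileCache(
        // illustrative path
        path: '.m2/repository',
        // zstd is typically much faster than gzip for comparable compression ratios
        compressionMethod: 'TAR_ZSTD'
    )
]) {
    sh 'mvn -ntp -Dmaven.repo.local=.m2/repository dependency:go-offline'
}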

@basil (Member) commented Feb 4, 2025

How does the Job Cacher plugin handle concurrency? For example, if two different PCT branches from the same plugin repository but different lines (e.g., jenkinsci/text-finder-plugin on 2.492.x and on 2.479.x) try to update the cache at the same time, will one clobber the updates of the other?

@timja timja removed the weekly-test label Feb 4, 2025
@dduportal (Contributor):
How does the Job Cacher plugin handle concurrency? For example, if two different PCT branches from the same plugin repository but different lines (e.g., jenkinsci/text-finder-plugin on 2.492.x and on 2.479.x) try to update the cache at the same time, will one clobber the updates of the other?

I have the same question. That's why I proposed having a distinct "job" or "process" handle the write. It means less frequent updates, but it avoids the headache of concurrent writes.

@timja (Member Author) commented Feb 5, 2025

How does the Job Cacher plugin handle concurrency? For example, if two different PCT branches from the same plugin repository but different lines (e.g., jenkinsci/text-finder-plugin on 2.492.x and on 2.479.x) try to update the cache at the same time, will one clobber the updates of the other?

I have the same question. That's why I proposed having a distinct "job" or "process" handle the write. It means less frequent updates, but it avoids the headache of concurrent writes.

I've set it to only update cache on the weekly line.
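
A minimal sketch of that restriction; the branch comparison below is a placeholder, not the actual condition used to detect the weekly line:

// placeholder: treat master as the weekly line; every branch may read the cache, only the weekly line writes it
boolean weeklyLine = (env.BRANCH_NAME == 'master')
cache(
    maxCacheSize: 3072,
    defaultBranch: 'master',
    skipSave: !weeklyLine,
    caches: [arbitraryFileCache(path: '.m2/repository')]  // illustrative
) {
    // plugin build / PCT stages run here
}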

@basil (Member) commented Feb 5, 2025

I've set it to only update cache on the weekly line.

That would resolve the concurrency concern at the cost of not caching all of the dependencies consumed by tests on non-weekly lines, which is needed in order to avoid test flakiness when (not if) Maven Central happens to be slow or down during the time the non-weekly lines are being tested (such as during a BOM release).

@timja (Member Author) commented Feb 6, 2025

I've set it to only update cache on the weekly line.

That would resolve the concurrency concern at the cost of not caching all of the dependencies consumed by tests on non-weekly lines, which is needed in order to avoid test flakiness when (not if) Maven Central happens to be slow or down during the time the non-weekly lines are being tested (such as during a BOM release).

Yes, although I don't think this approach will scale given the extra disk space required. Non-weekly lines will generally have the same dependencies and far fewer changes, so they should be covered by ACP, I would think.

Otherwise we can try the central aggregation of all repositories at the end of a build, and then just archive that and use it as the cache.

@timja (Member Author) commented Feb 6, 2025

Even outside of Job Cacher I haven't been able to get a green build in the last few runs.

The previous one failed with an error from GitHub when fetching something (resolving tags, I think).
The one before that got an error from the yarn registry.

@basil (Member) commented Feb 6, 2025

Non weekly lines will generally be the same dependencies and will have much fewer changes so should be cached in ACP I would think.

Once a plugin is pinned, its dependency tree will start to diverge drastically from that of the same plugin on the weekly line. ACP does not work well for this use case, hence this PR.

@dduportal (Contributor):
Even outside of Job Cacher I haven't been able to get a green build in the last few runs.

The previous one failed with an error from GitHub when fetching something (resolving tags, I think); the one before that got an error from the yarn registry.

Today was a "Broken Internet Day": Cloudflare R2, Microsoft Azure, and Docker Hub all had major outages. That is most probably the cause of today's failures.

@dduportal (Contributor):
@basil @timja I have a feeling that Job Cacher is not the best fit for this (BOM) use case, just like ACP, as per the comments above (related to sharing or not sharing dependency sets between branches and builds). Of course I could have misunderstood (my English reading is sometimes not good enough).

I'm wondering about using a pod PVC in read-only mode, which would contain a build cache in the form of a tar archive.
If this file is found by the pipeline, it is untarred to $HOME/.m2/repository to get the current cache.

Cache seeding would be done by a regular custom build whose role would be to generate the dependencies for each cell of the build, aggregate them all (I thought of an rsync to avoid copying duplicated dependencies), and generate a new tar archive from the aggregated $HOME/.m2.

  • That would make sure the cache is shared between all builds of BOM (PRs, weekly, master and all other builds)
  • If the cache does not cover all the dependencies of a given build, ACP should still be usable to cover for the missing subset, like any other plugin build.
  • The safety (e.g. avoiding cache poisoning) of the process should be covered by the "seeder build"
  • Performance using an Azure Files PVC should be good as long as we stick to writing/reading a single tar archive. We could also do the same with an S3 bucket in EKS.

WDYT?
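
A rough sketch of the read side of that idea; the mount point, archive name, and PVC wiring are assumptions:

// hypothetical: a read-only PVC mounted at /cache contains a pre-seeded archive
def cacheArchive = '/cache/bom-m2.tar'
stage('Restore dependency cache') {
    sh """
        if [ -f '${cacheArchive}' ]; then
            mkdir -p "\$HOME/.m2/repository"
            tar -xf '${cacheArchive}' -C "\$HOME/.m2/repository"
        fi
    """
}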

@timja (Member Author) commented Feb 6, 2025

Should work, I think.

@dduportal (Contributor):
If no one objects, I'll implement (and test) the new caching on aws.ci.jenkins.io to avoid exhausting Azure credits

@jonesbusy (Contributor):
In case it is of interest (for the BOM or in the future), the 5-minute timeout was fixed and released in https://github.com/jenkinsci/jobcacher-plugin/releases/tag/636.v7b_3a_413b_b_5a_3, so it should not cause any more issues with large caches.

@timja (Member Author) commented Feb 12, 2025

In case it is of interest (for the BOM or in the future), the 5-minute timeout was fixed and released in https://github.com/jenkinsci/jobcacher-plugin/releases/tag/636.v7b_3a_413b_b_5a_3, so it should not cause any more issues with large caches.

Thanks. I think the main issue is that the BOM repeats artifacts across many stages but each stage also has some different ones. Ideally we want to be able to aggregate a single cache, which might be a few GB, rather than the 150+ GB per line it takes if we cache each repository individually.
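
A sketch of what the seeder/aggregation side could look like, following @dduportal's rsync idea above; the directory layout and archive name are assumptions:

stage('Aggregate cache') {
    sh '''
        # assumed layout: each repository's resolved dependencies were collected under repos/<name>/.m2/repository
        mkdir -p aggregated-m2
        for repo in repos/*/.m2/repository; do
            # merging into a single tree stores each artifact only once, even if many cells resolved it
            rsync -a "$repo"/ aggregated-m2/
        done
        tar -cf bom-m2.tar -C aggregated-m2 .
    '''
}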
