
Job cacher PoC #4351

Draft
wants to merge 23 commits into master
Conversation

@timja (Member) commented Jan 31, 2025

see jenkins-infra/helpdesk#4442 (comment)

Needs the jobcacher plugin installed and configured with either AWS or Azure Storage (https://plugins.jenkins.io/jobcacher-azure-storage/)

Testing done

Will test on this PR

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

@@ -0,0 +1,2 @@
Change this file if you need to invalidate the cache.
Member Author:

The cache will be invalidated either when it exceeds maxCacheSize or when someone changes this file, e.g. by incrementing the 1 to 2.

Jenkinsfile Outdated
cache(
// max cache size in MB, the cache will be reset after exceeding this size
maxCacheSize: 2048,
defaultBranch: 'master', caches: [
Member Author:

If the current branch has no cache, it will seed its cache from the specified branch. Leave empty to generate a fresh cache for each branch.
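
For readers unfamiliar with the step, a minimal sketch of how these parameters fit together in the jobcacher pipeline syntax; the path and validity file below are illustrative, not the exact values used in this PR:

cache(
    // cache is discarded once it grows past this size (in MB)
    maxCacheSize: 2048,
    // a branch with no cache yet seeds from this branch; leave empty for fully per-branch caches
    defaultBranch: 'master',
    caches: [
        arbitraryFileCache(
            // illustrative: cache the local Maven repository used by the build
            path: '.m2/repository',
            // illustrative: editing this file invalidates the cache
            cacheValidityDecidingFile: 'cache-version.txt'
        )
    ]
) {
    // build steps that read and refill the cached path run inside this block
}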

timja and others added 3 commits January 31, 2025 09:19
Co-authored-by: Damien Duportal <damien.duportal@gmail.com>
A comment from @timja was marked as resolved.

@timja (Member Author) commented Jan 31, 2025

Using mvn dependency:go-offline from the existing prep stage, it took:

  • 16 minutes to fill the cache initially
  • 101 seconds to create and upload the cache
  • 30 seconds to download the cache on a subsequent build

The cache for the prep stage is 2.4 GB.
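
For reference, roughly what that looks like in the prep stage; the stage name and paths are assumptions, not the exact Jenkinsfile:

stage('prep') {
    cache(maxCacheSize: 2048, defaultBranch: 'master', caches: [
        // assumed: cache the local Maven repository filled by dependency:go-offline
        arbitraryFileCache(path: '.m2/repository')
    ]) {
        // resolve every dependency up front so later stages can run without hitting Maven Central
        sh 'mvn -ntp -Dmaven.repo.local=.m2/repository dependency:go-offline'
    }
}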


When run on a per-plugin and per-line basis it takes up far more space:

On a build using only one line (with a few failures), 119 GB was cached: two tarballs were over 1 GB, and most were around 500 MB.

root@ci.jenkins.io:/var/lib/jenkins/job-cache$ du -h Tools/bom/PR-4351/cache/* | grep -v K
1.8G    Tools/bom/PR-4351/cache/5ec580390bdc7a61bca935a87a5335b0.tgz
2.4G    Tools/bom/PR-4351/cache/7c98e0830a2eec6c83499a675572912e.tgz

I don't think caching per line is practical, so I've removed that, although it's going to invalidate the cache.

@timja (Member Author) commented Feb 1, 2025

Updated results ^

@basil (Member) commented Feb 1, 2025

I don't think caching per line is practical

Sure. Even caching per repository, I would expect most of the tarballs to have similar contents. I'm not sure how practical it is to consolidate/deduplicate all the PCT tarballs into one at the end of the run, but that would be the most space-efficient.

@timja (Member Author) commented Feb 2, 2025

One branch failed with this on the latest run:

Found unhandled java.lang.InterruptedException exception:
java.base/java.lang.Object.wait(Native Method)
	hudson.remoting.Request.call(Request.java:179)
	hudson.remoting.Channel.call(Channel.java:1111)
	hudson.FilePath.act(FilePath.java:1228)
	hudson.FilePath.act(FilePath.java:1217)
	hudson.FilePath.exists(FilePath.java:1782)
	PluginClassLoader for jobcacher//jenkins.plugins.jobcacher.ArbitraryFileCache$SaverImpl.save(ArbitraryFileCache.java:376)
	PluginClassLoader for jobcacher//jenkins.plugins.jobcacher.CacheManager.save(CacheManager.java:98)
	PluginClassLoader for jobcacher//jenkins.plugins.jobcacher.pipeline.CacheStepExecution$ExecutionCallback.complete(CacheStepExecution.java:103)
	PluginClassLoader for

@timja (Member Author) commented Feb 2, 2025

The last one passed; it takes about 1 hr 40 min when everything is cached.

Is the naïve approach worth it given all the extra disk space?

Or do we try something else, such as stashing all repositories, aggregating them, and then caching that?

@basil (Member) commented Feb 3, 2025

I think it is definitely worth a shot unless it is prohibitively impractical.

@timja (Member Author) commented Feb 3, 2025

I think @dduportal you suggested creating another volume for the cache so it won’t fill up the main volume?

@dduportal (Contributor):
docker exec -ti jenkins df -h /var/jenkins_home
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 503G 320G 159G 67% /var/jenkins_home

I think @dduportal you suggested creating another volume for the cache so it won’t fill up the main volume?

After checking the impact on the controller metrics (see below), I believe we can get started "as is" in Azure. We'll use a dedicated disk in AWS though (I'll update the issue) and/or will switch to S3 buckets.

=> We see a really visible impact when writing the cache, but it's still within the usage boundaries of the current machine, so it's fine. Builds reading the cache have almost no impact.

(Screenshot: controller metrics, 2025-02-04 09:32)

Jenkinsfile Outdated
maxCacheSize: 3072,
defaultBranch: 'master',
// don't save pull requests, only cache on master branches
// skipSave: env.BRANCH_NAME != 'master',
Member Author:

Commit before merge

Suggested change
// skipSave: env.BRANCH_NAME != 'master',
skipSave: env.BRANCH_NAME != 'master',

Contributor:

This makes me think about reliability: does the build fail (on the master branch) if the cache fails to save, even though the whole BOM build works?

Member Author:

If there's an issue with the cache, no, it doesn't fail. If it fails to save, then yes; I did manage to trigger this exception in one build: #4351 (comment).

Member Author:

It looks like it fails if it takes 5 minutes or more to upload the cache.

Contributor:

🤔 Since the initial trigger of this effort (besides, of course, the technical efficiency and cost decrease) was to avoid slowing down or blocking the BOM team and releases, such a failure would force them to deal with re-triggering builds.

Do you think it could be feasible to have a separate build in charge of seeding the cache, keeping the BOM build (even on master) to only "read" the cache and decouple both?

Contributor:

My concern with that is the added costs for putting/retrieving from S3

If the S3 bucket is in the same region as the agents, then it costs no bandwidth. That is the case: we do not use multiple AWS regions.
=> We'll only pay for the storage cost, which is quite low.

The BOM is one of the rare builds on ci.jenkins.io that clearly benefits from caching: the combined savings of "EC2 machine minutes not needed thanks to the cache" (a gain of ~1 hour as per @timja's first tests) and "BOM maintainers not blocked and not requiring the infra team to restart builds" are clearly worth it!

Member Author:

Did you try compressionMethod: 'TAR_ZSTD'? It should provide much better performance for speed and compression

Trying.


Do you think it could be feasible to have a separate build in charge of seeding the cache, keeping the BOM build (even on master) to only "read" the cache and decouple both?

Hmm, possibly; maybe a parameterised build. It would need to skip the tests and just download the dependencies (but make sure it resolves them all?).

Likely doable, but perhaps we try without it and see how we go?

Contributor:

Likely do-able but potentially we try without and see how we go?

Fine by me if it is ok for @darinpope @alecharp @basil and @MarkEWaite

Member Author:

Did you try compressionMethod: 'TAR_ZSTD'? It should provide much better performance for speed and compression

Same problem: https://ci.jenkins.io/job/Tools/job/bom/job/PR-4351/17/pipeline-console/?start-byte=0&selected-node=4438#log-0

Member Author:

This didn't actually test it; I had compressionMethod in the wrong place. It's enabled now.
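
For the record (as far as I can tell from the plugin's pipeline syntax), compressionMethod is a parameter of the individual arbitraryFileCache entry rather than of the outer cache step, which makes it easy to put in the wrong place; a minimal sketch with an illustrative path:

cache(maxCacheSize: 3072, defaultBranch: 'master', caches: [
    arbitraryFileCache(
        // illustrative path
        path: '.m2/repository',
        // zstd is typically much faster than gzip for comparable compression ratios
        compressionMethod: 'TAR_ZSTD'
    )
]) {
    sh 'mvn -ntp -Dmaven.repo.local=.m2/repository dependency:go-offline'
}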

@basil (Member) commented Feb 4, 2025

How does the Job Cacher plugin handle concurrency? For example, if two different PCT branches from the same plugin repository but different lines (e.g., jenkinsci/text-finder-plugin on 2.492.x and on 2.479.x) try to update the cache at the same time, will one clobber the updates of the other?

@timja timja removed the weekly-test label Feb 4, 2025
@dduportal (Contributor):
How does the Job Cacher plugin handle concurrency? For example, if two different PCT branches from the same plugin repository but different lines (e.g., jenkinsci/text-finder-plugin on 2.492.x and on 2.479.x) try to update the cache at the same time, will one clobber the updates of the other?

I have the same question. That's why I proposed having a distinct "job" or "process" handle the write. It means less frequent updates, but it avoids the headache of concurrent writes.

@timja (Member Author) commented Feb 5, 2025

How does the Job Cacher plugin handle concurrency? For example, if two different PCT branches from the same plugin repository but different lines (e.g., jenkinsci/text-finder-plugin on 2.492.x and on 2.479.x) try to update the cache at the same time, will one clobber the updates of the other?

I have the same question. That's why I proposed having a distinct "job" or "process" handle the write. It means less frequent updates, but it avoids the headache of concurrent writes.

I've set it to only update cache on the weekly line.
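
A minimal sketch of that restriction; the branch comparison below is a placeholder, not the actual condition used to detect the weekly line:

// placeholder: treat master as the weekly line; every branch may read the cache, only the weekly line writes it
boolean weeklyLine = (env.BRANCH_NAME == 'master')
cache(
    maxCacheSize: 3072,
    defaultBranch: 'master',
    skipSave: !weeklyLine,
    caches: [arbitraryFileCache(path: '.m2/repository')]  // illustrative
) {
    // plugin build / PCT stages run here
}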

@basil (Member) commented Feb 5, 2025

I've set it to only update cache on the weekly line.

That would resolve the concurrency concern at the cost of not caching all of the dependencies consumed by tests on non-weekly lines, which is needed in order to avoid test flakiness when (not if) Maven Central happens to be slow or down during the time the non-weekly lines are being tested (such as during a BOM release).

@timja (Member Author) commented Feb 6, 2025

I've set it to only update cache on the weekly line.

That would resolve the concurrency concern at the cost of not caching all of the dependencies consumed by tests on non-weekly lines, which is needed in order to avoid test flakiness when (not if) Maven Central happens to be slow or down during the time the non-weekly lines are being tested (such as during a BOM release).

Yes, although I don't think this approach will scale given the extra disk space required. Non-weekly lines will generally have the same dependencies and far fewer changes, so they should be covered by ACP, I would think.

Otherwise we can try the central aggregation of all repositories at the end of a build, and then just archive that and use it as the cache.

@timja (Member Author) commented Feb 6, 2025

Even outside of Job Cacher I haven't been able to get a green build in the last few runs.

The previous one failed with an error from GitHub when fetching something (resolving tags, I think).
The one before that got an error from the yarn registry.

@basil (Member) commented Feb 6, 2025

Non weekly lines will generally be the same dependencies and will have much fewer changes so should be cached in ACP I would think.

Once a plugin is pinned, its dependency tree will start to diverge drastically from that of the same plugin on the weekly line. ACP does not work well for this use case, hence this PR.

@dduportal (Contributor):
Even outside of Job Cacher I haven't been able to get a green build in the last few runs.

The previous one failed with an error from GitHub when fetching something (resolving tags, I think); the one before that got an error from the yarn registry.

Today was a "Broken Internet Day": Cloudflare R2, Microsoft Azure, and Docker Hub all had major outages. That is most probably the cause of today's failures.

@dduportal (Contributor):
@basil @timja I have a feeling that Job Cacher is not the best fit for this (BOM) use case, just like ACP, as per the comments above (related to sharing or not sharing dependency sets between branches and builds). Of course I could have misunderstood (my English reading is sometimes not good enough).

I'm wondering about using a pod PVC in read-only mode, which would contain a build cache in the form of a tar archive.
If this file is found by the pipeline, it is untarred to $HOME/.m2/repository to get the current cache.

Cache seeding would be done by a regular custom build whose role would be to generate the dependencies for each cell of the build, aggregate them all (I thought of an rsync to avoid copying duplicated dependencies), and generate a new tar archive from the aggregated $HOME/.m2.

  • That would make sure the cache is shared between all builds of BOM (PRs, weekly, master and all other builds)
  • If the cache does not cover all the dependencies of a given build, ACP should still be usable to cover for the missing subset, like any other plugin build.
  • The safety (e.g. avoiding cache poisoning) of the process should be covered by the "seeder build"
  • Performance using an Azure Files PVC should be good as long as we stick to writing/reading a single tar archive. We could also do the same with an S3 bucket in EKS.

WDYT?
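
A rough sketch of the read side of that idea; the mount point, archive name, and PVC wiring are assumptions:

// hypothetical: a read-only PVC mounted at /cache contains a pre-seeded archive
def cacheArchive = '/cache/bom-m2.tar'
stage('Restore dependency cache') {
    sh """
        if [ -f '${cacheArchive}' ]; then
            mkdir -p "\$HOME/.m2/repository"
            tar -xf '${cacheArchive}' -C "\$HOME/.m2/repository"
        fi
    """
}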

@timja (Member Author) commented Feb 6, 2025

Should work, I think.

@dduportal (Contributor):
If no one objects, I'll implement (and test) the new caching on aws.ci.jenkins.io to avoid exhausting Azure credits

@jonesbusy (Contributor):
In case it is of interest (for the BOM or in the future), the 5-minute timeout was fixed and released in https://github.com/jenkinsci/jobcacher-plugin/releases/tag/636.v7b_3a_413b_b_5a_3, so it should not cause any more issues with large caches.

@timja (Member Author) commented Feb 12, 2025

In case it is of interest (for the BOM or in the future), the 5-minute timeout was fixed and released in https://github.com/jenkinsci/jobcacher-plugin/releases/tag/636.v7b_3a_413b_b_5a_3, so it should not cause any more issues with large caches.

Thanks. I think the main issue is that the BOM repeats artifacts across many stages but each stage also has some different ones. Ideally we want to be able to aggregate a single cache, which might be a few GB, rather than the 150+ GB per line it takes if we cache each repository individually.
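
A sketch of what the seeder/aggregation side could look like, following @dduportal's rsync idea above; the directory layout and archive name are assumptions:

stage('Aggregate cache') {
    sh '''
        # assumed layout: each repository's resolved dependencies were collected under repos/<name>/.m2/repository
        mkdir -p aggregated-m2
        for repo in repos/*/.m2/repository; do
            # merging into a single tree stores each artifact only once, even if many cells resolved it
            rsync -a "$repo"/ aggregated-m2/
        done
        tar -cf bom-m2.tar -C aggregated-m2 .
    '''
}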
