hydra: build CUDA packages for the CUDA team #1335

Merged (1 commit) on Jul 4, 2024

Conversation

zimbatm (Member) commented Jul 2, 2024

I don't know how many resources that would take, but we could give it a go.

CUDA packages are slow to build, and upstream doesn't cache them because it doesn't build unfree packages. This could help the team quite a bit.

terraform/hydra-projects.tf (outdated)
input {
name = "nixpkgs"
type = "git"
value = "https://github.com/NixOS/nixpkgs.git nixos-unstable"
Member Author

@SomeoneSerge do you want to use another branch, like a staging-cuda or something?

Contributor

We tried pushing big changes to cuda-updates first and building that branch before merging into master, but we kept coming back to targeting master directly. In our Hercules CI we build master + nixos-unstable + the latest release: this way, by the time CI starts a round of nixos-unstable, some of it will have been cached by the master job. This alleviates some of the pain of nixos-unstable advancing without testing CUDA.

Member Author

A related idea would be to pull from nix-community/nixpkgs and give the CUDA team access to it. That would limit the risks compared to giving everybody push access to NixOS/nixpkgs

Member Author

For now we're building nixos-unstable-small

zowoq (Contributor) commented Jul 2, 2024

> I don't know how many resources that would take, but we could give it a go.

No objection. Maybe if this ends up too much for our current hardware we could consider upgrading.

Mic92 (Member) commented Jul 3, 2024

> > I don't know how many resources that would take, but we could give it a go.
>
> No objection. Maybe if this ends up too much for our current hardware we could consider upgrading.

If you tell companies that all they need to do to get CUDA packages for NixOS is to pay some monthly donations, then I am sure we'd get the funding pretty quickly. Also, we still haven't asked Hetzner for the discount they are offering to the NixOS Foundation. This way we would probably still save money with bigger hardware.

input {
name = "supportedSystems"
type = "nix"
value = "[ \"x86_64-linux\" \"aarch64-linux\" \"aarch64-darwin\" ]"
Contributor

Linux only?

value             = "[ \"x86_64-linux\" \"aarch64-linux\" ]"

Member Author

For now it's "x86_64-linux" only since upstream has set that value as a default.

Contributor

Yes, I mean we should restrict it here to linux only as well.

Contributor

We support both x86_64 and aarch64 Linux; I'll update the release file.
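
For illustration, the Linux-only restriction discussed in this thread can be sketched as a small helper that filters a list of systems and renders the result as the Nix list literal passed in the `supportedSystems` input. This is a sketch only: the function names `linux_only` and `nix_string_list` are hypothetical and not part of the PR, and the escaping covers only backslashes and double quotes, which is enough for plain system names.

```python
def linux_only(systems):
    """Keep only the Linux systems from a list of Nix system strings."""
    return [s for s in systems if s.endswith("-linux")]


def nix_string_list(items):
    """Render a list of plain strings as a Nix list literal.

    Only backslashes and double quotes are escaped; Nix string
    interpolation is not handled because system names never contain it.
    """
    if not items:
        return "[ ]"
    quoted = " ".join(
        '"' + s.replace("\\", "\\\\").replace('"', '\\"') + '"' for s in items
    )
    return f"[ {quoted} ]"


# The three systems from the original input value in this PR.
systems = ["x86_64-linux", "aarch64-linux", "aarch64-darwin"]
print(nix_string_list(linux_only(systems)))
```

The printed value is the unescaped form of the string suggested in the review; in the Terraform file the inner double quotes still need to be escaped with backslashes.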

zimbatm (Member Author) commented Jul 3, 2024

terraform/hydra-projects.tf (outdated)
Contributor

@zowoq I'm considering just building the whole import <nixpkgs> { config.cudaSupport = true; } set; this gives me something like:

❯ nix-eval-jobs --expr 'import ./pkgs/top-level/release-cuda.nix { }' --force-recurse | wc -l
...
# eval errors, eval errors
...
138452

Does that sound unreasonable? I could in principle come up with a smaller, curated set of jobs.
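
A line count like the one above may mix successfully evaluated jobs with attributes that failed to evaluate. As a rough sketch, assuming the JSON-lines output format in which each record is one JSON object and a failed attribute carries an `error` field, the two can be separated as follows. The function name `summarize_eval` and the sample records are hypothetical and only serve as illustration.

```python
import json


def summarize_eval(lines):
    """Count evaluated jobs and evaluation errors in JSON-lines output.

    Assumes one JSON object per line, where a record with an "error"
    key represents an attribute that failed to evaluate.
    """
    ok, failed = 0, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if "error" in record:
            failed += 1
        else:
            ok += 1
    return ok, failed


# Made-up records for illustration; these are not real evaluation results.
sample = [
    '{"attr": "example-a", "drvPath": "/nix/store/example-a.drv"}',
    '{"attr": "example-b", "error": "evaluation failed"}',
    '{"attr": "example-c", "drvPath": "/nix/store/example-c.drv"}',
]
print(summarize_eval(sample))  # (2, 1)
```

Only the first number is a candidate for actual build load, and even that is an upper bound, since many of those jobs may already be available from a binary cache.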

Hexa also raises the concern that this would be effectively mirroring the NixOS Hydra:

hexa (UTC+1)
so all of them
if there was a cache behind nix-community hydra, then you'd be mirroring cache.nixos.org effectively
SomeoneSerge (UTC+3)
Yeah... Ideally we'd have a solution that evaluates the full DAGs for vanilla and CUDA nixpkgs, starts building CUDA from the leaves (ehhh, the roots), and always suspends the build if its hash matches the vanilla hash
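
The idea in the last chat message can be sketched in a few lines. This is a toy model, not an existing tool: each evaluation is represented as a dictionary from a derivation path to the derivations it depends on, and a derivation whose path also appears in the vanilla closure is treated as "upstream will build it". All names and paths below are hypothetical. It requires Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter


def closure(graph, roots):
    """Return every derivation reachable from the given roots."""
    seen = set()
    stack = list(roots)
    while stack:
        drv = stack.pop()
        if drv in seen:
            continue
        seen.add(drv)
        stack.extend(graph.get(drv, ()))
    return seen


def cuda_only_build_order(cuda_graph, cuda_roots, vanilla_graph, vanilla_roots):
    """List derivations that only the CUDA evaluation needs, dependencies first."""
    to_build = closure(cuda_graph, cuda_roots) - closure(vanilla_graph, vanilla_roots)
    # Restrict the dependency graph to the derivations we still have to build.
    sub = {
        drv: [dep for dep in cuda_graph.get(drv, ()) if dep in to_build]
        for drv in to_build
    }
    return list(TopologicalSorter(sub).static_order())


# Toy graphs with made-up derivation names.
vanilla_graph = {
    "torch.drv": ["python.drv"],
    "python.drv": ["gcc.drv"],
    "gcc.drv": [],
}
cuda_graph = {
    "torch-cuda.drv": ["cudatoolkit.drv", "python.drv"],
    "cudatoolkit.drv": ["gcc.drv"],
    "python.drv": ["gcc.drv"],
    "gcc.drv": [],
}
order = cuda_only_build_order(cuda_graph, ["torch-cuda.drv"], vanilla_graph, ["torch.drv"])
print(order)  # ['cudatoolkit.drv', 'torch-cuda.drv']
```

The sketch ignores timing: it does not address the case where the community builder reaches a shared derivation before upstream has built it.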

Contributor

> import <nixpkgs> { config.cudaSupport = true; }

TBH I doubt we have the capacity to handle that without it being detrimental to the other projects using this CI. Currently we only have two Linux builders, and they are shared by buildbot, hercules, and hydra:

### `build03`

- Provider: Hetzner
- CPU: AMD Ryzen 9 3900 12-Core Processor
- RAM: 128GB DDR4 ECC
- Drives: 2 x 1.92 TB NVME in RAID 1

### `build04`

- Provider: Hetzner
- Instance type: [RX170](https://www.hetzner.com/dedicated-rootserver/rx170)
- CPU: Ampere Altra Q80-30 80-Core Processor
- RAM: 128GB DDR4 ECC
- Drives: 2 x 960 GB NVME in RAID 0

If you do want to build everything, maybe we could have dedicated machines just for this package set? I'm not sure whether raising the money for that is feasible.

> if there was a cache behind nix-community hydra, then you'd be mirroring cache.nixos.org effectively

Not sure if I've misunderstood, but we push everything to cachix and it skips paths that already exist in the NixOS cache.
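
To illustrate what "skipping existing cache paths" can look like, here is a minimal sketch that is not cachix's actual implementation. It relies on the binary cache convention that a store path `/nix/store/<hash>-<name>` is described by `<cache-url>/<hash>.narinfo`. The helper names (`narinfo_url`, `http_is_cached`, `paths_to_push`) and the example store path are hypothetical.

```python
import re
import urllib.error
import urllib.request

# 32 characters from the Nix base-32 alphabet (no e, o, t, u), then "-name".
STORE_PATH = re.compile(r"^/nix/store/([0-9a-df-np-sv-z]{32})-[^/]+$")


def narinfo_url(store_path, cache="https://cache.nixos.org"):
    """Map a store path to the URL of its .narinfo file in a binary cache."""
    match = STORE_PATH.match(store_path)
    if match is None:
        raise ValueError(f"not a valid store path: {store_path!r}")
    return f"{cache}/{match.group(1)}.narinfo"


def http_is_cached(url, timeout=10):
    """Return True if the cache answers a HEAD request for the .narinfo with 200."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise


def paths_to_push(store_paths, is_cached=http_is_cached):
    """Keep only the store paths that the upstream cache does not already have."""
    return [p for p in store_paths if not is_cached(narinfo_url(p))]


# Hypothetical store path; the hash is a placeholder, not a real package.
example = "/nix/store/0123456789abcdfghijklmnpqrsvwxyz-hello-2.12"
print(narinfo_url(example))
```

Passing a different `is_cached` callable makes the filter usable offline, for example against a pre-fetched set of known hashes.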

Contributor

> Not sure if I've misunderstood, but we push everything to cachix and it skips paths that already exist in the NixOS cache.

The concern is that if there is a phase shift between NixOS and the Community Hydras, and the latter starts building a certain derivation from a certain commit before the former, we'll have wasted some storage and compute

SomeoneSerge (Contributor), Jul 4, 2024

> TBH I doubt we have the capacity to handle that without it being detrimental to the other projects using this CI. Currently we only have two Linux builders, and they are shared by buildbot, hercules, and hydra:

Roger that. I'll push a smaller jobset tomorrow, based on what we've been building in https://github.com/SomeoneSerge/nixpkgs-cuda-ci.

How is the community builder funded? @ConnorBaker was asking on Matrix whether there's an opencollective.

zowoq (Contributor), Jul 4, 2024

> The concern is that if there is a phase shift between NixOS and the Community Hydras, and the latter starts building a certain derivation from a certain commit before the former, we'll have wasted some storage and compute

Yeah, we can't really avoid that with this approach. We could try adding the free deps as blockers for nixos-unstable, or do something similar in a repo here: flake update PRs built with max-jobs = 0, so merging is blocked if the paths aren't cached?

> How is the community builder funded? ConnorBaker was asking on matrix if there's an opencollective

We have an opencollective: https://opencollective.com/nix-community

@ConnorBaker

> > > I don't know how many resources that would take, but we could give it a go.
> >
> > No objection. Maybe if this ends up too much for our current hardware we could consider upgrading.
>
> If you tell companies that all they need to do to get CUDA packages for NixOS is to pay some monthly donations, then I am sure we'd get the funding pretty quickly. Also, we still haven't asked Hetzner for the discount they are offering to the NixOS Foundation. This way we would probably still save money with bigger hardware.

They offer discounts? My Hetzner bill for part of the CUDA CI is like $400; I’d love to consolidate some of that stuff under the community, especially if you can get a discount and we can all benefit from it!

zimbatm (Member Author) commented Jul 4, 2024

Merging the current state. We can still do follow-up PRs afterwards!

@zimbatm zimbatm added this pull request to the merge queue Jul 4, 2024
Merged via the queue into master with commit 4c757f9 Jul 4, 2024
38 checks passed
@zimbatm zimbatm deleted the hydra-nixpkgs-cuda branch July 4, 2024 10:53
zimbatm (Member Author) commented Jul 5, 2024

> They offer discounts? My Hetzner bill for part of the CUDA CI is like $400; I’d love to consolidate some of that stuff under the community, especially if you can get a discount and we can all benefit from it!

Yes, but they ran out of the discount budget for this year. We'll have to contact them again next year.

zimbatm (Member Author) commented Jul 5, 2024

If you want to donate hardware to the cause, we are discussing what the requirements would be in #1343

zowoq (Contributor) commented Jul 5, 2024

> My Hetzner bill for part of the CUDA CI is like $400

Could you go into some detail please? What hardware, what is built and what is the utilization like?

zowoq (Contributor) commented Jul 15, 2024

I've reverted this as it had been interfering with our other CI builds.

I'll see if I can find a way of running these builds without causing problems for our other users.

ryan4yin added a commit to ryan4yin/nix-config that referenced this pull request Nov 11, 2024
Enkaiyuegure added a commit to Enkaiyuegure/flakes that referenced this pull request Nov 18, 2024
- Add pre-commit-hooks.cachix.org
- Add cache.lix.system
- Delete cuda cachix due to it is cached at nix-community.cachix.org now -- nix-community/infra#1335