Policy for third-party hardware donation #1343
Anything exotic will be a problem for the Hercules agent, as it is Haskell. Just using it as a remote builder for buildbot/Hydra should mean the cache key isn't an issue?
We would still need to trust the build results.
I don't understand your point. Isn't trusting the build results a given?
I think we should communicate how builders for different architectures are secured, e.g. Hetzner will have safer access policies than machines in someone's basement. Then users can decide if they are OK with this.
Remote builders sound good. One requirement could be that we are the only admins on the machine. It doesn't prevent physical tampering, but it reduces the attack surface if the hosting provider gets hacked.
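For concreteness, a minimal sketch of what declaring such a donated machine as a remote builder could look like on a NixOS client. The hostname, user, and key path below are placeholders, not the actual infra config:

```nix
{
  # Hypothetical client-side declaration of a donated remote builder.
  # "builder.example.org", the user, and the key path are placeholders.
  nix.distributedBuilds = true;
  nix.buildMachines = [{
    hostName = "builder.example.org";
    system = "aarch64-linux";
    sshUser = "nixremote";
    sshKey = "/etc/nix/remote-builder-key";
    maxJobs = 4;
    speedFactor = 2;
    supportedFeatures = [ "big-parallel" ];
  }];
}
```

Restricting who is an admin on the machine would be a policy on top of this; the builder itself only needs the unprivileged SSH user that the client connects as.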
I’m not familiar with how the infrastructure you’ve set up for builds vs. caching works, but one of the concerns I’ve had when trying to stand up infrastructure for consistently building CUDA packages and serving a binary cache is that there can be a lot of traffic between nodes. It was enough of a bottleneck between the three desktops in my basement that I moved everything over to 10GbE networking, and I’m still saturating it. I don’t know if the remote build protocol takes closure size or data locality into account when deciding which machines should build what, but there can be a lot of movement on the network, which can be a bottleneck or, in the case of cloud providers, a hefty egress fee. So a couple of questions from me:
I’d love to learn more about any of the challenges you all have faced setting up and maintaining this infrastructure!
No, it doesn't.
No.
Yes. 1 TB, sponsored by Cachix.
Yes, our Linux machines are all in HEL1, though that isn't exactly intentional; we usually choose based on price. We also have two macOS builders in FSN1.
Haven't used them. IIRC there was some discussion about Azure stuff in the NixOS org, maybe with the infra team or the foundation?
No.
I don't think we've had any real technical challenges so far; we've basically just been limited by funding. Once we started the Open Collective we expanded as the funding increased. Building CUDA, ROCm, etc. is probably going to be the first time we've really needed to give thought to some of these topics.
See also https://docs.hetzner.com/robot/general/traffic/. Each server comes with 10TB of egress.
No, for 1 Gbit links and physical machines it's unmetered. The traffic limit only applies to VMs.
To go back on topic, I am recapping this to:
It could be added to the donation page if we agree on this. As a personal note, I would love to see GPU and MIPS hardware.
In the case of esoteric hardware, we also need one or more people doing the work to fix Nixpkgs so that we can keep the machine up to date.
MIPS seems a bit dead at this point; even https://mips.com/ now sells riscv64 cores instead. Some simple GPU runner would be nice for integration tests.
Continuing from #1335 (comment): @zowoq sorry for the late reply, I'm moving soon and organizing everything has been quite an experience.

Hetzner

A more accurate amount on the Hetzner side: 315 euros/month (though it could be less at the moment). I have one RX170 instance (hereafter …) and one AX102 instance (hereafter …). I only recently set up monitoring. Hercules CI runs on a schedule, so load increases due to CI starting and load decreases due to CI finishing are somewhat predictable (meaning we could benefit from ephemeral instances).

Local machines

My desktop builders are three machines I have in my basement, all running NixOS. I use them every day for work on Nixpkgs (lots of …).

Common specs:
Questions

Since the majority of the load on these machines is my running … As an example, for each of my PRs, I try to run the following:
That's seven instances; luckily I use a MacBook Pro as my laptop and have a Jetson, so I can build for both of those platforms. But I'm curious how others handle this! I built all my local machines because I need them to speed up …
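For readers wondering what targeting that many platforms from one expression can look like, here is a minimal flake sketch that instantiates one package per target system. The systems list and `pkgs.hello` are illustrative stand-ins, not the actual workload:

```nix
{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";

  outputs = { self, nixpkgs }:
    let
      # Subset of the target platforms discussed above; extend as needed.
      systems = [
        "x86_64-linux" "aarch64-linux"
        "x86_64-darwin" "aarch64-darwin"
      ];
      # Apply one package definition to each system's package set.
      forAllSystems = f:
        nixpkgs.lib.genAttrs systems
          (system: f nixpkgs.legacyPackages.${system});
    in {
      packages = forAllSystems (pkgs: {
        default = pkgs.hello; # stand-in for the package under test
      });
    };
}
```

Each `nix build .#packages.<system>.default` invocation can then be dispatched to a matching local or remote builder.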
Hmm, the first thing I disliked about cuda-maintainers.cachix.org, and the reason I have been reluctant to advertise it, is that we couldn't publish a clear answer to the question "who has access to the signing keys?" ("who can push to cachix?"). The cloud reliance is surely unnecessary, expensive, and annoying, but it makes the billing easy and alleviates the need to expose the keys to more than a few parties. What I'd like as a consumer is for somebody foundation/"association"/"community"-aligned to maintain a physical build farm that one could physically donate hardware to (by parcel), with a transparent policy wrt keys and isolation on the builders ("by consuming from this cache you trust the nvidia kernel modules that run on some of the builders, you trust the people on this list who can ssh, you trust the nation state in whose jurisdiction the farm is hosted"). AFAICT nobody on the CUDA team currently has the capacity to spin up something like that.
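To make that trust statement concrete, this is roughly what the consumer side of such a cache looks like in a NixOS config. The cuda-maintainers key below is a placeholder; only the cache.nixos.org entry is the real public key:

```nix
{
  # Consuming a binary cache means pinning exactly two things:
  # the substituter URL and the public half of its signing key.
  nix.settings = {
    substituters = [
      "https://cache.nixos.org"
      "https://cuda-maintainers.cachix.org"
    ];
    trusted-public-keys = [
      "cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY="
      "cuda-maintainers.cachix.org-1:<public-key-placeholder>" # placeholder
    ];
  };
}
```

Whoever holds the private half of that key can make clients accept arbitrary build results, which is exactly why "who has access to the signing keys?" needs a published answer.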
Not sure we would actually need a 10G uplink; quite a lot of the NixOS infra also works fine with 1G. I think upgrading one of our x86 NixOS builders to an AX162-R might give us enough horsepower.
Did you read my comment?
This sounds like adding more hardware, while being a bit wasteful, would also fix the issue. Or is there something else blocking?
We discussed this a few days ago.
@Mic92 These seem to have a long wait time (weeks/months) and possibly some stability issues (according to Reddit). WDYT about a server auction EPYC 7502P (32 cores), 256 GB RAM, 2 × 1.92 TB? I'd propose doing a hardware upgrade/shuffle like we did last year: cancel the ax41 (build01), move build03 -> build01, and add the EPYC as build03.
Zen3 CPUs also sound fine. What is the price point?
Same amount of RAM as we have currently and 4 extra cores?
We currently have 12 cores in build03, so it's 20 extra cores.
It actually shows here that the ax162-r would be available in a few minutes if we order in Germany: https://www.hetzner.com/dedicated-rootserver/ax162-r/configurator/#/check-availability. It's still cheaper than the ax162 while giving us 48 cores and DDR5.
Oops, yes, I mixed up looking at the lower-spec AMDs.
Are you sure it will actually be available in a few minutes?
No, not sure, but do we lose much if we have to wait a bit? The performance difference looks significant for the same price. I am wondering where the "available in a few minutes" comes from then; I would expect this to be an automated process, judging by the duration.
Sometimes it's easier for organizations or individuals to lend out hardware (rather than donate via Open Collective). There is an opportunity to gain access to build capacity and to different kinds of hardware (e.g., GPU, RISC-V, MIPS, ...).
Before pursuing this, let's discuss what that would look like.
What are the requirements on our side?
Some threads: