CUDA Build tracker #52
Mark, could you please explain the process briefly? I was under the impression that it's not allowed to upload manually built packages to conda-forge. |
Upload them to your own public anaconda channel. I kinda want to merge the MKL migration first. |
This approach seems a little fragile, and more work than needed, especially in the long term. There are 16 builds here, and one build takes 15-25 minutes on a decent build machine, so all binaries can be built in 4-6 hours. How about writing a reproducible and well-documented build script and letting a single person with access to a good build server build everything at once? |
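Such a script could be as simple as looping over the feedstock's CI variant configs. A minimal sketch, assuming the standard conda-forge `build-locally.py` helper and that the CUDA variant files live under `.ci_support/` (the glob pattern is illustrative):

```bash
#!/usr/bin/env bash
# Sketch: build every Linux CUDA variant sequentially on one machine.
set -euo pipefail

for cfg in .ci_support/linux_64_*cuda*.yaml; do
    name=$(basename "${cfg}" .yaml)
    echo "=== Building variant: ${name} ==="
    python build-locally.py "${name}"   # runs the variant inside its Docker image
done
```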
15-25 mins.... what kind of machine do you have access to? |
^^^^ It's kinda a serious question. I'm genuinely interested in knowing. |
Desktop 12-core / 32 GB. And we have a few 32-core / 128 GB dev servers. |
If that script is all there is to it, that's quite nice. Also let's make sure it doesn't blow up disk space completely. Are they all separate Docker images, and how much space does one take? |
The Docker images do take space. My build machine's root storage is full and Docker is complaining about disk space for me. |
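For reference, Docker itself can report and reclaim that space with the standard CLI (nothing feedstock-specific here):

```bash
docker system df         # show disk usage by images, containers, volumes and build cache
docker system prune -a   # remove unused images/containers/cache (prompts for confirmation)
```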
@rgommers I'm not really sure what happened but my build time was closer to 8 hours on a |
That seems really long. There is a lot of stuff to turn on and off, so I probably cheated here by turning a few of the expensive things off, in particular using …. The >1 hr one is the mobile build. Regular Linux builds are in the 20-50 min range. 8 hours doesn't sound right; something must be misconfigured for you. |
Thanks for the info. I think I'm not using all CPUs; I can see that only 2 are being used. I probably need to pass another environment variable through. I'll have to see how I can do that. |
I have started one build with |
MAX_JOBS is set from CPU_COUNT, it seems, which is a conda-forge variable that sets the number of processors for CI builds. |
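In other words, the compile parallelism is likely controlled by `CPU_COUNT`. A sketch of how a recipe build script typically forwards it (whether this feedstock does exactly this is an assumption):

```bash
# In the recipe's build.sh (sketch): let the CI-provided core count drive compile parallelism
export MAX_JOBS=${CPU_COUNT}
```

So locally, setting e.g. `CPU_COUNT=12` before starting the build should raise the job count accordingly.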
I'm running … to see how much it helps. I think it should help a lot. Thanks for helping debug. I'm updating the suggested script too. |
That's still not quite right for me. I get:

```
$ nproc
2
$ nproc --all
24
```

The optimal number is the number of physical cores, I think. |
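A sketch of picking up the physical core count on Linux and feeding it to the build, assuming `lscpu` is available and that `CPU_COUNT` is the knob the build honours, as discussed above:

```bash
# Count unique (core, socket) pairs, i.e. physical cores without hyperthreads
PHYS_CORES=$(lscpu --parse=Core,Socket | grep -v '^#' | sort -u | wc -l)
export CPU_COUNT="${PHYS_CORES}"
echo "Building with ${CPU_COUNT} parallel jobs"
```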
It takes about 15 minutes before the build actually starts - downloading + solving the build env + cloning the repo is very slow. And probably at the end it'll take another 10 minutes, IIRC another conda solve is needed to set up the test env. And no tests are run other than |
Do you have a dual-CPU machine, or a big.LITTLE architecture machine? I've found that hyperthreading does help somewhat when compiling small files. |
I can update the instructions to divide by two when I get back to my computer. |
So okay, this does take a painfully long time. It took almost exactly 2 hours for me using 10 cores. There's no good way I can see to get a detailed breakdown of that, but here is my estimate based on peeking at the terminal output during meetings and the resource usage output in the build log:
A large part of the build time seems to be spent building Caffe2. @IvanYashchuk was looking at disabling that; hopefully it's possible (but probably nontrivial).

The number of CUDA architectures to build for is the main difference between a dev build and a conda package build. For the former it's just the architecture of the GPU installed in the machine plus PTX; for the latter it's 7 or 8 architectures.

The build used 7.3 GB of disk space for …. Details on usage statistics from the build log:
So it looks like if we use half the cores on a 32-core machine, the total time will be about 1 hr 30 min, so 16 builds take ~24 hrs and 160 GB of space. It's a bit painful, but still preferable to build everything on a single machine imho - fewer chances for mistakes to leak in.

EDIT: for completeness, the shell script to prep the build, to ensure I don't pick up env vars from my default config:
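(The script itself was not captured in this thread; the sketch below is a stand-in only, assuming the standard `build-locally.py` entry point and an illustrative set of variables to clear.)

```bash
#!/usr/bin/env bash
# Sketch, not the original script: run the build with a clean environment so
# personal conda config and compiler flags don't leak into the package.
set -euo pipefail
export CONDARC=$(mktemp)                 # point conda at an empty config file
unset CFLAGS CXXFLAGS LDFLAGS            # drop any custom compiler flags
unset TORCH_CUDA_ARCH_LIST MAX_JOBS      # let the recipe decide these
python build-locally.py "$1"             # pass the variant name as the first argument
```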
|
Awesome analysis, @rgommers. Thank you for taking the time to put this together. Do you see any potential path forward for getting these builds under the Azure CI timeout, or other options for automatically building them in the cloud? |
Why is it important to disable Caffe2 builds? Do you mean trying to share the stuff under torch_cpu in caffe2 between builds? |
As for why we don't use shallow clones: they don't end up being that shallow, and it seems to be hard to check out the tag. I raised the issue with boa. |
Probably not on 2 cores in 6 hours, especially for CUDA 11, unless Caffe2 can be disabled. The list of architectures keeps growing; for 11.2 it's:
It may be possible to prune that, but then there will be deviations from the official package. A Tesla P100 or P4 (see https://developer.nvidia.com/cuda-gpus) is still in use I think, and it'll then be hard for users to predict what GPUs are supported by what conda packages. Hooking in a custom builder so CI can be triggered is of course possible (and planned for GPU testing), but it is both work to implement and costly. PyTorch is not unique here; other packages like Qt and TensorFlow have the same problem that they take too long to build. That's more a question for the conda-forge core team; I'm not aware of a plan for this.
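The dev-build vs. package-build difference described above is governed by PyTorch's `TORCH_CUDA_ARCH_LIST` build variable. A sketch of the two settings (the exact architecture list the feedstock uses for CUDA 11.2 is not reproduced here, so the package-style list below is illustrative):

```bash
# Dev build: only the local GPU's compute capability plus PTX (example: 7.5 for a Turing card)
export TORCH_CUDA_ARCH_LIST="7.5+PTX"

# Package-style build: a broad, illustrative set of architectures
export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX"
```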
No, actually disable it. There's a lot being built there that's not needed - either only relevant for the mobile build, or just leftovers. Example: there's … |
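Disabling Caffe2 in a PyTorch build is usually done through build-time environment variables read by `setup.py`; whether the conda-forge recipe can simply set them without breaking anything is exactly the open question here, so treat this as a sketch (flag behaviour can vary across PyTorch versions):

```bash
export BUILD_CAFFE2=0       # skip building Caffe2 itself
export BUILD_CAFFE2_OPS=0   # skip the Caffe2 operator library
export BUILD_TEST=0         # also skips C++ test binaries, which saves more build time
```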
@rgommers I'm not sure what the path forward is for today. Are you able to build everything over 24/48 hours? Otherwise, I can keep chugging along building on my servers overnight. |
@benjaminrwilson they are separated. You can locally run …. I've just been manually running them one at a time. rgommers is trying to find a "more efficient" way to do this for long-term maintainability. |
I then upload them to my anaconda channel. Later, conda-forge can download the packages from there and upload them to its own channel. |
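A sketch of that upload step with the anaconda-client CLI (the channel name and artifact path are placeholders; conda-forge local builds typically land under `build_artifacts/`):

```bash
anaconda login   # authenticate against anaconda.org once
anaconda upload --user my-channel build_artifacts/linux-64/pytorch-1.9.0-*.tar.bz2
```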
Are the actual GPU-specific builds being separated too? Maybe I'm missing something, but it looks like the runs are split by CUDA version, but not by architecture as well:
|
I'm wrapping up things to go on holiday next week, so it's probably best if I didn't say yes.
Indeed, I don't think there's a good way to do this. |
Ah, I see. TBH, this isn't within the scope of this issue tracker. I really just want to get builds for pytorch 1.9 out there with GPU support. If you think it's worth discussing, please open a new issue about improving the build process. We can then define goals and have a more focused discussion. |
OK. I got my hands on a system that I might reasonably be able to leave running alone for a day or two. I've started the MKL 2021 builds on it and I'll report tomorrow whether it is doing well. |
3 builds = 10 hours. 16 builds = 54 hours. I guess it should be done by the end of the weekend. |
@isuruf MKL 2021 builds are complete. Is that enough for this? I might not have enough spare compute (or free time) to build for MKL 2020. Are you able to upload to conda-forge from my channel? |
How are things standing with the upload of the artefacts? 🙃 |
Generally, people might be busy. I try to ping once a week, or once every two weeks. isuruf is very motivated; I'm sure he hasn't forgotten about this. |
@hmaarrfk, can you mark the _1 builds with a label? |
Added. The label is |
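For reference, a label can be attached at upload time with anaconda-client (the label name below is a placeholder, since the actual one is not shown above); existing packages can also be relabelled through the anaconda.org web UI:

```bash
anaconda upload --label some-label pytorch-1.9.0-*_1.tar.bz2
```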
Disabling enough stuff gets things almost passing. But as expected, when building for many GPU architectures at once, it does take longer and longer. Honestly, I would like to keep building for multiple GPU architectures. On my systems, I often pair up a GT 1030 with a newer GPU so I can utilize the newer GPU to its full extent (as opposed to also using it for X11). |
Yeah, I completely get that. I guess one thing for us to consider is:
Additionally, something we could consider is … |
OK, I'm trying multi-threading. |
Looks like the option is available with |
Maybe then we try not to compress. |
I guess it is time to wait 6 hours. |
For what it's worth, I'm rebuilding for MKL 2020, but who knows if it will finish. Maybe they will be done by next week. |
I've uploaded the _1 builds. |
@isuruf are you able to upload the |
I think that the mkl2021 migration is complete and we can likely just avoid uploading the |
We will be starting a CUDA build run after #44 is merged. This table should help track the builds.
CUDA Build Tracker

MKL 2021: All 16 builds have been uploaded to conda-forge.

(The tracking table covered the `==1.9.0-*_0` and `==1.9.0-*_1` builds and the channels they were uploaded to.)