
CUDA Build tracker #52

Closed · hmaarrfk opened this issue Jul 25, 2021 · 50 comments

@hmaarrfk (Contributor) commented Jul 25, 2021

We will be starting a CUDA build run after #44 is merged. This table should help track the builds.

CUDA Build Tracker

MKL 2021: All 16 builds have been uploaded to conda-forge.

| # | Configuration | MKL 2020 (16/16), ==1.9.0-*_0, label: forge | MKL 2021 (16/16), ==1.9.0-*_1, on conda-forge |
|---|---------------|---------------------------------------------|-----------------------------------------------|
| 1 | CUDA 10.2, Python 3.6 | ramonaoptics | conda-forge |
| 2 | CUDA 10.2, Python 3.7 | ramonaoptics | conda-forge |
| 3 | CUDA 10.2, Python 3.8 | ramonaoptics | conda-forge |
| 4 | CUDA 10.2, Python 3.9 | ramonaoptics | conda-forge |
| 5 | CUDA 11.0, Python 3.6 | ramonaoptics | conda-forge |
| 6 | CUDA 11.0, Python 3.7 | ramonaoptics | conda-forge |
| 7 | CUDA 11.0, Python 3.8 | ramonaoptics | conda-forge |
| 8 | CUDA 11.0, Python 3.9 | ramonaoptics | conda-forge |
| 9 | CUDA 11.1, Python 3.6 | ramonaoptics | conda-forge |
| 10 | CUDA 11.1, Python 3.7 | ramonaoptics | conda-forge |
| 11 | CUDA 11.1, Python 3.8 | ramonaoptics | conda-forge |
| 12 | CUDA 11.1, Python 3.9 | ramonaoptics | conda-forge |
| 13 | CUDA 11.2, Python 3.6 | ramonaoptics | conda-forge |
| 14 | CUDA 11.2, Python 3.7 | ramonaoptics | conda-forge |
| 15 | CUDA 11.2, Python 3.8 | ramonaoptics | conda-forge |
| 16 | CUDA 11.2, Python 3.9 | ramonaoptics | conda-forge |

https://www.tablesgenerator.com/markdown_tables#

Channels

@IvanYashchuk (Contributor)

Mark, could you briefly explain the process? I was thinking that it's not allowed to upload manually built packages to conda-forge.
People would build the listed CUDA builds and upload them to some storage, then the feedstock maintainers would upload them manually to the conda-forge channel, right?

@hmaarrfk (Contributor, Author)

Upload them to your own public Anaconda channel.

I kind of want to merge the MKL migration first.

@rgommers

This approach seems a little fragile, and more work than needed, especially in the long term. There are 16 builds here, and one build takes 15-25 minutes on a decent build machine, so all binaries can be built in 4-6 hours. How about writing a reproducible and well-documented build script and letting a single person with access to a good build server build everything at once?

@hmaarrfk (Contributor, Author) commented Jul 26, 2021

#!/usr/bin/env bash

set -ex
conda activate base
# Ensure that the anaconda command exists for uploading
which anaconda

docker system prune --force
configs=$(find .ci_support/ -type f -name '*cuda_compiler_version[^nN]*' -printf "%p ")
anaconda upload  --skip build_artifacts/linux-64/pytorch*

# Assuming a powerful enough machine with many cores
# 10 seems to be a good point where things don't run out of RAM too much.
export CPU_COUNT=10

for config_filename in $configs; do
    filename=$(basename ${config_filename})
    config=${filename%.*}
    if [ -f build_artifacts/conda-forge-build-done-${config} ]; then
        echo skipped $config
        continue
    fi

    python build-locally.py $config
    # docker images get quite big clean them up after each build to save your disk....
    docker system prune --force
    anaconda upload  --skip build_artifacts/linux-64/pytorch*
done
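For this to work, one presumably needs to be logged in to anaconda.org first. A minimal usage sketch, assuming the script above is saved to a file (the filename and log name here are illustrative):

```bash
# Log in once so that `anaconda upload` has credentials (anaconda-client):
anaconda login

# Then run the loop above and keep a log of the whole run
# (the script name is illustrative):
bash build_cuda_variants.sh 2>&1 | tee build_cuda_variants.log
```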

15-25 mins.... what kind of machine do you have access to?

@hmaarrfk (Contributor, Author)

^^^^ It's kind of a serious question; I'm genuinely interested in knowing.

@rgommers

> 15-25 mins.... what kind of machine do you have access to?

Desktop 12-core / 32 GB. And we have a few 32-core / 128 GB dev servers.

@rgommers

If that script is all there is to it, that's quite nice. Also let's make sure it doesn't blow up disk space completely. Are they all separate Docker images, and how much space does one take?

@hmaarrfk (Contributor, Author)

The Docker images do take space.

My build machine's root storage is full and Docker is complaining about disk space for me.

@hmaarrfk (Contributor, Author)

@rgommers I'm not really sure what happened, but my build time was closer to 8 hours on an AMD Ryzen 7 3700X 8-core processor. Maybe my processor was oversubscribed, but I started a build just now and checked that nobody else had been using the server for at least an hour. I can report if the build finishes in under an hour, but I somewhat doubt it.

@rgommers

That seems really long. There is a lot of stuff to turn on and off, so I probably cheated here by turning a few of the expensive things off - in particular using USE_DISTRIBUTED=0. That said, here is an impression of the PyTorch CI build stages on CircleCI:

[screenshot: PyTorch CI build stage durations on CircleCI]

The >1 hr one is the mobile build. Regular Linux builds are in the 20-50 min range. 8 hours doesn't sound right, something must be misconfigured for you.

@hmaarrfk (Contributor, Author)

Thanks for the info. I think I'm not using all CPUs; I can see that only 2 are being used. I probably need to pass another environment variable through. I'll have to see how I can do that.

@rgommers

MAX_JOBS is the env var that controls how many cores the pytorch build will use.

I have started one build with build-locally.py to time how long it actually takes for me.

@hmaarrfk (Contributor, Author)

MAX_JOBS is set from CPU_COUNT, it seems, which is the conda-forge variable that sets the number of processors for CI builds.
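Presumably the wiring inside the recipe's build script is roughly along these lines (illustrative only, not the feedstock's exact code):

```bash
# Fall back to all processors if CPU_COUNT is unset; illustrative only.
export MAX_JOBS=${CPU_COUNT:-$(nproc)}
```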

@hmaarrfk (Contributor, Author)

I'm running

CPU_COUNT=$(nproc) time python build-locally.py

to see how much it helps. I think it should help a lot. Thanks for helping debug.

I'm updating the suggested script too.

@rgommers

> $(nproc)

That's still not quite right for me. I get:

$ nproc
2
$ nproc --all
24

The optimal number is the number of physical cores I think. 24 will be slower than 12.
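For reference, a sketch of one way to count physical cores on Linux (assumes `lscpu` is available):

```bash
# Unique (core, socket) pairs = physical cores, ignoring hyperthreads.
physical_cores=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
export CPU_COUNT=${physical_cores}
export MAX_JOBS=${physical_cores}
echo "Using ${physical_cores} physical cores"
```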

@rgommers

It takes about 15 minutes before the build actually starts - downloading + solving the build env + cloning the repo is very slow.

And probably at the end it'll take another 10 minutes, IIRC another conda solve is needed to set up the test env. And no tests are run other than import torch, so leaving this out of the recipe could help.

@hmaarrfk (Contributor, Author)

Do you have a dual-CPU machine, or a big.LITTLE architecture machine? I've found that hyperthreading does help somewhat when compiling small files.

@hmaarrfk (Contributor, Author)

I can update the instructions to divide by two when I get back to my computer.

@rgommers commented Jul 28, 2021

So okay, this does take a painfully long time. It took almost exactly 2 hours for me using 10 cores. There's no good way I can see to get a detailed breakdown of that, but here is my estimate, based on peeking at the terminal output during meetings and on the resource usage output in the build log:

  • 15-20 min for cloning the repo and setting up the build env
  • 1hr 20min to build
  • 10 min to set up test env
  • 10 min to run tests

A large part of the build time seems to be spent building Caffe2. @IvanYashchuk was looking at disabling that, hopefully it's possible (but probably nontrivial). The number of CUDA architectures to build for is the main difference between a dev build and a conda package build. For the former it's just the architecture of the GPU installed in the machine plus PTX, for the latter it's 7 or 8 architectures.
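To illustrate that difference, a rough sketch (the architecture values here are examples, not the feedstock's exact list):

```bash
# Dev build: only the local GPU's architecture plus PTX, e.g. for a Turing card:
export TORCH_CUDA_ARCH_LIST="7.5+PTX"

# Conda package build: every architecture the package should support
# (an illustrative CUDA 11.x-style list; the real list is quoted later in this thread):
export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX"
```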

The build used 7.3 GB of disk space for build_artifacts, plus whatever the Docker image took. 2.4 GB of those 7.3 GB is for a full clone of the repo (takes a while to clone too). Why not use a shallower clone here?

Details on usage statistics from the build log:

Resource usage statistics from bundling pytorch:
   Process count: 65
   CPU time: Sys=0:24:42.5, User=11:26:44.8
   Memory: 11.5G
   Disk usage: 2.4M
   Time elapsed: 1:30:08.9

Resource usage statistics from testing pytorch:
   Process count: 12
   CPU time: Sys=0:00:23.4, User=0:06:35.3
   Memory: 2.7G
   Disk usage: 85.6K
   Time elapsed: 0:08:38.5

Resource usage statistics from testing pytorch-gpu:
   Process count: 1
   CPU time: Sys=0:00:00.0, User=-
   Memory: 3.0M
   Disk usage: 16B
   Time elapsed: 0:00:02.9

Resource usage summary:

Total time: 2:00:05.6
CPU usage: sys=0:25:05.9, user=11:33:20.1
Maximum memory usage observed: 11.5G
Total disk usage observed (not including envs): 2.5M

So it looks like if we use half the cores on a 32-core machine, the total time will be about 1 hr 30 min per build. So 16 builds take ~24 hrs and 160 GB of space.

It's a bit painful, but still preferable to build everything on a single machine IMHO; fewer chances for mistakes to leak in.

EDIT: for completeness, the shell script to prep the build to ensure I don't pick up env vars from my default config:

unset USE_DISTRIBUTED
unset USE_MKLDNN
unset USE_FBGEMM
unset USE_NNPACK
unset USE_QNNPACK
unset USE_XNNPACK
unset USE_NCCL
unset USE_CUDA
export MAX_JOBS=10
export CPU_COUNT=10

@benjaminrwilson (Contributor)

Awesome analysis, @rgommers. Thank you for taking the time to put this together. Do you see any potential path forward for getting these builds under the Azure CI timeout, or other options for automatically building them in the cloud?

@hmaarrfk (Contributor, Author)

Why is it important to disable Caffe2 builds? Do you mean trying to share the stuff under torch_cpu in Caffe2 between builds?

@hmaarrfk (Contributor, Author)

As for why we don't use shallow clones: they don't end up being that shallow in practice, and it seems to be hard to check out the tag.

I raised the issue with boa: mamba-org/boa#172
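For reference, the kind of shallow, tag-pinned clone that would shrink the checkout looks roughly like this (a sketch; the tag name is illustrative, and PyTorch's submodules would still need handling):

```bash
# Fetch only the single commit behind the release tag instead of the full history.
git clone --depth 1 --branch v1.9.0 https://github.com/pytorch/pytorch.git
```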

@rgommers

> Do you see any potential path forward for getting these builds under the Azure CI timeout, or other options for automatically building them in the cloud?

Probably not on 2 cores in 6 hours, especially for CUDA 11, unless Caffe2 can be disabled. The list of architectures keeps growing, for 11.2 it's:

$TORCH_CUDA_ARCH_LIST;6.0;6.1;7.0;7.5;8.0;8.6

It may be possible to prune that, but then there are deviations from the official package. A Tesla P100 or P4 (see https://developer.nvidia.com/cuda-gpus) is still in use I think, and it would then be hard for users to predict which GPUs are supported by which conda packages.

Hooking in a custom builder so CI can be triggered is of course possible (and planned for GPU testing), but it's both work to implement and costly. PyTorch is not unique here; other packages like Qt and TensorFlow have the same problem of taking too long to build. That's more a question for the conda-forge core team; I'm not aware of a plan for this.

> Why is it important to disable Caffe2 builds? Do you mean trying to share the stuff under torch_cpu in Caffe2 between builds?

No, actually disable. There's a lot that's being built there that's not needed - either relevant for mobile build, or just leftovers. Example: there's torch.nn.AvgPool2d which is what users want, and then there's a Caffe2 AveragePool2D operator which is different. The plan for official PyTorch wheels and conda packages is to get rid of Caffe2 at some point.

@hmaarrfk (Contributor, Author)

@rgommers I'm not sure what the path forward is for today.

Are you able to build everything over 24/48 hours? Otherwise, I can keep chugging along building on my servers overnight.

@benjaminrwilson (Contributor)

@rgommers, @hmaarrfk, is there any way to split up the per-architecture builds? Could we feasibly have separate jobs for each supported CUDA arch?

@hmaarrfk (Contributor, Author)

@benjaminrwilson they are separated.

You can locally run python build-locally.py and select the configuration you want to build, as shown below.

I've just been manually running them one at a time. @rgommers is trying to find a "more efficient" way to do this for long-term maintainability.
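For example, something like the following, where the config name is illustrative; `ls .ci_support/` shows the real ones:

```bash
# Build a single variant by passing its .ci_support config name (without the .yaml suffix).
python build-locally.py linux_64_cuda_compiler_version11.2python3.8
```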

@hmaarrfk (Contributor, Author)

I then upload them to my Anaconda channel. Later, conda-forge can download the packages from there and upload them to its own channel.

@benjaminrwilson (Contributor)

Are the actual GPU-architecture-specific builds being separated too? Maybe I'm missing something, but it looks like the runs are split by CUDA version, but not by architecture as well:

export TORCH_CUDA_ARCH_LIST="3.5;5.0+PTX"

I mean adding another level to the build matrix as a product of the options in that link.

@rgommers

> Are you able to build everything over 24/48 hours? Otherwise, I can keep chugging along building on my servers overnight.

I'm wrapping up things to go on holiday next week, so it's probably best if I didn't say yes.

> Are the actual GPU-architecture-specific builds being separated too? Maybe I'm missing something, but it looks like the runs are split by CUDA version, but not by architecture as well:

Indeed, I don't think there's a good way to do this.

@hmaarrfk (Contributor, Author)

Ah, I see. TBH, this is outside the scope of this issue; I really just want to get PyTorch 1.9 builds with GPU support out there.

If you think it is worth discussing, please open a new issue to improve the build process. We can then define goals and have a more focused discussion.

@hmaarrfk (Contributor, Author)

OK. I got my hands on a system that I might reasonably be able to leave running alone for a day or two.

I've started the MKL 2021 builds on it and I'll report tomorrow whether it is doing well.

@hmaarrfk (Contributor, Author)

3 builds = 10 hours, so 16 builds ≈ 54 hours.

I guess it should be done by the end of the weekend.

@hmaarrfk (Contributor, Author) commented Aug 2, 2021

@isuruf MKL 2021 builds are complete. Is that enough for this? I might not have enough spare compute (or free time) to build for MKL 2020.

Are you able to upload to conda-forge from my channel?

@hmaarrfk changed the title from "WIP: CUDA Build tracker" to "CUDA Build tracker" on Aug 3, 2021
@h-vetinari (Member)

How are things standing with the upload of the artefacts? 🙃

@benjaminrwilson (Contributor)

@hmaarrfk, have you been able to get in touch with @isuruf?

@hmaarrfk (Contributor, Author) commented Aug 6, 2021

Generally, people might be busy. I try to ping once a week, or once every two weeks.

@isuruf is very motivated; I'm sure he hasn't forgotten about this.

@isuruf (Member) commented Aug 6, 2021

@hmaarrfk, can you mark the _1 builds with a label?

@hmaarrfk (Contributor, Author) commented Aug 6, 2021

Added. The label is forge.
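For context, labels can be attached at upload time with anaconda-client, roughly like this (the file path is illustrative; existing packages can also be relabeled on anaconda.org):

```bash
# Upload the _1 builds under the `forge` label instead of the default `main` label.
anaconda upload --label forge build_artifacts/linux-64/pytorch-1.9.0-*_1.tar.bz2
```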

@hmaarrfk (Contributor, Author) commented Aug 8, 2021

Disabling enough stuff gets things almost passing. But as expected, when building for many GPU architectures at once, it does take longer and longer.

Honestly, I would like to keep building for multiple GPU architectures. On my systems, I often pair up a GT 1030 with a newer GPU so that the newer GPU can be used to its full extent (as opposed to also driving X11).

#64

@benjaminrwilson (Contributor)

Yeah, I completely get that. I guess one thing for us to consider is:

nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).

Additionally, we could consider nvcc multithreading for certain CUDA versions (although I don't think this will solve everything): https://github.com/pytorch/builder/blob/e05c57608d7ee57bdbd9075ca604b0288ad86c25/manywheel/build.sh#L263
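For reference, the kind of thing that script enables looks roughly like this (a sketch only; `--threads` needs nvcc >= 11.2, and the exact plumbing through the PyTorch build may differ):

```bash
# Ask nvcc to compile for multiple architectures in parallel and compress fatbins.
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
```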

@hmaarrfk (Contributor, Author) commented Aug 8, 2021

OK, I'm trying multithreading.

@benjaminrwilson (Contributor)

Looks like the option is available with cudatoolkit >=11.2: https://docs.nvidia.com/cuda/archive/11.2.0/cuda-compiler-driver-nvcc/index.html.

@hmaarrfk (Contributor, Author) commented Aug 8, 2021

Maybe then we try not to compress.

@hmaarrfk (Contributor, Author) commented Aug 8, 2021

I guess it is time to wait 6 hours.

@hmaarrfk (Contributor, Author) commented Aug 8, 2021

For what it's worth, I'm rebuilding for MKL 2020, but who knows if it will finish. Maybe they will be done by next week.

@hmaarrfk (Contributor, Author) commented Aug 9, 2021

OK. I don't think I can upload any more to my own channel. I might have to remove some packages just to make space for my day job.

[screenshot: anaconda.org channel storage usage]

@isuruf (Member) commented Aug 10, 2021

I've uploaded the _1 builds.

@h-vetinari (Member)

Huge thanks @hmaarrfk and @isuruf for seeing this through!

@hmaarrfk (Contributor, Author)

@isuruf, are you able to upload the _0 builds? I removed the _1 builds from my channel and added the forge label to all the _0 builds.

@hmaarrfk (Contributor, Author)

I think the MKL 2021 migration is complete, so we can likely just skip uploading the _0 builds and save some storage space on Anaconda.
