
CUDA Build tracker #52

Closed · hmaarrfk opened this issue Jul 25, 2021 · 50 comments

@hmaarrfk (Contributor) commented Jul 25, 2021

We will be starting a CUDA build run after #44 is merged. This table should help track the builds.

CUDA Build Tracker

MKL 2021: All 16 builds have been uploaded to conda-forge.

| # | Configuration | MKL 2020 (16/16), ==1.9.0-*_0, label: forge | MKL 2021 (16/16), ==1.9.0-*_1, on conda-forge |
|---|---------------|---------------------------------------------|-----------------------------------------------|
| 1 | CUDA 10.2, Python 3.6 | ramonaoptics | conda-forge |
| 2 | CUDA 10.2, Python 3.7 | ramonaoptics | conda-forge |
| 3 | CUDA 10.2, Python 3.8 | ramonaoptics | conda-forge |
| 4 | CUDA 10.2, Python 3.9 | ramonaoptics | conda-forge |
| 5 | CUDA 11.0, Python 3.6 | ramonaoptics | conda-forge |
| 6 | CUDA 11.0, Python 3.7 | ramonaoptics | conda-forge |
| 7 | CUDA 11.0, Python 3.8 | ramonaoptics | conda-forge |
| 8 | CUDA 11.0, Python 3.9 | ramonaoptics | conda-forge |
| 9 | CUDA 11.1, Python 3.6 | ramonaoptics | conda-forge |
| 10 | CUDA 11.1, Python 3.7 | ramonaoptics | conda-forge |
| 11 | CUDA 11.1, Python 3.8 | ramonaoptics | conda-forge |
| 12 | CUDA 11.1, Python 3.9 | ramonaoptics | conda-forge |
| 13 | CUDA 11.2, Python 3.6 | ramonaoptics | conda-forge |
| 14 | CUDA 11.2, Python 3.7 | ramonaoptics | conda-forge |
| 15 | CUDA 11.2, Python 3.8 | ramonaoptics | conda-forge |
| 16 | CUDA 11.2, Python 3.9 | ramonaoptics | conda-forge |

https://www.tablesgenerator.com/markdown_tables#

Channels

@IvanYashchuk (Contributor)

Mark, could you briefly explain the process? I was thinking that it's not allowed to upload manually built packages to conda-forge.
People would build the listed CUDA builds and upload them to some storage, then the feedstock maintainers would upload them manually to the conda-forge channel, right?

@hmaarrfk (Contributor, Author)

Upload them to your own public Anaconda channel.

I kind of want to merge the MKL migration first.

@rgommers

This approach seems a little fragile, and more work than needed, especially in the long term. There are 16 builds here, and one build takes 15-25 minutes on a decent build machine, so all binaries can be built in 4-6 hours. How about writing a reproducible and well-documented build script and letting a single person with access to a good build server build everything at once?

@hmaarrfk (Contributor, Author) commented Jul 26, 2021

#!/usr/bin/env bash

set -ex
conda activate base
# Ensure that the anaconda command exists for uploading
which anaconda

docker system prune --force
configs=$(find .ci_support/ -type f -name '*cuda_compiler_version[^nN]*' -printf "%p ")
anaconda upload  --skip build_artifacts/linux-64/pytorch*

# Assuming a powerful enough machine with many cores
# 10 seems to be a good point where things don't run out of RAM too much.
export CPU_COUNT=10

for config_filename in $configs; do
    filename=$(basename ${config_filename})
    config=${filename%.*}
    if [ -f build_artifacts/conda-forge-build-done-${config} ]; then
        echo skipped $config
        continue
    fi

    python build-locally.py $config
    # docker images get quite big clean them up after each build to save your disk....
    docker system prune --force
    anaconda upload  --skip build_artifacts/linux-64/pytorch*
done
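For this to work, one presumably needs to be logged in to anaconda.org first. A minimal usage sketch, assuming the script above is saved to a file (the filename and log name here are illustrative):

```bash
# Log in once so that `anaconda upload` has credentials (anaconda-client):
anaconda login

# Then run the loop above and keep a log of the whole run
# (the script name is illustrative):
bash build_cuda_variants.sh 2>&1 | tee build_cuda_variants.log
```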

15-25 mins.... what kind of machine do you have access to?

@hmaarrfk (Contributor, Author)

^^^^ It's kind of a serious question; I'm genuinely interested in knowing.

@rgommers

> 15-25 mins.... what kind of machine do you have access to?

Desktop 12-core / 32 GB. And we have a few 32-core / 128 GB dev servers.

@rgommers

If that script is all there is to it, that's quite nice. Also let's make sure it doesn't blow up disk space completely. Are they all separate Docker images, and how much space does one take?

@hmaarrfk (Contributor, Author)

The Docker images do take space.

My build machine's root storage is full and Docker is complaining about disk space for me.

@hmaarrfk (Contributor, Author)

@rgommers I'm not really sure what happened, but my build time was closer to 8 hours on an AMD Ryzen 7 3700X 8-core processor. Maybe my processor was oversubscribed, but I started a build just now and checked that nobody else had been using the server for at least an hour. I can report if the build finishes in under an hour, but I somewhat doubt it.

@rgommers

That seems really long. There is a lot of stuff to turn on and off, so I probably cheated here by turning a few of the expensive things off - in particular using USE_DISTRIBUTED=0. That said, here is an impression of the PyTorch CI build stages on CircleCI:

[screenshot: PyTorch CI build stage durations on CircleCI]

The >1 hr one is the mobile build. Regular Linux builds are in the 20-50 min range. 8 hours doesn't sound right, something must be misconfigured for you.

@hmaarrfk (Contributor, Author)

Thanks for the info. I think I'm not using all CPUs; I can see that only 2 are being used. I probably need to pass another environment variable through. I'll have to see how I can do that.

@rgommers

MAX_JOBS is the env var that controls how many cores the pytorch build will use.

I have started one build with build-locally.py to time how long it actually takes for me.

@hmaarrfk (Contributor, Author)

MAX_JOBS is set from CPU_COUNT, it seems, which is the conda-forge variable that sets the number of processors for CI builds.
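Presumably the wiring inside the recipe's build script is roughly along these lines (illustrative only, not the feedstock's exact code):

```bash
# Fall back to all processors if CPU_COUNT is unset; illustrative only.
export MAX_JOBS=${CPU_COUNT:-$(nproc)}
```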

@hmaarrfk (Contributor, Author)

I'm running

CPU_COUNT=$(nproc) time python build-locally.py

to see how much it helps. I think it should help a lot. Thanks for helping debug.

I'm updating the suggested script too.

@rgommers

> $(nproc)

That's still not quite right for me. I get:

$ nproc
2
$ nproc --all
24

The optimal number is the number of physical cores I think. 24 will be slower than 12.
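For reference, a sketch of one way to count physical cores on Linux (assumes `lscpu` is available):

```bash
# Unique (core, socket) pairs = physical cores, ignoring hyperthreads.
physical_cores=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
export CPU_COUNT=${physical_cores}
export MAX_JOBS=${physical_cores}
echo "Using ${physical_cores} physical cores"
```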

@rgommers

It takes about 15 minutes before the build actually starts - downloading + solving the build env + cloning the repo is very slow.

And probably at the end it'll take another 10 minutes, IIRC another conda solve is needed to set up the test env. And no tests are run other than import torch, so leaving this out of the recipe could help.

@hmaarrfk (Contributor, Author)

Do you have a dual-CPU machine, or a big.LITTLE architecture machine? I've found that hyperthreading does help somewhat when compiling small files.

@hmaarrfk (Contributor, Author)

I can update the instructions to divide by two when I get back to my computer.

@rgommers commented Jul 28, 2021

So okay, this does take a painfully long time. It took almost exactly 2 hours for me using 10 cores. There's no good way I can see to get a detailed breakdown of that, but here is my estimate, based on peeking at the terminal output during meetings and on the resource usage output in the build log:

  • 15-20 min for cloning the repo and setting up the build env
  • 1hr 20min to build
  • 10 min to set up test env
  • 10 min to run tests

A large part of the build time seems to be spent building Caffe2. @IvanYashchuk was looking at disabling that, hopefully it's possible (but probably nontrivial). The number of CUDA architectures to build for is the main difference between a dev build and a conda package build. For the former it's just the architecture of the GPU installed in the machine plus PTX, for the latter it's 7 or 8 architectures.
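To illustrate that difference, a rough sketch (the architecture values here are examples, not the feedstock's exact list):

```bash
# Dev build: only the local GPU's architecture plus PTX, e.g. for a Turing card:
export TORCH_CUDA_ARCH_LIST="7.5+PTX"

# Conda package build: every architecture the package should support
# (an illustrative CUDA 11.x-style list; the real list is quoted later in this thread):
export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX"
```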

The build used 7.3 GB of disk space for build_artifacts, plus whatever the Docker image took. 2.4 GB of those 7.3 GB is for a full clone of the repo (takes a while to clone too). Why not use a shallower clone here?

Details on usage statistics from the build log:

Resource usage statistics from bundling pytorch:
   Process count: 65
   CPU time: Sys=0:24:42.5, User=11:26:44.8
   Memory: 11.5G
   Disk usage: 2.4M
   Time elapsed: 1:30:08.9

Resource usage statistics from testing pytorch:
   Process count: 12
   CPU time: Sys=0:00:23.4, User=0:06:35.3
   Memory: 2.7G
   Disk usage: 85.6K
   Time elapsed: 0:08:38.5

Resource usage statistics from testing pytorch-gpu:
   Process count: 1
   CPU time: Sys=0:00:00.0, User=-
   Memory: 3.0M
   Disk usage: 16B
   Time elapsed: 0:00:02.9

Resource usage summary:

Total time: 2:00:05.6
CPU usage: sys=0:25:05.9, user=11:33:20.1
Maximum memory usage observed: 11.5G
Total disk usage observed (not including envs): 2.5M

So it looks like if we use half the cores on a 32-core machine, the total time will be about 1 hr 30 min per build. So 16 builds take ~24 hrs and 160 GB of space.

It's a bit painful, but still preferable to build everything on a single machine IMHO; fewer chances for mistakes to leak in.

EDIT: for completeness, the shell script to prep the build to ensure I don't pick up env vars from my default config:

unset USE_DISTRIBUTED
unset USE_MKLDNN
unset USE_FBGEMM
unset USE_NNPACK
unset USE_QNNPACK
unset USE_XNNPACK
unset USE_NCCL
unset USE_CUDA
export MAX_JOBS=10
export CPU_COUNT=10

@benjaminrwilson (Contributor)

Awesome analysis, @rgommers. Thank you for taking the time to put this together. Do you see any potential path forward for getting these builds under the Azure CI timeout, or other options for automatically building them in the cloud?

@hmaarrfk (Contributor, Author)

Why is it important to disable Caffe2 builds? Do you mean trying to share the stuff under torch_cpu in Caffe2 between builds?

@hmaarrfk (Contributor, Author)

As for why we don't use shallow clones: they don't end up being that shallow in practice, and it seems to be hard to check out the tag.

I raised the issue with boa: mamba-org/boa#172
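For reference, the kind of shallow, tag-pinned clone that would shrink the checkout looks roughly like this (a sketch; the tag name is illustrative, and PyTorch's submodules would still need handling):

```bash
# Fetch only the single commit behind the release tag instead of the full history.
git clone --depth 1 --branch v1.9.0 https://github.com/pytorch/pytorch.git
```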

@rgommers

> Do you see any potential path forward for getting these builds under the Azure CI timeout, or other options for automatically building them in the cloud?

Probably not on 2 cores in 6 hours, especially for CUDA 11, unless Caffe2 can be disabled. The list of architectures keeps growing, for 11.2 it's:

$TORCH_CUDA_ARCH_LIST;6.0;6.1;7.0;7.5;8.0;8.6

It may be possible to prune that, but then there are deviations from the official package. A Tesla P100 or P4 (see https://developer.nvidia.com/cuda-gpus) is still in use I think, and it would then be hard for users to predict which GPUs are supported by which conda packages.

Hooking in a custom builder so CI can be triggered is of course possible (and planned for GPU testing), but it's both work to implement and costly. PyTorch is not unique here; other packages like Qt and TensorFlow have the same problem of taking too long to build. That's more a question for the conda-forge core team; I'm not aware of a plan for this.

> Why is it important to disable Caffe2 builds? Do you mean trying to share the stuff under torch_cpu in Caffe2 between builds?

No, actually disable. There's a lot that's being built there that's not needed - either relevant for mobile build, or just leftovers. Example: there's torch.nn.AvgPool2d which is what users want, and then there's a Caffe2 AveragePool2D operator which is different. The plan for official PyTorch wheels and conda packages is to get rid of Caffe2 at some point.

@hmaarrfk (Contributor, Author)

@rgommers I'm not sure what the path forward is for today.

Are you able to build everything over 24/48 hours? Otherwise, I can keep chugging along building on my servers overnight.

@benjaminrwilson (Contributor)

@rgommers, @hmaarrfk, is there any way to split up the per-architecture builds? Could we feasibly have separate jobs for each supported CUDA arch?

@hmaarrfk (Contributor, Author)

@benjaminrwilson they are separated.

You can locally run python build-locally.py and select the configuration you want to build, as shown below.

I've just been manually running them one at a time. @rgommers is trying to find a "more efficient" way to do this for long-term maintainability.
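For example, something like the following, where the config name is illustrative; `ls .ci_support/` shows the real ones:

```bash
# Build a single variant by passing its .ci_support config name (without the .yaml suffix).
python build-locally.py linux_64_cuda_compiler_version11.2python3.8
```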

@hmaarrfk (Contributor, Author)

I then upload them to my Anaconda channel. Later, conda-forge can download the packages from there and upload them to its own channel.

@benjaminrwilson (Contributor)

Are the actual GPU-architecture-specific builds being separated too? Maybe I'm missing something, but it looks like the runs are split by CUDA version, but not by architecture as well:

export TORCH_CUDA_ARCH_LIST="3.5;5.0+PTX"

I mean adding another level to the build matrix as a product of the options in that link.

@rgommers

> Are you able to build everything over 24/48 hours? Otherwise, I can keep chugging along building on my servers overnight.

I'm wrapping up things to go on holiday next week, so it's probably best if I didn't say yes.

> Are the actual GPU-architecture-specific builds being separated too? Maybe I'm missing something, but it looks like the runs are split by CUDA version, but not by architecture as well:

Indeed, I don't think there's a good way to do this.

@hmaarrfk (Contributor, Author)

Ah, I see. TBH, this is outside the scope of this issue; I really just want to get PyTorch 1.9 builds with GPU support out there.

If you think it is worth discussing, please open a new issue to improve the build process. We can then define goals and have a more focused discussion.

@hmaarrfk (Contributor, Author)

OK. I got my hands on a system that I might reasonably be able to leave running alone for a day or two.

I've started the MKL 2021 builds on it and I'll report tomorrow whether it is doing well.

@hmaarrfk (Contributor, Author)

3 builds = 10 hours, so 16 builds ≈ 54 hours.

I guess it should be done by the end of the weekend.

@hmaarrfk (Contributor, Author) commented Aug 2, 2021

@isuruf MKL 2021 builds are complete. Is that enough for this? I might not have enough spare compute (or free time) to build for MKL 2020.

Are you able to upload to conda-forge from my channel?

@hmaarrfk changed the title from "WIP: CUDA Build tracker" to "CUDA Build tracker" on Aug 3, 2021
@h-vetinari (Member)

How are things standing with the upload of the artefacts? 🙃

@benjaminrwilson (Contributor)

@hmaarrfk, have you been able to get in touch with @isuruf?

@hmaarrfk (Contributor, Author) commented Aug 6, 2021

Generally, people might be busy. I try to ping once a week, or once every two weeks.

@isuruf is very motivated; I'm sure he hasn't forgotten about this.

@isuruf (Member) commented Aug 6, 2021

@hmaarrfk, can you mark the _1 builds with a label?

@hmaarrfk (Contributor, Author) commented Aug 6, 2021

Added. The label is forge.
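For context, labels can be attached at upload time with anaconda-client, roughly like this (the file path is illustrative; existing packages can also be relabeled on anaconda.org):

```bash
# Upload the _1 builds under the `forge` label instead of the default `main` label.
anaconda upload --label forge build_artifacts/linux-64/pytorch-1.9.0-*_1.tar.bz2
```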

@hmaarrfk (Contributor, Author) commented Aug 8, 2021

Disabling enough stuff gets things almost passing. But as expected, when building for many GPU architectures at once, it does take longer and longer.

Honestly, I would like to keep building for multiple GPU architectures. On my systems, I often pair up a GT 1030 with a newer GPU so that the newer GPU can be used to its full extent (as opposed to also driving X11).

#64

@benjaminrwilson (Contributor)

Yeah, I completely get that. I guess one thing for us to consider is:

nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).

Additionally, we could consider nvcc multithreading for certain CUDA versions (although I don't think this will solve everything): https://github.com/pytorch/builder/blob/e05c57608d7ee57bdbd9075ca604b0288ad86c25/manywheel/build.sh#L263
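For reference, the kind of thing that script enables looks roughly like this (a sketch only; `--threads` needs nvcc >= 11.2, and the exact plumbing through the PyTorch build may differ):

```bash
# Ask nvcc to compile for multiple architectures in parallel and compress fatbins.
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
```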

@hmaarrfk (Contributor, Author) commented Aug 8, 2021

OK, I'm trying multithreading.

@benjaminrwilson (Contributor)

Looks like the option is available with cudatoolkit >=11.2: https://docs.nvidia.com/cuda/archive/11.2.0/cuda-compiler-driver-nvcc/index.html.

@hmaarrfk (Contributor, Author) commented Aug 8, 2021

Maybe then we try not to compress.

@hmaarrfk (Contributor, Author) commented Aug 8, 2021

I guess it is time to wait 6 hours.

@hmaarrfk (Contributor, Author) commented Aug 8, 2021

For what it's worth, I'm rebuilding for MKL 2020, but who knows if it will finish. Maybe they will be done by next week.

@hmaarrfk (Contributor, Author) commented Aug 9, 2021

OK. I don't think I can upload any more to my own channel. I might have to remove some packages just to make space for my day job.

[screenshot: anaconda.org channel storage usage]

@isuruf (Member) commented Aug 10, 2021

I've uploaded the _1 builds.

@h-vetinari (Member)

Huge thanks @hmaarrfk and @isuruf for seeing this through!

@hmaarrfk (Contributor, Author)

@isuruf, are you able to upload the _0 builds? I removed the _1 builds from my channel and added the forge label to all the _0 builds.

@hmaarrfk (Contributor, Author)

I think the MKL 2021 migration is complete, so we can likely just skip uploading the _0 builds and save some storage space on Anaconda.
