
Splitting this package in manageable chunks #108

Open · hmaarrfk opened this issue May 28, 2022 · 55 comments

Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@hmaarrfk (Contributor) commented May 28, 2022

Comment:

This package currently requires more than 16 builds to be built manually to ensure that they complete in time on the CIs.

Step 1: No more git clone

@rgommers identified that one portion of the build process that takes time is cloning the repository. In my experience, cloning the 1.5 GB repo can take up to 10 minutes on my powerful local machine, and I suspect it can take much longer on the CIs.

To avoid cloning, we will have to list out all the submodules manually, or turn them into conda-forge-installable dependencies.

I mostly got this working using a recursive script which should help us keep it maintained: #109
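For illustration, here is a rough sketch of what a git-free source section could look like in meta.yaml, using the GitHub tag tarball plus folder: entries for the vendored submodules. The checksums are placeholders and the real submodule list would come from the recursive script in #109, so treat this as a hedged sketch rather than the actual recipe:

```yaml
source:
  # Main sources from the release tag instead of a recursive git clone.
  - url: https://github.com/pytorch/pytorch/archive/refs/tags/v1.11.0.tar.gz
    sha256: 0000000000000000000000000000000000000000000000000000000000000000  # placeholder
  # Each submodule is fetched as its own tarball and unpacked where the build
  # expects it; the pinned commits would be generated by the script in #109.
  - url: https://github.com/pybind/pybind11/archive/<pinned-commit>.tar.gz
    sha256: 0000000000000000000000000000000000000000000000000000000000000000  # placeholder
    folder: third_party/pybind11
  - url: https://github.com/google/XNNPACK/archive/<pinned-commit>.tar.gz
    sha256: 0000000000000000000000000000000000000000000000000000000000000000  # placeholder
    folder: third_party/XNNPACK
```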

Option 1: Split off dependencies

| Dependency | linux | mac | win | GPU Aware | PR | system deps |
| --- | --- | --- | --- | --- | --- | --- |
| pybind11 | | | | no | https://github.com/conda-forge/pybind11-feedstock | USE_SYSTEM_PYBIND11 |
| cub | | | | no | https://github.com/conda-forge/cub-feedstock | |
| eigen | | | | no | https://github.com/conda-forge/eigen-feedstock | USE_SYSTEM_EIGEN_INSTALL |
| googletest | | | | no | will not package | |
| benchmark | | | | no | https://github.com/conda-forge/benchmark-feedstock | |
| protobuf | | | | no | https://github.com/conda-forge/libprotobuf-feedstock | |
| ios-cmake | | | | | not needed since we don't target ios | |
| NNPACK | yes | yes | | no | conda-forge/staged-recipes#19103 | |
| gloo | yes | yes | | yes | conda-forge/staged-recipes#19103 | USE_SYSTEM_GLOO |
| pthreadpool | yes | yes | | no | conda-forge/staged-recipes#19103 | USE_SYSTEM_PTHREADPOOL |
| FXdiv | yes | yes | | header | conda-forge/staged-recipes#19103 | USE_SYSTEM_FXDIV |
| FP16 | yes | yes | | header | conda-forge/staged-recipes#19103 | USE_SYSTEM_FP16 |
| psimd | yes | yes | | header | conda-forge/staged-recipes#19103 | USE_SYSTEM_PSIMD |
| zstd | yes | yes | yes | no | https://github.com/conda-forge/zstd-feedstock | |
| cpuinfo | yes | yes | no | no | conda-forge/staged-recipes#19103 | USE_SYSTEM_CPUINFO |
| python-enum | | | | no | https://github.com/conda-forge/enum34-feedstock | |
| python-peachpy | yes | yes | yes | no | conda-forge/staged-recipes#19103 | |
| python-six | yes | yes | yes | no | https://github.com/conda-forge/six-feedstock | |
| onnx | | | | no | https://github.com/conda-forge/onnx-feedstock | USE_SYSTEM_ONNX |
| onnx-tensorrt | | | | only | | |
| sleef | | | | no | https://github.com/conda-forge/sleef-feedstock | USE_SYSTEM_SLEEF |
| ideep | | | | | | |
| oneapisrc | | | | | | |
| nccl | | | | | https://github.com/conda-forge/nccl-feedstock | |
| gemmlowp | | | | | | |
| QNNPACK | yes | yes | | | conda-forge/staged-recipes#19103 | |
| neon2sse | | | | | | |
| fbgemm | | | | yes | | |
| foxi | | | | | | |
| tbb | | | | | https://github.com/conda-forge/tbb-feedstock | USE_SYSTEM_TBB (deprecated) |
| fbjni | | | | | | |
| XNNPACK | yes | yes | | | conda-forge/staged-recipes#19103 | USE_SYSTEM_XNNPACK |
| fmt | | | | | https://github.com/conda-forge/fmt-feedstock | |
| tensorpipe | | | | yes | | |
| cudnn_frontend | | | | | | |
| kineto | | | | | | |
| pocketfft | | | | | | |
| breakpad | | | | | | |
| flatbuffers | yes | yes | yes | no | https://github.com/conda-forge/flatbuffers-feedstock | |
| clog | static | static | | | conda-forge/staged-recipes#19103 | |
  • clog seems to be a pretty low-level library that is assisted by compile-time flags. I think it is best if we don't package that one as a library; it seems like it will require some serious consideration in terms of performance if we do. They typically vendor the full source in the repository. The only problematic thing is that each package attempts to install the static library into the library path.
  • QNNPACK has a build option to allow a special provision for CAFFE2's implementation of pthreadpool
    • It seems to be problematic with pthreadpool on OSX.
  • QNNPACK likely has two different implementations, the one they vendored in ATen, and the one they vendored in third_party.
  • NNPACK has two different backends: one generated by Python, it seems (but for some reason fp16.py cannot be found), and the other using psimd.
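To make the wiring concrete, here is a hedged sketch of how an unbundled dependency from the table above would be consumed: the package becomes a host requirement and the matching USE_SYSTEM_* variable would be exported in build.sh so PyTorch's CMake skips the copy under third_party/. The package names below mirror the feedstocks/PRs linked in the table, and some of them may not exist on conda-forge yet:

```yaml
requirements:
  host:
    - pybind11      # USE_SYSTEM_PYBIND11=1
    - eigen         # USE_SYSTEM_EIGEN_INSTALL=1
    - sleef         # USE_SYSTEM_SLEEF=1
    - onnx          # USE_SYSTEM_ONNX=1
    - gloo          # USE_SYSTEM_GLOO=1 (proposed in conda-forge/staged-recipes#19103)
    - cpuinfo       # USE_SYSTEM_CPUINFO=1 (proposed in conda-forge/staged-recipes#19103)
```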

Option 2 - step 1: Build a libpytorch package or something

By setting BUILD_PYTHON=OFF in #112 we then end up with the following libraries in lib and include:

| Dependency | linux | mac | win | GPU Aware | PR |
| --- | --- | --- | --- | --- | --- |
| libasmjit | yes | yes | | | conda-forge/staged-recipes#19103 |
| libc10 | yes | yes | | | conda-forge/staged-recipes#19103 |
| libfbgemm | yes | yes | | yes | conda-forge/staged-recipes#19103 |
| libgloo | yes | yes | | yes | |
| libkineto | yes | yes | | | conda-forge/staged-recipes#19103 |
| libnnpack | yes | ??? | | | conda-forge/staged-recipes#19103 |
| libpytorch_qnnpack | yes | yes | | | conda-forge/staged-recipes#19103 |
| libqnnpack | yes | yes | | | conda-forge/staged-recipes#19103 |
| libtensorpipe | | | | yes | |
| libtorch | | | | | |
| libtorch_cpu | | | | | |
| libtorch_global_deps | | | | | |
| ATen (header only) | | | | | |
| c10d (header only) | | | | | |
| caffe2 (header only) | | | | | |
| libnop | yes | yes | | | conda-forge/staged-recipes#19103 |

Option 2 - step 2: Depend on new ATen/libpytorch package

Compilation time progress

| platform | python | cuda | main | tar gh-109 | system deps |
| --- | --- | --- | --- | --- | --- |
| linux 64 | 3.7 | no | 1h57m | 1h54m | |
| linux 64 | 3.8 | no | 2h0m | 1h51m | |
| linux 64 | 3.9 | no | 2h31m | 2h2m | |
| linux 64 | 3.10 | no | 2h26m | 2h7m | |
| linux 64 | 3.7 | 11.2 | 6h+ (3933/4242, 309 remaining) | 6h+ | |
| linux 64 | 3.8 | 11.2 | 6h+ (3897/4242, 345 remaining) | 6h+ | |
| linux 64 | 3.9 | 11.2 | 6h+ (3924/4242, 318 remaining) | 6h+ | 6h+ (1656/1969, 313 remaining) |
| linux 64 | 3.10 | 11.2 | 6h+ (3962/4242, 280 remaining) | 6h+ | |
| osx-64 | 3.7 | | 2h42m | 2h39m | |
| osx-64 | 3.8 | | 3h28m | 2h52m | |
| osx-64 | 3.9 | | 2h40m | 2h42m | |
| osx-64 | 3.10 | | 3h2m | 2h42m | |
| osx-arm-64 | 3.8 | | 1h51m | 1h37m | |
| osx-arm-64 | 3.9 | | 2h20m | 2h10m | |
| osx-arm-64 | 3.10 | | 4h25m | 2h1m | |

There are approximately:

  • 3600 files to compile for cmake for the CPU builds with the standard build process
  • 1600-1800 files to compile when using system dependencies: WIP: Use more system libs #111
@hmaarrfk added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels and removed the question (Further information is requested) label on May 28, 2022
@rgommers

> To avoid cloning, we will have to list out all the submodules manually, or turn them into conda-forge-installable dependencies.

Cloning with --depth 1 seems preferable to separately building as dependencies. Separate dependencies/feedstocks/packages is a lot of overhead and noise for something that isn't usable by anything other than this feedstock.

The script in gh-109 looks interesting. Should work like that I guess; I just forgot why using --depth doesn't work? Seems like a lacking feature in git itself if it doesn't allow a shallow clone.

@hmaarrfk (Contributor, Author)

I think the problem is that conda first clones the main branch with depth 1, then cannot switch to an older tag like version v1.11.0 because it didn't clone it.

It also didn't play well with caching.

It is somewhat of a job to unbundle, but I guess I find it worthwhile if it means we can release this more easily. I'm hoping I can patch things in a way that is acceptable upstream.

@hmaarrfk (Contributor, Author)

I remember what conda tried to do:

  • it tried to clone a bare repo locally,
  • then used that as a cache to clone the sources for the build.

That's not super valuable for CI workflows, but makes it hard to do a shallow clone. At the time I couldn't think of a solution to propose upstream to conda build.

@rgommers

> It is somewhat of a job to unbundle, but I guess I find it worthwhile if it means we can release this more easily. I'm hoping I can patch things in a way that is acceptable upstream.

Makes sense. I have no problem with unbundling provided it doesn't change the sources that are built. The kind of unbundling Linux distros do, like "hey, this project is pinned to version X of some dependency, but we insist on using our own version Y", is much more problematic, because you then build a combo of sources that is not tested at all in upstream CI, and may be plain buggy.

@hmaarrfk (Contributor, Author)

We kinda do that with a lot of C dependencies, don't we (not as much in pytorch)?

My hope is that I can split off onnx and ATen in versions that match pytorch.

@hmaarrfk (Contributor, Author)

You can follow somewhat of a first pass at step 2 here: conda-forge/staged-recipes#19103 (comment)

There are quite a few header-only and other libraries that get downloaded on the fly using custom cmake code.

That isn't really fun. So even what I did in gh-109 isn't really complete, in terms of not downloading during build.

@h-vetinari (Member) commented May 29, 2022

> That isn't really fun. So even what I did in gh-109 isn't really complete, in terms of not downloading during build.

That list of submodules is insane. Reminds me of what I quipped in #76 when I first came across that:

> Pytorch has [...] under third_party/ (side note: holy maccaroni, that folder is a packager's nightmare 😅).

Seems even that was underestimating the extent of the issue. Unsurprisingly, I really dislike this "we vendor specific commits of open source projects" development model - it's a very "my castle" approach.

On the other hand, I see where it is coming from, with C/C++'s complete lack of standardised tooling around distribution.

@h-vetinari (Member)

But I don't get so many things in that list, especially mature projects. Why vendor six? tbb? fmt? pybind11? The list goes on.

All in all, I fully support ripping this apart one by one (hopefully even in ways that would be palatable upstream), but I get Ralf's point about not diverging from what's actually being tested - though I'd be fine to caveat that based on an actually conceivable risk of breakage (e.g. if there are no functional changes between the vendored commit and a released version in a given submodule)

@hmaarrfk (Contributor, Author)

> On the other hand, I see where it is coming from, with C/C++'s complete lack of standardised tooling around distribution.

Right. This is likely what the original creators were grappling with. They decided to either use git submodules in certain projects, or cmake code to download things they needed. Bazel does the same.

> But I don't get so many things in that list, especially mature projects. Why vendor six? tbb? fmt? pybind11? The list goes on.

The issue comes down to who is in charge of support. pytorch (and Facebook) cannot force six or tbb to push a fix if their users (other developers at Facebook) find a problem. Eventually, one user will have an issue. Because they have the developer resources, they decide to take on the responsibility of maintaining it within their package.

When pip was the only option, you were beholden to the creator of the original package on PyPI, pleading with them to support a feature you need (I've been there many times, and in a sense, we are there with our packaging asks for pytorch).

@h-vetinari (Member)

> The issue comes down to who is in charge of support. pytorch (and Facebook) cannot force six or tbb to push a fix if their users (other developers at Facebook) find a problem. Eventually, one user will have an issue. Because they have the developer resources, they decide to take on the responsibility of maintaining it within their package.

Sure, but what's missing IMO is closing the loop to a released version with the bugfix afterwards.

@hmaarrfk (Contributor, Author)

> Sure, but what's missing IMO is closing the loop to a released version with the bugfix afterwards.

It's pretty hard to make a business case as to why you should spend a few hours, and likely more time, submitting a fix upstream after you have fixed things for your users.

Anyway, I'm just going through and listing things that need to be done. There are a few big packages that we might be able to take advantage of.

@rgommers

I think you're both missing a very important point: dependencies are fragile. Once you have O(25) dependencies, and would actually express them as dependencies, you become susceptible to a ton of extra bugs (even aside from a ton more packaging/distribution issues). It simply isn't workable.

I had to explain the same thing when SciPy added Boost as a submodule. Boost 1.75 was known to work, 1.76 is known to be broken, yet other versions are unknown. Having a single tested version limits the surface area of things you're exposed to, and also makes it easier to reproduce and debug issues. PyTorch has zero interesting dependencies at runtime (not even numpy anymore), and only one config of build-time library dependencies that are vendored in third_party/.

There's a few libraries, e.g. pybind11, that are well-maintained and low-risk to unbundle. But most of them aren't like that.

There's of course a trade-off here - in build time, and in "let's find bugs early so we can get them fixed for the greater good" - but on average PyTorch is doing the right thing here if they want users to have a good experience.

> Why vendor six?

six is designed to be vendored. As are other such simple utilities, like versioneer. It's not strange at all - dependencies are expensive.

@hmaarrfk (Contributor, Author)

@rgommers, I really agree with:

> dependencies are fragile.

which is why I brought up the case of support. You want to be able to control it if you are in charge of shipping a product.

I actually think we should likely skip the unbundling, and build an intermediary output instead. I'm mostly using this effort to try to understand them, and understand their build system.

I changed the top post to reflect this, listing "unbundling" and the intermediary library as two distinct options (potentially complementary).

@hmaarrfk (Contributor, Author)

So I think I've gone as far as I want to. I actually got to the point where I was exactly 1 year ago, where I was trying to build ideep.

conda-forge/staged-recipes#7491

Ultimately, my concern isn't the fact that I can build it; I think I can. Rather, my concern is whether or not I can build it with options similar enough to what pytorch tests with. That I'm not super excited about.

@h-vetinari (Member)

> I think you're both missing a very important point: dependencies are fragile. Once you have O(25) dependencies, and would actually express them as dependencies, you become susceptible to a ton of extra bugs (even aside from a ton more packaging/distribution issues). It simply isn't workable.

I agree with you on a lot of this, but let's please avoid assuming who's missing this point or that. I didn't say that everything should be a direct dependency, or that there can't be good reasons for moving to unreleased commits as a stopgap measure (with a work item to move back to a released version as it becomes available), or that it's inherently bad practice (the lack of good tooling forces projects into making really bad trade-offs, but disliking that state of affairs is not an accusation towards anyone).

But with ~60 submodules, not doing that makes integration work pretty much impenetrable, as we've seen for pytorch & tensorflow. I get that this discipline (or extra infrastructure for not using the vendored sources) has low perceived value for companies like Google and Meta, and this is a large part of how the situation got to this point (in addition to the lack of good tooling, e.g. like cargo).

I don't claim to have the answer (mostly questions) - if someone had a cookie-cutter solution, we'd have seen it by now. I still think that untangling this web of dependencies (possibly also into intermediate pieces) would be very worthwhile both for conda-forge itself and for upstream. Sadly, tensorflow hasn't even shown slight interest in fixing their circular build dependencies, so it's an uphill battle, and we have quite a ways to go on that...

@hmaarrfk (Contributor, Author)

@h-vetinari if you want to help on this effort, I think packaging onnx-tensorrt would be very helpful and is quite independent from the effort here.

I don't think it is as easy to plug it in, but I think it does add to the compilation time since I think it is GPU aware. So is fbgemm.

@hmaarrfk (Contributor, Author)

Actually, just building libonnx would likely be a welcome first step!

@hmaarrfk (Contributor, Author)

> But I don't get so many things in that list, especially mature projects. Why vendor six? tbb? fmt? pybind11? The list goes on.

In all fairness, they do provide overrides for "mature projects". We just never felt it was a good idea to use them, since they don't really move the needle in terms of compilation time.

Ultimately, it is the "less mature" projects that they pin to exact commits.

Again, in fairness to them, these are fast-moving projects that seem to have been built quickly, for the specific use case of enabling caffe/caffe2/torch/pytorch.

The other category seems to be GPU packages that need to be built harmoniously with pytorch. Honestly, this feels a little bit like a "conda-forge" problem in the sense that if we had more than 6 hours of compilation time, and likely more than 2 cores to compile on, we could build in the prescribed amount of time.

Pytorch is:

  • Documenting their versions
  • Not depending on any closed source build system

Which is honestly more than we can hope for.

@h-vetinari (Member)

> Pytorch is:
>
>   • Documenting their versions
>   • Not depending on any closed source build system
>
> Which is honestly more than we can hope for.

Yes, that's a great start. I disagree that we can't have higher aspirations though. 🙃

> Honestly, this feels a little bit like a "conda-forge" problem in the sense that if we had more than 6 hours of compilation time, and likely more than 2 cores to compile on, we could build in the prescribed amount of time.

Indisputably, though 6h is already a whole bunch more than we had in pre-azure days. "capable of building on public CI" (in some sequence of individual chunks) is not an unreasonable wish I think.

> @h-vetinari if you want to help on this effort, I think packaging onnx-tensorrt would be very helpful and is quite independent from the effort here.

Yes, interested, but low on time at the moment...

@rgommers

> Indisputably, though 6h is already a whole bunch more than we had in pre-azure days. "capable of building on public CI" (in some sequence of individual chunks) is not an unreasonable wish I think.

Agreed, that would be a good thing to have, and a reasonable ask to upstream (which I'll make next time I meet with build & packaging folks). Looking at the updated table, there's only a couple of builds that don't fit and they're not ridiculously far from the limit: ~ 6h+ (3933/4242 309 remaining). That said, breaking it in half so it comfortably fits would be better.

Another thing that is likely coming in a future release is the ability to reuse the non-CPython-specific parts between builds. Because ~95% of the build doesn't depend on the Python version, having to rebuild everything once for each of the 4 supported Python versions is a massive waste.

@hmaarrfk (Contributor, Author)

@rgommers FWIW, you essentially fly through most of the builds until you get to the large GPU kernels, which need to be compiled for every data type and every GPU architecture, and then all put together. So the "3000 files to compile" vs "1800" is really misleading, since only about 500 files account for most of the compilation time.

As for building as a library: by adjusting the tests, I should be in a good place to get the CPU build of #112 working. It doesn't seem to move the needle very much. Again, that is due to the fact that the intensive stuff still takes as much time as it did before. (The CPU build still takes about 2 hours even without the python stuff.)

@hmaarrfk (Contributor, Author)

OK, I spoke too soon. While you can disable BUILD_PYTHON by setting it to OFF or 0, it seems to be hard to USE the prebuilt library that you install in an earlier run.

There seem to be 3 natural checkpoints that they create for their own reasons that might be helpful to us. These checkpoints already get installed, but in their standard build process they get "copied" into the python module (as required by pip-installed packages):

  1. libc10
  2. libtorch_cpu (this seems like it contains some GPU symbols too). This seems to take 1.5 hrs for CPU-only builds, so on the 6-hour-constrained GPU builds, splitting this off as an extra package would be helpful.
  3. libtorch_gpu

They all seem to get assembled by libtorch
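A hedged sketch of how these checkpoints could map onto a multi-output recipe; the output names, file split, and pins here are assumptions rather than a worked-out design:

```yaml
outputs:
  # Hypothetical C++ core output: would ship libc10, libtorch_cpu,
  # libtorch_gpu and the libtorch shell library.
  - name: libtorch
    build:
      run_exports:
        - {{ pin_subpackage('libtorch', max_pin='x.x') }}
  # Hypothetical python output: only the bindings, rebuilt per python version.
  - name: pytorch
    requirements:
      host:
        - python
        - {{ pin_subpackage('libtorch', exact=True) }}
      run:
        - python
```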

@hmaarrfk (Contributor, Author) commented Jun 1, 2022

I'm not really sure conda is set up to detect the precise hardware; rather, it detects the version of the cuda library.

It is quite hard to choose a hardware cutoff value. I don't really want to be choosing it at this level.

I personally have some setups with new and old GPUs. Crazy right! Though I may be an exception. I would be happy if things worked on my fancy new one.

@hmaarrfk (Contributor, Author) commented Jun 2, 2022

> even more radically, we can even try to split our packages into compute targets, not cuda targets.

Please open another issue regarding dropping architectures.

Maybe: https://github.com/conda-forge/conda-forge.github.io/issues?q=is%3Aissue+is%3Aopen+gpu

@ngam (Contributor) commented Jun 6, 2022

Do you think it would be reasonable to ask them to:

  1. Split off the pytorch build more cleanly from the libtorch build?

This, in theory, should be quite beneficial to upstream. Btw, mxnet does exactly this and it works quite well in my experience. Mentioning mxnet as an option in case you want to see their build setup. (I am not sure about number 2 in this list; I don't have a full understanding)

@hadim (Member) commented Dec 17, 2022

If the main motivation to split pytorch into smaller packages is CI time constraints, then what about GH Large Runners?

Just throwing an idea out here in case it can decrease the maintenance burden. It seems more and more important as more and more conda-forge packages are built against pytorch.

@hmaarrfk (Contributor, Author)

> If the main motivation to split pytorch into smaller packages is CI time constraints

This is an important motivation. And likely the most critical one.

As a second bonus, I would rather not have 4x the number of uploads, one for each python version.

> then what about GH Large Runners?

I'm not sure how to use them at Conda-forge. Do you know how to enable them? PR welcome!

@hadim (Member) commented Dec 17, 2022

We use github_actions as the main CI in our private conda-forge-like organization but it seems like it's not allowed to do that on conda-forge.

When editing conda-forge.yml and adding:

provider:
  linux_64: ["github_actions"]
  osx_64: ["github_actions"]
  win_64: ["github_actions"]

then regeneration fails because of:

INFO:conda_smithy.configure_feedstock:Applying migrations: /tmp/tmpba_a2ikw/share/conda-forge/migrations/python311.yaml
Traceback (most recent call last):
  File "/home/hadim/local/micromamba/bin/conda-smithy", line 10, in <module>
    sys.exit(main())
  File "/home/hadim/local/micromamba/lib/python3.10/site-packages/conda_smithy/cli.py", line 670, in main
    args.subcommand_func(args)
  File "/home/hadim/local/micromamba/lib/python3.10/site-packages/conda_smithy/cli.py", line 486, in __call__
    self._call(args, tmpdir)
  File "/home/hadim/local/micromamba/lib/python3.10/site-packages/conda_smithy/cli.py", line 491, in _call
    configure_feedstock.main(
  File "/home/hadim/local/micromamba/lib/python3.10/site-packages/conda_smithy/configure_feedstock.py", line 2289, in main
    render_github_actions(env, config, forge_dir, return_metadata=True)
  File "/home/hadim/local/micromamba/lib/python3.10/site-packages/conda_smithy/configure_feedstock.py", line 1275, in render_github_actions
    return _render_ci_provider(
  File "/home/hadim/local/micromamba/lib/python3.10/site-packages/conda_smithy/configure_feedstock.py", line 653, in _render_ci_provider
    raise RuntimeError(
RuntimeError: Using github_actions as the CI provider inside conda-forge github org is not allowed in order to avoid a denial of service for other infrastructure.

Also that would only enable the regular GH Actions workers and not the large runners ones for which I think we must pay (that being said it's probably worth putting some money on this, happy to contribute as well).

@hmaarrfk do you think it would be possible to make an exception here by enabling GH Actions as CI only for that repo? That would be only to perform a couple of build experiments and check whether it's worth it or not before moving to potentially large runners.

@hmaarrfk (Contributor, Author)

> that being said it's probably worth putting some money on this, happy to contribute as well)

Hmm. I'm not sure how donations are managed. Let's not get sidetracked by this conversation here, but maybe you can express your desires in https://github.com/conda-forge/conda-forge.github.io for greater visibility.

> @hmaarrfk do you think it would be possible to make an exception here by enabling GH Actions as CI only for that repo? That would be only to perform a couple of build experiments and check whether it's worth it or not before moving to potentially large runners.

You can probably edit out the check in configure_feedstock.py yourself. Have you tried that?

@h-vetinari (Member)

> > that being said it's probably worth putting some money on this, happy to contribute as well)
>
> Hmm. I'm not sure how donations are managed. Let's not get sidetracked by this conversation here, but maybe you can express your desires in https://github.com/conda-forge/conda-forge.github.io for greater visibility.

See here. There have been ongoing efforts to get something like this done for well over two years, but there are a lot of moving pieces (not all of them technical) to sort out.

@h-vetinari (Member)

Just saw this recent upstream issue about splitting off a libtorch -- that would be amazing for us. Given the 6h timeout limit, I'd suggest we build this on a different feedstock and then depend on it here.
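If a standalone libtorch package did materialize (on another feedstock or as an output here), the consuming side of this recipe might look roughly like the following; the package name and pinning style are assumptions:

```yaml
requirements:
  host:
    - python
    - libtorch        # hypothetical prebuilt C++ core, built once per platform/CUDA variant
  run:
    - python
    - {{ pin_compatible('libtorch', max_pin='x.x') }}
```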

@hmaarrfk (Contributor, Author) commented Nov 2, 2023

I feel like it might be time to try again with pytorch 2.x... I'm just kinda tired of locking up some of my servers compiling this stuff.

@carterbox (Member)

I've been running build benchmarks recently by piping the build logs through ts. I don't have any results yet, but the ideas I've been playing with are:

  1. Building for major archs only
  2. Trying to speed up linking by using mold instead of ld
  3. Playing with NVCC compile options: specifically --threads which was introduced in CUDA 11.5 and separable compilation

I'm compiling libtorch without python. If I can't get that below 6 hours with 2 cores, then it's still not plausible to build the entire package on the feedstock.

@hmaarrfk (Contributor, Author) commented Nov 2, 2023

I would be happy just having to build one or two libraries, to then start a ci job for all the different python packages.

These libraries could be built in different feedstocks if needed.

@carterbox (Member)

🤔 You are suggesting that you would build libtorch offline (at most 2 archs x 3 platforms x 2 blas x 2 cuda), then the feedstock would build pytorch (at most 4 python x 2 archs x 3 platforms).

platforms - osx, win, linux
archs - arm, ppc64le, x86
blas - mkl, openblas
cuda - 11.8, 12.0

That makes some sense. Do we already have a feel for how much time it takes to compile libtorch vs the python extension modules? Do the Python extension modules even have a CUDA dependence or do they just link to any libtorch_cuda.so?
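For concreteness, those axes would roughly correspond to variant keys along these lines in a conda_build_config.yaml; the key names and values below are illustrative, since the real feedstock inherits most of this from conda-forge's global pinnings:

```yaml
blas_impl:
  - mkl
  - openblas
cuda_compiler_version:
  - "None"    # CPU-only variant
  - "11.8"
  - "12.0"
python:
  - "3.9"
  - "3.10"
  - "3.11"
  - "3.12"
```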

@hmaarrfk (Contributor, Author) commented Nov 3, 2023

platforms would be limited to at most linux + cuda.

Others seem fine

@carterbox (Member)

Here are the results from my local machine for build-only (no setup, i.e. not counting cloning or downloading deps); the build time difference between -DBUILD_PYTHON:BOOL=ON/OFF seems negligible.

On my machine, using cmake --parallel 2, CUDA 12.0, and nvcc --threads 2:

  • 4.50 hours major and minor archs listed in the current recipe
  • 2.75 hours only major archs
  • 2.50 hours only major archs using mold as the linker.

Not sure how much slower it will be running in a docker container on the CI.

In summary, the most immediate strategy for reducing build times, which is not discussed above, would be to prune the cuda target archs to major only. This may reduce build times by somewhere between a third and a half? Who knows, it might bring the build time down to an unreliable 5.9 hours. 😆

As mentioned above, patching upstream's CMakeLists so that pytorch can be built separately from libtorch (in another feedstock) would probably be helpful too. Since the python-specific build time seems negligible, this won't reduce the build time for a single variant, but it should reduce the build matrix and thus the total build time over all variants.

@hmaarrfk (Contributor, Author) commented Nov 4, 2023

I don't really want to have the conversation about supported architectures in individual feedstocks.

Can we have the discussion in a more central location, like:
https://github.com/conda-forge/cuda-feedstock

Then maybe we can have best practices established.

@isuruf (Member) commented Nov 4, 2023

Agree with @hmaarrfk. Please have a look at #114 too.

@carterbox (Member)

I already started a discussion about standardizing the archs that feedstocks target at the conda-forge.github.io repo: conda-forge/conda-forge.github.io#1901. I'd be happy to move the discussion there. I don't think the cuda-feedstock is the place for that discussion, because it's not an issue with the cuda package itself; it's a discussion about our channel policy and is more similar to whether or not packages should target special instruction sets like AVX-512.

@hmaarrfk (Contributor, Author) commented Oct 8, 2024

In 2024, the most important aspect might be the fact that it currently takes about 1 hour on our CIs to get through Intel ideep. If we unvendor it, it could speed up iteration time on our linux CIs.

@mgorny mentioned this issue Nov 8, 2024
mgorny added a commit to mgorny/pytorch-cpu-feedstock that referenced this issue Nov 8, 2024
Patch pytorch to use the system mkldnn library from the onednn package
rather than building one locally from within ideep submodules.  Given
that ideep itself is a header-only library, I presume this is what
was meant in conda-forge#108 (comment),
and indeed unvendoring onednn seems to improve build time significantly.

That said, our onednn package does not support GPU runtime
(conda-forge/onednn-feedstock#44) but at least
according to my testing, that part of the library was not enabled by our
PyTorch builds before (due to missing SYCL).

The patch is a bit hacky, and probably needs some polishing before being
submitted upstream (and testing on other platforms).

Part of issue conda-forge#108
@mgorny (Contributor) commented Nov 15, 2024

> In 2024, the most important aspect might be the fact that it currently takes about 1 hour on our CIs to get through Intel ideep. If we unvendor it, it could speed up iteration time on our linux CIs.

For the record, as I've noted in #289, ideep is a header-only library and the part taking lots of time is mkldnn (AKA oneDNN) — which for some reason is vendored inside ideep but built directly.

That said, I've done some timings, using the non-CUDA build for a start. According to my numbers, mkldnn took around 6 minutes here, and the next unvendoring candidate would be XNNPACK — at a glance, it seems to take around 3 minutes, but it's hard to get exact numbers, because it isn't built in one big chunk here, but split between other libtorch files.

Beyond these two dependencies, I didn't notice anything else taking significant time. I've used linux_64_blas_implgenericc_compiler_version13c_stdlib_version2.17cuda_compilerNonecuda_compiler_versionNonecxx_compiler_version13 setup for these timings.
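For context, the unvendoring in that patch boils down to taking oneDNN from conda-forge instead of building the copy vendored inside the ideep submodule; in recipe terms that is roughly the following (the onednn package exists on conda-forge, and the GPU-runtime caveat from the commit message still applies):

```yaml
requirements:
  host:
    - onednn    # provides libdnnl; replaces the mkldnn copy built from the ideep submodule
```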

@hmaarrfk (Contributor, Author)

> at a glance, it seems to take around 3 minutes, but it's hard to get exact numbers, because it isn't built in one big chunk here, but split between other libtorch files.

This is what I recall in my build times.

I think onednn is a good clue. It takes approximately 1 hour for that compilation to happen on our CIs.

https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=1075634&view=logs&jobId=b2359324-026b-5f33-b53d-e15b134e3e00

But as you said, it is hard to estimate the "real world" improvements here, especially if they come at the cost of complexity in package maintenance in the onednn feedstock.
