Splitting this package in manageable chunks #108
Cloning with the script in gh-109 looks interesting. It should work like that, I guess; I just forgot why we weren't doing it that way. |
I think the problem is that conda first clones the main branch with depth 1, then cannot switch to an older tag for the requested version. It also didn't play well with caching. It is somewhat of a job to unbundle, but I guess I find it worthwhile if it means we can release this more easily. I'm hoping I can patch things in a way that is acceptable upstream. |
I remember what conda tried to do:
That's not super valuable for CI workflows, but it makes it hard to do a shallow clone. At the time I couldn't think of a solution to propose upstream to conda-build. |
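A rough sketch of the failure mode with plain git (tag name illustrative; conda-build drives this internally, so the exact commands it runs differ):

```bash
# a shallow clone only has the tip of the default branch
git clone --depth 1 https://github.com/pytorch/pytorch.git
cd pytorch
git checkout v1.13.1        # fails: the tag's commit is not in the shallow history
# a workaround is to shallow-fetch the tag explicitly before checking it out
git fetch --depth 1 origin tag v1.13.1
git checkout v1.13.1
```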
Makes sense. I have no problem with unbundling provided it doesn't change the sources that are built. The kind of unbundling Linux distros do, like "hey, this project is pinned to version X of a dependency, but we insist on using our own version Y", is much more problematic, because you then build a combination of sources that is not tested at all in upstream CI, and may be plain buggy. |
We kinda do that with a lot of C dependencies, don't we (not as much in pytorch)? My hope is that I can split off onnx and ATen in versions that match pytorch. |
You can follow somewhat of a first pass at step 2 here: conda-forge/staged-recipes#19103 (comment). There are quite a few header-only and other libraries that get downloaded on the fly using custom cmake code. That isn't really fun. So even what I did in gh-109 isn't really complete, in terms of not downloading during build. |
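For a sense of scale, a couple of one-liners run from a full pytorch checkout enumerate the pinned submodules (these only cover the git submodules, not the extra things the custom cmake code downloads):

```bash
# list the vendored submodule paths recorded in .gitmodules
git config -f .gitmodules --get-regexp '^submodule\..*\.path$' | awk '{print $2}'
# rough count of vendored projects, including nested submodules
git submodule status --recursive | wc -l
```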
That list of submodules is insane. Reminds me of what I quipped in #76 when I first came across that:
Seems even that was underestimating the extent of the issue. Unsurprisingly, I really dislike this "we vendor specific commits of open source projects" development model - it's a very "my castle" approach. On the other hand, I see where it is coming from, with C/C++'s complete lack of standardised tooling around distribution. |
But I don't get so many things in that list, especially mature projects. Why vendor six? tbb? fmt? pybind11? The list goes on. All in all, I fully support ripping this apart one by one (hopefully even in ways that would be palatable upstream), but I get Ralf's point about not diverging from what's actually being tested - though I'd be fine to caveat that based on an actually conceivable risk of breakage (e.g. if there are no functional changes between the vendored commit and a released version in a given submodule) |
Right. This is likely what the original creators were grappling with. They decided either to use git submodules for certain projects, or cmake code to download the things they needed. Bazel does the same.
The issue comes down to who is in charge of support. pytorch (and facebook) cannot force upstream projects to ship the fixes they need on their schedule. When pip was the only option, you were beholden to the creator of the original package on PyPI. |
Sure, but what's missing IMO is closing the loop to a released version with the bugfix afterwards. |
It's pretty hard to make a business case for spending a few hours, and likely more, submitting a fix upstream after you have already fixed things for your users. Anyway, I'm just going through and listing things that need to be done. There are a few big packages that we might be able to take advantage of. |
I think you're both missing a very important point: dependencies are fragile. Once you have O(25) dependencies, and would actually express them as dependencies, you become susceptible to a ton of extra bugs (even aside from a ton more packaging/distribution issues). It simply isn't workable. I had to explain the same thing when SciPy added Boost as a submodule: Boost 1.75 was known to work, 1.76 was known to be broken, and other versions were unknown. Having a single tested version limits the surface area of things you're exposed to, and also makes it easier to reproduce and debug issues. PyTorch has zero interesting dependencies at runtime. There are a few libraries that are exceptions. There's of course a trade-off here, in build time and in "let's find bugs early so we can get them fixed for the greater good", but on average PyTorch is doing the right thing here if they want users to have a good experience.
|
rgommers, I really agree with:
which is why I brought up the case of support. You want to be able to control it if you are in charge of shipping a product. I actually think we should likely skip the unbundling and build an intermediary output instead. I'm mostly using this effort to try to understand them and their build system. I changed the top post to reflect this, listing "unbundling" and the intermediary library as two distinct options (potentially complementary). |
So I think I've gone as far as I want to. I'm actually at the point where I was exactly 1 year ago, when I was trying to build ideep: conda-forge/staged-recipes#7491. Ultimately, my concern isn't whether I can build it; I think I can. Rather, my concern is whether I can build it with options similar enough to what pytorch tests with. That I'm not super excited about. |
I agree with you on a lot of this, but let's please avoid assuming who's missing this point or that. I didn't say that everything should be a direct dependency, or that there can't be good reasons for moving to unreleased commits as a stopgap measure (with a work item to move back to a released version as it becomes available), or that it's inherently bad practice (the lack of good tooling forces projects into making really bad trade-offs, but disliking that state of affairs is not an accusation towards anyone). But with ~60 submodules, not doing that makes integration work pretty much impenetrable, as we've seen for pytorch & tensorflow. I get that this discipline (or extra infrastructure for not using the vendored sources) has low perceived value for companies like Google and Meta, and this is a large part of how the situation got to this point (in addition to the lack of good tooling, e.g. like cargo). I don't claim to have the answer (mostly questions) - if someone had a cookie-cutter solution, we'd have seen it by now. I still think that untangling this web of dependencies (possibly also into intermediate pieces) would be very worthwhile both for conda-forge itself and for upstream. Sadly, tensorflow hasn't even shown slight interest in fixing their circular build dependencies, so it's an uphill battle, and we have quite a ways to go on that... |
@h-vetinari if you want to help on this effort, I think packaging onnx-tensorrt would be very helpful and is quite independent from the effort here. I don't think it is as easy to plug it in, but I think it does add to the compilation time since I think it is GPU-aware. So is fbgemm. |
Actually, just building libonnx would likely be a welcome first step! |
In all fairness, they do provide overrides for "mature projects". We just never felt it was a good idea to use them, since they don't really move the needle in terms of compilation time. Ultimately, it is the "less mature" projects that they pin to exact commits. Again, in fairness to them, these are fast-moving projects that seem to have been built quickly, for the specific use case of enabling caffe/caffe2/torch/pytorch. The other category seems to be GPU packages that need to be built harmoniously with pytorch. Honestly, this feels a little bit like a "conda-forge" problem, in the sense that if we had more than 6 hours of compilation time, and likely more than 2 cores to compile on, we could build in the prescribed amount of time. Pytorch is:
Which is honestly more than we can hope for. |
Yes, that's a great start. I disagree that we can't have higher aspirations though. 🙃
Indisputably, though 6h is already a whole bunch more than we had in pre-azure days. "capable of building on public CI" (in some sequence of individual chunks) is not an unreasonable wish I think.
Yes, interested, but low on time at the moment... |
Agreed, that would be a good thing to have, and a reasonable ask to upstream (which I'll make next time I meet with the build & packaging folks). Looking at the updated table, there are only a couple of builds that don't fit, and they're not ridiculously far from the limit. Another thing that is likely coming in a future release is the ability to reuse the non-CPython-specific parts between builds: 95% of the build doesn't depend on the Python version, so having to rebuild everything 4 times, once per supported Python version, is a massive waste. |
@rgommers FWIW, you essentially fly through most of the build until you get to the large GPU kernels, which need to be compiled for every data type and every GPU architecture, and then all put together. So the "3000 files to compile" vs "1800" is really misleading, since only ~500 files account for most of the compilation time. As for building as a library: by adjusting the tests, I should be in a good place to get the CPU build of #112 working. It doesn't seem to move the needle very much, again because the intensive stuff still takes as much time as it did before. (The CPU build still takes about 2 hours even without the python stuff.) |
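As an aside, those "N files" counters are just the build tool's progress prefix in the log, so the last one seen in a timed-out CI log gives a rough sense of how far the build got. A sketch, assuming a ninja/make-style `[N/M]` prefix and a saved `build.log`:

```bash
# print the last progress marker seen in the log, e.g. "[3933/4242]"
grep -o '\[[0-9]*/[0-9]*\]' build.log | tail -n 1
```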
OK, I spoke too soon. While you can disable the Python parts of the build, it seems to be hard to USE the prebuilt library that you install in an earlier run. There seem to be 3 natural checkpoints that they create for their own reasons and that might be helpful to us. These checkpoints already get installed, but in their standard build process they get "copied" into the python module (as required by pip-installed packages)
They all seem to get assembled by |
I'm not really sure conda is set up to detect the precise hardware; rather, it detects the version of the cuda library. It is quite hard to choose a hardware cutoff value, and I don't really want to be choosing it at this level. I personally have some setups with both new and old GPUs. Crazy, right? Though I may be an exception. I would be happy if things worked on my fancy new one. |
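For what it's worth, a small sketch of what conda actually keys off (package spec and version below are illustrative): the driver-reported CUDA version exposed as the `__cuda` virtual package, not the GPU model or architecture.

```bash
# the "virtual packages" section of `conda info` shows the detected __cuda version
conda info
# for testing, the detected value can be overridden explicitly
CONDA_OVERRIDE_CUDA=11.8 conda create -n torch-test pytorch
```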
Please open another issue regarding dropping architectures. Maybe in: https://github.com/conda-forge/conda-forge.github.io/issues?q=is%3Aissue+is%3Aopen+gpu |
This, in theory, should be quite beneficial to upstream. Btw, mxnet does exactly this, and it works quite well in my experience. I'm mentioning mxnet in case you want to see their build setup. (I am not sure about number 2 in this list; I don't have a full understanding of it.) |
If the main motivation to split pytorch into smaller packages is CI time constraints, then what about GH Large Runners? Just throwing an idea out here in case it can decrease the maintenance burden. It seems more and more important as more and more conda-forge packages are built against pytorch. |
This is an important motivation, and likely the most critical one. As a second bonus, I would rather not have 4x the number of uploads, one set for each python version.
I'm not sure how to use them on conda-forge. Do you know how to enable them? PR welcome! |
We configure CI providers via `conda-forge.yml`. When editing `provider:` to:

```yaml
provider:
  linux_64: ["github_actions"]
  osx_64: ["github_actions"]
  win_64: ["github_actions"]
```

then regeneration fails because of:
Also, that would only enable the regular GH Actions workers and not the large-runner ones, for which I think we must pay (that being said, it's probably worth putting some money on this; happy to contribute as well). @hmaarrfk do you think it would be possible to make an exception here by enabling GH Actions as CI only for this repo? That would be only to perform a couple of build experiments and check whether it's worth it or not before potentially moving to large runners. |
Hmm. I'm not sure how donations are managed. Let's not get sidetracked by this conversation here, but maybe you can express your desires in https://github.com/conda-forge/conda-forge.github.io for greater visibility.
You can probably edit out the check in configure_feedstock.py yourself. Have you tried that? |
See here. There have been ongoing efforts to get something like this done for well over two years, but there are a lot of moving pieces (not all of them technical) to sort out. |
Just saw this recent upstream issue about splitting off a |
I feel like it might be time to try again with pytorch 2.x... I'm just kinda tired of locking up some of my servers compiling this stuff. |
I've been running build benchmarks recently by piping the build logs through
I'm compiling libtorch without python. If I can't get that below 6 hours with 2 cores, then it's still not plausible to build the entire package on the feedstock. |
I would be happy just having to build one or two libraries, and then start a CI job for all the different python packages. These libraries could be built in different feedstocks if needed. |
🤔 You are suggesting that you would build libtorch offline (at most 2 archs x 3 platforms x 2 blas x 2 cuda), then the feedstock would build pytorch (at most 4 python x 2 archs x 3 platforms), with the platforms being osx, win, and linux. That makes some sense. Do we already have a feel for how much time it takes to compile libtorch vs the python extension modules? Do the Python extension modules even have a CUDA dependence, or do they just link to any libtorch_cuda.so? |
Platforms would be limited to at most linux + cuda; the others seem fine. |
Here are the results from my local machine for the build only (no setup such as cloning or downloading deps); the build time difference between -DBUILD_PYTHON:BOOL=ON/OFF seems negligible. On my machine, using
Not sure how much slower it will be running in a docker container on the CI. In summary, the most immediate strategy for reducing build times, which is not discussed above, would be to prune the cuda target archs to major versions only. This may cut build times by somewhere between a third and a half? Who knows, it might bring the build time down to an unreliable 5.9 hours. 😆 As mentioned above, patch work with upstream on the CMakeLists so that pytorch can be built separately from libtorch (in another feedstock) would probably be helpful too. Since the python-specific build time seems negligible, this won't reduce build time for a single variant, but it should reduce the build matrix and thus the total build time over all variants. |
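A minimal sketch of the arch-pruning idea (TORCH_CUDA_ARCH_LIST is pytorch's standard build variable; the exact list the feedstock should use is an open question, so these values are illustrative):

```bash
# build CUDA kernels only for major architectures, plus PTX for forward compat
export TORCH_CUDA_ARCH_LIST="6.0;7.0;8.0+PTX"
python setup.py bdist_wheel   # or however the recipe drives the build
```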
I don't really want to have the conversation about supported architectures in individual feedstocks. Can we have the discussion in a more central location, like: Then maybe we can have best practices established. |
I already started a discussion about standardizing the archs that feedstocks target in the conda-forge.github.io repo: conda-forge/conda-forge.github.io#1901. I'd be happy to move the discussion there. I don't think the cuda-feedstock is the place for that discussion, because it's not an issue with the cuda package itself; it's a discussion about our channel policy, and is more similar to whether or not packages should target special instruction sets like AVX-512. |
In 2024, the most important aspect might be the fact that it currently takes about 1 hour on our CIs to get through intel ideep. If we unvendor it, it could speed up iteration time on our linux CIs. |
Patch pytorch to use the system mkldnn library from the onednn package rather than building one locally from within ideep submodules. Given that ideep itself is a header-only library, I presume this is what was meant in conda-forge#108 (comment), and indeed unvendoring onednn seems to improve build time significantly. That said, our onednn package does not support GPU runtime (conda-forge/onednn-feedstock#44) but at least according to my testing, that part of the library was not enabled by our PyTorch builds before (due to missing SYCL). The patch is a bit hacky, and probably needs some polishing before being submitted upstream (and testing on other platforms). Part of issue conda-forge#108
For the record, as I've noted in #289, ideep is a header-only library, and the part taking lots of time is mkldnn (AKA oneDNN), which for some reason is vendored inside ideep but built directly. That said, I've done some timings, using the non-CUDA build for a start. According to my numbers, mkldnn took around 6 minutes here, and the next unvendoring candidate would be XNNPACK; at a glance, it seems to take around 3 minutes, but it's hard to get exact numbers because it isn't built in one big chunk here, but split between other libtorch files. Beyond these two dependencies, I didn't notice anything else taking significant time. I've used |
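As a possible starting point for unvendoring, pytorch exposes `USE_SYSTEM_*` CMake options for some of its bundled deps; a sketch of checking which ones a given checkout supports (run from the pytorch source root, and note that option availability varies by version):

```bash
# list the USE_SYSTEM_* switches declared in the build files
grep -rhoE 'USE_SYSTEM_[A-Z0-9_]+' CMakeLists.txt cmake/ | sort -u
# e.g. point the build at a system XNNPACK instead of the submodule,
# assuming a compatible xnnpack is present in the build environment
cmake -S . -B build -DUSE_SYSTEM_XNNPACK=ON
```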
This matches what I recall of my build times. I think the onednn finding is a good clue; it takes approximately 1 hour for that compilation to happen on our CIs. But as you said, it is hard to estimate the "real world" improvements here, especially if they come at the cost of complexity in package maintenance in the onednn feedstock. |
This package currently requires more than 16 builds to be built manually to ensure that it completes in time on the CIs.
Step 1: No more git clone
rgommers identified that one portion of the build process that takes time is cloning the repository. In my experience, cloning the 1.5GB repo can take up to 10 min on my powerful local machine, but I feel like it can take much longer on the CIs.
To avoid cloning, we will have to either list out all the submodules manually, or make them conda-forge installable dependencies.
I mostly got this working using a recursive script which should help us keep it maintained: #109
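A hand-rolled sketch of the same idea (the script in #109 automates discovering these; the URLs, version, and the `<pinned-commit>` placeholder below are illustrative, not the recipe's actual pins):

```bash
# fetch the pytorch release tarball instead of doing a recursive clone
curl -L -o pytorch.tar.gz \
  https://github.com/pytorch/pytorch/archive/refs/tags/v1.10.0.tar.gz
tar xf pytorch.tar.gz
# then drop each pinned submodule tarball into its expected third_party/ path
mkdir -p pytorch-1.10.0/third_party/pthreadpool
curl -L https://github.com/Maratyszcza/pthreadpool/archive/<pinned-commit>.tar.gz \
  | tar xz -C pytorch-1.10.0/third_party/pthreadpool --strip-components=1
```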
Option 1: Split off Dependencies:
pthreadpool: splitting off pthreadpool runs into issues on OSX; third_party fp16.py cannot be found, and the other failure is with psimd.

Option 2 - step 1: Build a libpytorch package or something
By setting BUILD_PYTHON=OFF in #112, we then end up with the following libraries in lib and include:
Option 2 - step 2: Depend on new ATen/libpytorch package
Compilation time progress
3933/4242 (309 remaining)
3897/4242 (345 remaining)
3924/4242 (318 remaining)
1656/1969 (313 remaining)
3962/4242 (280 remaining)

There are approximately: