Splitting this package in manageable chunks #108
Cloning with the script in gh-109 looks interesting. It should work like that, I guess; I just forgot why we weren't doing it that way. |
I think the problem is that conda first clones the main branch with depth 1, then cannot switch to an older tag for the requested version. It also didn't play well with caching. It is somewhat of a job to unbundle, but I guess I find it worthwhile if it means we can release this more easily. I'm hoping I can patch things in a way that is acceptable upstream. |
I remember what conda tried to do:
That's not super valuable for CI workflows, but it makes it hard to do a shallow clone. At the time I couldn't think of a solution to propose upstream to conda-build. |
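A rough sketch of the failure mode with plain git (tag name illustrative; conda-build drives this internally, so the exact commands it runs differ):

```bash
# a shallow clone only has the tip of the default branch
git clone --depth 1 https://github.com/pytorch/pytorch.git
cd pytorch
git checkout v1.13.1        # fails: the tag's commit is not in the shallow history
# a workaround is to shallow-fetch the tag explicitly before checking it out
git fetch --depth 1 origin tag v1.13.1
git checkout v1.13.1
```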
Makes sense. I have no problem with unbundling provided it doesn't change the sources that are built. The kind of unbundling Linux distros do, like "hey, this project is pinned to version X of a dependency, but we insist on using our own version Y", is much more problematic, because you then build a combination of sources that is not tested at all in upstream CI, and may be plain buggy. |
We kinda do that with a lot of C dependencies, don't we (not as much in pytorch)? My hope is that I can split off onnx and ATen in versions that match pytorch. |
You can follow somewhat of a first pass at step 2 here: conda-forge/staged-recipes#19103 (comment). There are quite a few header-only and other libraries that get downloaded on the fly using custom cmake code. That isn't really fun. So even what I did in gh-109 isn't really complete, in terms of not downloading during build. |
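For a sense of scale, a couple of one-liners run from a full pytorch checkout enumerate the pinned submodules (these only cover the git submodules, not the extra things the custom cmake code downloads):

```bash
# list the vendored submodule paths recorded in .gitmodules
git config -f .gitmodules --get-regexp '^submodule\..*\.path$' | awk '{print $2}'
# rough count of vendored projects, including nested submodules
git submodule status --recursive | wc -l
```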
That list of submodules is insane. Reminds me of what I quipped in #76 when I first came across that:
Seems even that was underestimating the extent of the issue. Unsurprisingly, I really dislike this "we vendor specific commits of open source projects" development model - it's a very "my castle" approach. On the other hand, I see where it is coming from, with C/C++'s complete lack of standardised tooling around distribution. |
But I don't get so many things in that list, especially mature projects. Why vendor six? tbb? fmt? pybind11? The list goes on. All in all, I fully support ripping this apart one by one (hopefully even in ways that would be palatable upstream), but I get Ralf's point about not diverging from what's actually being tested - though I'd be fine to caveat that based on an actually conceivable risk of breakage (e.g. if there are no functional changes between the vendored commit and a released version in a given submodule) |
Right. This is likely what the original creators were grappling with. They decided either to use git submodules for certain projects, or cmake code to download the things they needed. Bazel does the same.
The issue comes down to who is in charge of support. pytorch (and facebook) cannot force upstream projects to ship the fixes they need on their schedule. When pip was the only option, you were beholden to the creator of the original package on PyPI. |
Sure, but what's missing IMO is closing the loop to a released version with the bugfix afterwards. |
It's pretty hard to make a business case for spending a few hours, and likely more, submitting a fix upstream after you have already fixed things for your users. Anyway, I'm just going through and listing things that need to be done. There are a few big packages that we might be able to take advantage of. |
I think you're both missing a very important point: dependencies are fragile. Once you have O(25) dependencies, and would actually express them as dependencies, you become susceptible to a ton of extra bugs (even aside from a ton more packaging/distribution issues). It simply isn't workable. I had to explain the same thing when SciPy added Boost as a submodule: Boost 1.75 was known to work, 1.76 was known to be broken, and other versions were unknown. Having a single tested version limits the surface area of things you're exposed to, and also makes it easier to reproduce and debug issues. PyTorch has zero interesting dependencies at runtime. There are a few libraries that are exceptions. There's of course a trade-off here, in build time and in "let's find bugs early so we can get them fixed for the greater good", but on average PyTorch is doing the right thing here if they want users to have a good experience.
|
rgommers, I really agree with:
which is why I brought up the case of support. You want to be able to control it if you are in charge of shipping a product. I actually think we should likely skip the unbundling and build an intermediary output instead. I'm mostly using this effort to try to understand them and their build system. I changed the top post to reflect this, listing "unbundling" and the intermediary library as two distinct options (potentially complementary). |
So I think I've gone as far as I want to. I'm actually at the point where I was exactly 1 year ago, when I was trying to build ideep: conda-forge/staged-recipes#7491. Ultimately, my concern isn't whether I can build it; I think I can. Rather, my concern is whether I can build it with options similar enough to what pytorch tests with. That I'm not super excited about. |
I agree with you on a lot of this, but let's please avoid assuming who's missing this point or that. I didn't say that everything should be a direct dependency, or that there can't be good reasons for moving to unreleased commits as a stopgap measure (with a work item to move back to a released version as it becomes available), or that it's inherently bad practice (the lack of good tooling forces projects into making really bad trade-offs, but disliking that state of affairs is not an accusation towards anyone). But with ~60 submodules, not doing that makes integration work pretty much impenetrable, as we've seen for pytorch & tensorflow. I get that this discipline (or extra infrastructure for not using the vendored sources) has low perceived value for companies like Google and Meta, and this is a large part of how the situation got to this point (in addition to the lack of good tooling, e.g. like cargo). I don't claim to have the answer (mostly questions) - if someone had a cookie-cutter solution, we'd have seen it by now. I still think that untangling this web of dependencies (possibly also into intermediate pieces) would be very worthwhile both for conda-forge itself and for upstream. Sadly, tensorflow hasn't even shown slight interest in fixing their circular build dependencies, so it's an uphill battle, and we have quite a ways to go on that... |
@h-vetinari if you want to help on this effort, I think packaging onnx-tensorrt would be very helpful and is quite independent from the effort here. I don't think it is as easy to plug it in, but I think it does add to the compilation time since I think it is GPU-aware. So is fbgemm. |
Actually, just building libonnx would likely be a welcome first step! |
In all fairness, they do provide overrides for "mature projects". We just never felt it was a good idea to use them, since they don't really move the needle in terms of compilation time. Ultimately, it is the "less mature" projects that they pin to exact commits. Again, in fairness to them, these are fast-moving projects that seem to have been built quickly, for the specific use case of enabling caffe/caffe2/torch/pytorch. The other category seems to be GPU packages that need to be built harmoniously with pytorch. Honestly, this feels a little bit like a "conda-forge" problem, in the sense that if we had more than 6 hours of compilation time, and likely more than 2 cores to compile on, we could build in the prescribed amount of time. Pytorch is:
Which is honestly more than we can hope for. |
Yes, that's a great start. I disagree that we can't have higher aspirations though. 🙃
Indisputably, though 6h is already a whole bunch more than we had in pre-azure days. "capable of building on public CI" (in some sequence of individual chunks) is not an unreasonable wish I think.
Yes, interested, but low on time at the moment... |
Agreed, that would be a good thing to have, and a reasonable ask to upstream (which I'll make next time I meet with the build & packaging folks). Looking at the updated table, there are only a couple of builds that don't fit, and they're not ridiculously far from the limit. Another thing that is likely coming in a future release is the ability to reuse the non-CPython-specific parts between builds: 95% of the build doesn't depend on the Python version, so having to rebuild everything 4 times, once per supported Python version, is a massive waste. |
@rgommers FWIW, you essentially fly through most of the build until you get to the large GPU kernels, which need to be compiled for every data type and every GPU architecture, and then all put together. So the "3000 files to compile" vs "1800" is really misleading, since only ~500 files account for most of the compilation time. As for building as a library: by adjusting the tests, I should be in a good place to get the CPU build of #112 working. It doesn't seem to move the needle very much, again because the intensive stuff still takes as much time as it did before. (The CPU build still takes about 2 hours even without the python stuff.) |
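As an aside, those "N files" counters are just the build tool's progress prefix in the log, so the last one seen in a timed-out CI log gives a rough sense of how far the build got. A sketch, assuming a ninja/make-style `[N/M]` prefix and a saved `build.log`:

```bash
# print the last progress marker seen in the log, e.g. "[3933/4242]"
grep -o '\[[0-9]*/[0-9]*\]' build.log | tail -n 1
```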
OK, I spoke too soon. While you can disable the Python parts of the build, it seems to be hard to USE the prebuilt library that you install in an earlier run. There seem to be 3 natural checkpoints that they create for their own reasons and that might be helpful to us. These checkpoints already get installed, but in their standard build process they get "copied" into the python module (as required by pip-installed packages)
They all seem to get assembled by |
I'm not really sure conda is set up to detect the precise hardware; rather, it detects the version of the cuda library. It is quite hard to choose a hardware cutoff value, and I don't really want to be choosing it at this level. I personally have some setups with both new and old GPUs. Crazy, right? Though I may be an exception. I would be happy if things worked on my fancy new one. |
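For what it's worth, a small sketch of what conda actually keys off (package spec and version below are illustrative): the driver-reported CUDA version exposed as the `__cuda` virtual package, not the GPU model or architecture.

```bash
# the "virtual packages" section of `conda info` shows the detected __cuda version
conda info
# for testing, the detected value can be overridden explicitly
CONDA_OVERRIDE_CUDA=11.8 conda create -n torch-test pytorch
```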
Please open another issue regarding dropping architectures. Maybe in: https://github.com/conda-forge/conda-forge.github.io/issues?q=is%3Aissue+is%3Aopen+gpu |
This, in theory, should be quite beneficial to upstream. Btw, mxnet does exactly this, and it works quite well in my experience. I'm mentioning mxnet in case you want to see their build setup. (I am not sure about number 2 in this list; I don't have a full understanding of it.) |
If the main motivation to split pytorch into smaller packages is CI time constraints, then what about GH Large Runners? Just throwing an idea out here in case it can decrease the maintenance burden. It seems more and more important as more and more conda-forge packages are built against pytorch. |
This is an important motivation, and likely the most critical one. As a second bonus, I would rather not have 4x the number of uploads, one set for each python version.
I'm not sure how to use them on conda-forge. Do you know how to enable them? PR welcome! |
We configure CI providers via `conda-forge.yml`. When editing `provider:` to:

```yaml
provider:
  linux_64: ["github_actions"]
  osx_64: ["github_actions"]
  win_64: ["github_actions"]
```

then regeneration fails because of:
Also, that would only enable the regular GH Actions workers and not the large-runner ones, for which I think we must pay (that being said, it's probably worth putting some money on this; happy to contribute as well). @hmaarrfk do you think it would be possible to make an exception here by enabling GH Actions as CI only for this repo? That would be only to perform a couple of build experiments and check whether it's worth it or not before potentially moving to large runners. |
Hmm. I'm not sure how donations are managed. Let's not get sidetracked by this conversation here, but maybe you can express your desires in https://github.com/conda-forge/conda-forge.github.io for greater visibility.
You can probably edit out the check in configure_feedstock.py yourself. Have you tried that? |
See here. There have been ongoing efforts to get something like this done for well over two years, but there are a lot of moving pieces (not all of them technical) to sort out. |
Just saw this recent upstream issue about splitting off a |
I feel like it might be time to try again with pytorch 2.x... I'm just kinda tired of locking up some of my servers compiling this stuff. |
I've been running build benchmarks recently by piping the build logs through
I'm compiling libtorch without python. If I can't get that below 6 hours with 2 cores, then it's still not plausible to build the entire package on the feedstock. |
I would be happy just having to build one or two libraries, and then start a CI job for all the different python packages. These libraries could be built in different feedstocks if needed. |
🤔 You are suggesting that you would build libtorch offline (at most 2 archs x 3 platforms x 2 blas x 2 cuda), then the feedstock would build pytorch (at most 4 python x 2 archs x 3 platforms), with the platforms being osx, win, and linux. That makes some sense. Do we already have a feel for how much time it takes to compile libtorch vs the python extension modules? Do the Python extension modules even have a CUDA dependence, or do they just link to any libtorch_cuda.so? |
Platforms would be limited to at most linux + cuda; the others seem fine. |
Here are the results from my local machine for the build only (no setup such as cloning or downloading deps); the build time difference between -DBUILD_PYTHON:BOOL=ON/OFF seems negligible. On my machine, using
Not sure how much slower it will be running in a docker container on the CI. In summary, the most immediate strategy for reducing build times, which is not discussed above, would be to prune the cuda target archs to major versions only. This may cut build times by somewhere between a third and a half? Who knows, it might bring the build time down to an unreliable 5.9 hours. 😆 As mentioned above, patch work with upstream on the CMakeLists so that pytorch can be built separately from libtorch (in another feedstock) would probably be helpful too. Since the python-specific build time seems negligible, this won't reduce build time for a single variant, but it should reduce the build matrix and thus the total build time over all variants. |
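A minimal sketch of the arch-pruning idea (TORCH_CUDA_ARCH_LIST is pytorch's standard build variable; the exact list the feedstock should use is an open question, so these values are illustrative):

```bash
# build CUDA kernels only for major architectures, plus PTX for forward compat
export TORCH_CUDA_ARCH_LIST="6.0;7.0;8.0+PTX"
python setup.py bdist_wheel   # or however the recipe drives the build
```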
I don't really want to have the conversation about supported architectures in individual feedstocks. Can we have the discussion in a more central location, like: Then maybe we can have best practices established. |
I already started a discussion about standardizing the archs that feedstocks target in the conda-forge.github.io repo: conda-forge/conda-forge.github.io#1901. I'd be happy to move the discussion there. I don't think the cuda-feedstock is the place for that discussion, because it's not an issue with the cuda package itself; it's a discussion about our channel policy, and is more similar to whether or not packages should target special instruction sets like AVX-512. |
In 2024, the most important aspect might be the fact that it currently takes about 1 hour on our CIs to get through intel ideep. If we unvendor it, it could speed up iteration time on our linux CIs. |
Patch pytorch to use the system mkldnn library from the onednn package rather than building one locally from within ideep submodules. Given that ideep itself is a header-only library, I presume this is what was meant in conda-forge#108 (comment), and indeed unvendoring onednn seems to improve build time significantly. That said, our onednn package does not support GPU runtime (conda-forge/onednn-feedstock#44) but at least according to my testing, that part of the library was not enabled by our PyTorch builds before (due to missing SYCL). The patch is a bit hacky, and probably needs some polishing before being submitted upstream (and testing on other platforms). Part of issue conda-forge#108
For the record, as I've noted in #289, ideep is a header-only library, and the part taking lots of time is mkldnn (AKA oneDNN), which for some reason is vendored inside ideep but built directly. That said, I've done some timings, using the non-CUDA build for a start. According to my numbers, mkldnn took around 6 minutes here, and the next unvendoring candidate would be XNNPACK; at a glance, it seems to take around 3 minutes, but it's hard to get exact numbers because it isn't built in one big chunk here, but split between other libtorch files. Beyond these two dependencies, I didn't notice anything else taking significant time. I've used |
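As a possible starting point for unvendoring, pytorch exposes `USE_SYSTEM_*` CMake options for some of its bundled deps; a sketch of checking which ones a given checkout supports (run from the pytorch source root, and note that option availability varies by version):

```bash
# list the USE_SYSTEM_* switches declared in the build files
grep -rhoE 'USE_SYSTEM_[A-Z0-9_]+' CMakeLists.txt cmake/ | sort -u
# e.g. point the build at a system XNNPACK instead of the submodule,
# assuming a compatible xnnpack is present in the build environment
cmake -S . -B build -DUSE_SYSTEM_XNNPACK=ON
```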
This matches what I recall of my build times. I think the onednn finding is a good clue; it takes approximately 1 hour for that compilation to happen on our CIs. But as you said, it is hard to estimate the "real world" improvements here, especially if they come at the cost of complexity in package maintenance in the onednn feedstock. |
This package currently requires more than 16 builds to be built manually to ensure that it completes in time on the CIs.
Step 1: No more git clone
rgommers identified that one portion of the build process that takes time is cloning the repository. In my experience, cloning the 1.5GB repo can take up to 10 min on my powerful local machine, but I feel like it can take much longer on the CIs.
To avoid cloning, we will have to either list out all the submodules manually, or make them conda-forge installable dependencies.
I mostly got this working using a recursive script which should help us keep it maintained: #109
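A hand-rolled sketch of the same idea (the script in #109 automates discovering these; the URLs, version, and the `<pinned-commit>` placeholder below are illustrative, not the recipe's actual pins):

```bash
# fetch the pytorch release tarball instead of doing a recursive clone
curl -L -o pytorch.tar.gz \
  https://github.com/pytorch/pytorch/archive/refs/tags/v1.10.0.tar.gz
tar xf pytorch.tar.gz
# then drop each pinned submodule tarball into its expected third_party/ path
mkdir -p pytorch-1.10.0/third_party/pthreadpool
curl -L https://github.com/Maratyszcza/pthreadpool/archive/<pinned-commit>.tar.gz \
  | tar xz -C pytorch-1.10.0/third_party/pthreadpool --strip-components=1
```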
Option 1: Split off Dependencies:
pthreadpool: splitting off pthreadpool runs into issues on OSX; third_party fp16.py cannot be found, and the other failure is with psimd.

Option 2 - step 1: Build a libpytorch package or something
By setting BUILD_PYTHON=OFF in #112, we then end up with the following libraries in lib and include:
Option 2 - step 2: Depend on new ATen/libpytorch package
Compilation time progress
3933/4242 (309 remaining)
3897/4242 (345 remaining)
3924/4242 (318 remaining)
1656/1969 (313 remaining)
3962/4242 (280 remaining)

There are approximately: