Not able to install 2.0 #358
Same issue, not a one-off.
I'm seeing a similar issue:
I also reproduce the error. Setting
Yeah I personally don't like the fact that we're templating so heavily (for dropout / no dropout, causal / not causal, different head dimensions, whether seqlen is divisible by 128 or not, different GPU types). The goal has been to get maximum performance, perhaps at the expense of compilation time.
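As a hedged illustration of the templating described above (names are made up; this is not flash-attention's actual source), each runtime flag becomes a template parameter, and the dispatch looks roughly like this:

```cpp
// Minimal sketch: every runtime flag is turned into a compile-time template
// parameter, so dropout x causal x head-dim (x seqlen divisibility x GPU arch)
// multiplies out into many separate kernel instantiations that all have to be
// compiled ahead of time.
#define BOOL_DISPATCH(COND, NAME, ...)          \
    [&] {                                       \
        if (COND) {                             \
            constexpr bool NAME = true;         \
            return __VA_ARGS__();               \
        } else {                                \
            constexpr bool NAME = false;        \
            return __VA_ARGS__();               \
        }                                       \
    }()

// Stand-in for a templated kernel launcher.
template <bool IsDropout, bool IsCausal, int HeadDim>
void run_attention_kernel() { /* launch the <IsDropout, IsCausal, HeadDim> kernel */ }

void run_attention(bool is_dropout, bool is_causal, int head_dim) {
    BOOL_DISPATCH(is_dropout, kIsDropout, [&] {
        BOOL_DISPATCH(is_causal, kIsCausal, [&] {
            if (head_dim <= 64) {
                run_attention_kernel<kIsDropout, kIsCausal, 64>();
            } else {
                run_attention_kernel<kIsDropout, kIsCausal, 128>();
            }
        });
    });
}
```

With five or six such axes, the number of fully specialized kernels reaches the hundreds, which is where the long compile times and high peak memory come from.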
Yeah, it's definitely a hard problem that we've also been hitting in Transformer Engine. It'll take some engineering effort, but I've found NVRTC to be a nice way to avoid the combinatorial explosion from templating. With the right wrapper classes (see rtc.h and rtc.cpp), it can be straightforward to write and launch JIT kernels (see how transpose.cu calls the kernel in rtc/transpose.cu). It does impose compilation time the first time each kernel is called, though, and it is quite annoying to include external headers (including the CUDA Toolkit and the C++ Standard Library).
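For anyone curious what the NVRTC route looks like, here is a minimal, hedged sketch (illustrative only; not Transformer Engine's rtc.h/rtc.cpp wrappers): compile a kernel from a source string at runtime and launch it through the CUDA driver API. The kernel name, block size, and target architecture are assumptions made for the example.

```cpp
#include <cuda.h>
#include <nvrtc.h>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    // Kernel source assembled at runtime; parameters like the block size could
    // be baked in per call instead of precompiling every template variant.
    std::string src =
        "constexpr int BLOCK = 256;\n"
        "extern \"C\" __global__ void scale(float *x, float a, int n) {\n"
        "    int i = blockIdx.x * BLOCK + threadIdx.x;\n"
        "    if (i < n) x[i] *= a;\n"
        "}\n";

    // Compile the string to PTX (error checking mostly omitted for brevity).
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src.c_str(), "scale.cu", 0, nullptr, nullptr);
    const char *opts[] = {"--gpu-architecture=compute_80"};  // assumed target
    if (nvrtcCompileProgram(prog, 1, opts) != NVRTC_SUCCESS) {
        size_t log_size;
        nvrtcGetProgramLogSize(prog, &log_size);
        std::string log(log_size, '\0');
        nvrtcGetProgramLog(prog, &log[0]);
        std::printf("%s\n", log.c_str());
        return 1;
    }
    size_t ptx_size;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::vector<char> ptx(ptx_size);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    // Load the PTX and launch through the driver API.
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;   cuModuleLoadDataEx(&mod, ptx.data(), 0, nullptr, nullptr);
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "scale");

    int n = 1024;
    float a = 2.0f;
    CUdeviceptr d_x;
    cuMemAlloc(&d_x, n * sizeof(float));  // data initialization omitted
    void *args[] = {&d_x, &a, &n};
    cuLaunchKernel(fn, (n + 255) / 256, 1, 1, 256, 1, 1, 0, nullptr, args, nullptr);
    cuCtxSynchronize();

    cuMemFree(d_x);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```

The trade-off mentioned above is visible here: nvrtcCompileProgram runs the first time the kernel is needed, and any headers the kernel source includes have to be supplied to NVRTC explicitly.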
Unfortunately for me, I still get the same error even with MAX_JOBS=1 set. I also tried building from source, with the same error. Is there any temporary workaround?
I tried with
EDIT: That looks like #343
Seeing the same problem here:
I am setting
Update: setting
I got the same error, so I modified setup.py to change ninja -v to ninja --version. Then I ran into
Same issue here.
Same here. Even after attempting a git submodule update --init --recursive --force, I still run into this issue, so if anyone finds anything, thank you very much.
Pretty sure this is an issue with CUDA 12, but I don't know whether there's a timeline to support CUDA 12.
Same issue here.
I solved it by updating the CUDA version; hope it helps.
I'm running 12.1, which works fine with an older commit.
You can reinstall Python from conda-forge (no need to change the Python version).
Can you link the commit hash here?
As it turns out, I may be wrong about that; it may have been a submodule issue.
It would be great if the devs could chip in here. Is there a timeline for supporting CUDA 12?
Downgrading all the CUDA packages to 11.8 and gcc to v11 appears to work.
I spoke too soon. Is this a compatibility issue? Not sure why those kernels aren't building.
I need to try with CUDA 12.x again just for fun, but it appears ninja itself, even if one modifies the call to ninja -v in torch, is causing those files not to build. Uninstalling ninja makes the build take longer, but those object files do appear to get created.
Can you build it without Ninja? I thought you couldn't.
I'm building xformers, which pulls in flash-attention via submodule, which may make a difference, but it implies that you can, at the cost of a slower build.
It works with MAX_JOBS=1 if you install from git. It will take a long time, so be patient. You can select a specific CUDA version using conda: https://hamel.dev/notes/cuda.html
It did not here, but maybe I needed to blow away my build directory first.
changing
I was following some half-baked advice from a similar issue on GitHub. Regardless, something odd is going on, and MAX_JOBS doesn't really seem to help (although it may be getting ignored despite me exporting it in the console, since I'm building as part of xformers).
I encountered the same issue when I built with nvcc 11.6; the package can be built with nvcc 11.8.
Same issue here. Could anybody tell me what will happen if I set MAX_JOBS=1?
In case anyone was wondering, since I have enough RAM and cores to test: I let Ninja do whatever it wanted with thread count and didn't change the setup file except to build for sm_89 in case it worked. RAM usage maxed out at something over half of what I have available before gradually sloping back down; 268 GB was the max, I think. All 32/64 cores were at 100%, and I'm pretty sure Ninja had at least one process running for every HT core. I didn't count them, but the CPU went from basically parked to 4.3 GHz and stayed there.

As soon as compiles started finishing (erroring, I mean) with constexpr issues, caused by a non-constexpr variable being passed by reference to another method and used to instantiate a template, cmd.exe began hanging for extended periods under the sheer volume of output it was trying to queue up for display. I've spammed it pretty impressively on purpose before (with the entire printable Unicode range at 2 MB/s, no less) and have never seen it hang on output until now. Had it not been throwing hundreds of thousands of template errors, I'd wager the build would be quite a bit faster and not have used quite so much memory, but that amount of memory usage is still insane, and I suspect something is broken with the template instantiation even when it builds OK.

This probably shouldn't be happening:

E:\code\flash-attention\csrc\flash_attn\src\flash_bwd_kernel.h(783): error: no instance of overloaded function "cute::copy" matches the argument list
argument types are: (
cute::TiledCopy
<cute::Copy_Atom
<cute::SM75_U32x4_LDSM_N, cutlass::half_t>,
cute::Layout
<cute::tuple
<cute::tuple<cute::C<4>, cute::_8, cute::_2, cute::_4>,
cute::tuple
<cute::tuple<cute::_2, cute::_2>,
cute::tuple<cute::_2, cute::_1>
>
>, cute::tuple
<cute::tuple<cute::_128, cute::_1, cute::_0, cute::_8>,
cute::tuple
<cute::tuple<cute::_64, cute::_512>,
cute::tuple<cute::C<32>, cute::_0>>
>
>,
cute::tuple
<cute::Layout
<cute::tuple<cute::C<8>, cute::C<4>, cute::_2>,
cute::tuple<cute::_1, cute::_32, cute::_8>
>,
cute::Layout<cute::C<16>, cute::_1>
>
>,
cute::Tensor
<cute::ViewEngine
<cute::smem_ptr<cutlass::half_t>>,
cute::Layout
<cute::tuple
<cute::tuple<cute::_8, cute::_1>, cute::_2,
cute::tuple
<cute::tuple<cute::_2, cute::_2>, cute::_2>>,
cute::tuple
<cute::tuple<cute::_1, cute::_0>,
cute::_1024,
cute::tuple<cute::tuple<int, int>,
cute::C<8192>
>
>
>
>,
<error-type>)
cute::copy(smem_tiled_copy_KV, tdPsV, tdPrV_copy_view);

Ignoring how bizarre that is, why is an SM7.5 template being instantiated? I know that wasn't on the command line. Is it a leftover? Is every possible arch being built at once, and is that the delay and the source of the enormous error count?

A long time ago I used to rebuild the entire LLVM compiler suite plus our product's integration with it multiple times a day; it totalled something like 50k C++ source files. My work machine had 24 GB of RAM and 6 cores, and I could still play video games and run a Linux VM during the 10 minutes it took Visual Studio to crank out a full rebuild with the source on a regular SATA2 SSD. Trying to build this didn't make my computer unresponsive or even slow it down, but I have a uniquely gigantic amount of memory installed for a home workstation. Your average current-ish 16-core Ryzen with 32 GB of DDR5 (because that's as much as you can install without crippling its speed, thanks to the scam of XMP) would have gone into swap-file territory almost instantly and been difficult to kill all the tasks on, since extra processes were being spawned for everything.

I'd strongly suggest killing off the ninja part of the build until it's un-screwed-up, and maybe look at https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2280r4.html, which might have something to do with all the template constexpr errors, although MS has been pretty on top of new standards lately. I did try setting the standard to C++20, but it was unclear whether it "took", since most command line options were being specified multiple times and things like HALF_FLOAT were being both defined and undefined on the same command line. It might be worth looking into clang-cl from the VS build tools to see if they've implemented something (right or wrong) that makes it work for now.

I'm also not super familiar with CUDA, but something doesn't sit quite right about building for SM90 automatically based on CUDA version when the 4090 is SM8.9 (90 is Hopper, I guess?), especially since two different versions get passed on the command line for the build (80 and 90, or in my case the 8.9 I shoved into setup.py to make it build the version I needed).

I'd also strongly suggest installing a Windows VM on your Linux install, sucking it up, and learning how to use the Visual Studio compilation tools. They're significantly easier to deal with than clang or GCC as far as that goes, and this kind of template spaghetti-factory explosion isn't something you can just sit around and say "I don't know Windows, somebody else can help" about forever if you want a Windows build. If you can't or aren't willing to do it, there's nothing wrong with just announcing the Windows build is dead. I'd rather it wasn't, but realistically projects have to be worked on by their maintainers, or at least be understandable by new people, and I'd personally rather fix bugs in boost::spirit for years.

A good first step might be deleting the hundreds of commented-out lines of code doing slightly different things with unlabelled values; one of the nice things about source control is that you don't have to keep five years of commented-out code, with no notes on what it did, why it was changed, or what was better or worse, lying around in the source tree. That at least gives people something clean to look at if you find somebody willing to take this on. I'm not trying to be rude, but man...
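For readers unfamiliar with the P2280R4 link above, here is a minimal illustration (not flash-attention code) of what the paper changes about references in constant expressions; whether a given MSVC/NVCC combination implements it is a separate question.

```cpp
#include <cstddef>

// The array size is carried entirely by the reference's type, so array_size()
// can be evaluated at compile time even though `param` refers to an object the
// compiler knows nothing about. Before P2280R4 this was ill-formed; with it,
// such "unknown" references are allowed in constant expressions. Compiler
// support for the change still varies.
template <typename T, std::size_t N>
constexpr std::size_t array_size(T (&)[N]) { return N; }

void check(const int (&param)[3]) {
    constexpr std::size_t n = array_size(param);  // OK under P2280R4
    static_assert(n == 3, "size is known from the type alone");
}
```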
Somebody from NVIDIA or Hugging Face might be willing to help you, since this gets used by xformers, which apparently doesn't have as much oomph without it, and Triton can't easily be built on Windows either.
This is a free and open source project, and I'm maintaining it in my free time. My expertise is not in compiling or building packages. |
I had a similar error with TransformerEngine: during installation there is a moment when it needs around 70 GB of RAM (total).
Now install it.
I tried both pip install and setup.py install.