Impossible to use the tutorials #1271
Comments
FP16 is not supported on pre-tensor-core GPUs. Can you try FP32? |
When using |
If it's a pre-Volta GPU, we don't generate the MMA layout by any means. So perhaps we shouldn't use Feel free to modify the code and contribute. |
Thanks ! I will take a look and see if I can find a way to avoid this issue and make a PR. Any idea for my second issue on |
Not sure how this problem is triggered yet. |
We don't have pre-Volta GPUs to test things out, but we can provide some guidance if you're interested in debugging the issue. I think the main thing for layer norm would be to figure out why the codegen is any different for your 1080 than for a Volta GPU. All GPUs with compute capability <= 70 should be treated the same 🤔 |
Right, I see the main idea! I will give it a look, but since I am a newbie in this kind of stuff, I'm not sure I can go too deep unfortunately... |
I can confirm I am also getting this issue on RTX A6000 |
I also encounter the issue "Argument rematerialization not implemented" when running 05-layer-norm.py on a100-80g. |
Randomly (not every time) getting
when running a custom fused linear layer. (has activation, dropout and scaling) edit: this was actually cuz of layernorm |
Hey @Dj1312 were you able to find a fix for this issue? |
How to fix this for pascal? Even if it's slower. |
Hey @ptillet, I'm trying to debug this issue on my pascal card. I have outlined my particular case in this issue qwopqwop200/GPTQ-for-LLaMa#142. I've swapped the following lines, note this is off of the v2.0.0 tag: with the following:
This results in the following error:
Do you have any suggestions? |
Unfortunately, no... |
So it needs to be cast somehow? But I swear I have run other float16 code. |
Related to #1271 . I am currently working on adding support for Pre-volta GPUs in Triton. --------- Co-authored-by: Himanshu Pathak <himanshu@mtatva.com> Co-authored-by: Philippe Tillet <phil@openai.com>
"Argument rematerialization not implemented" is probably a regression because the tutorials work for me on version |
Our docs build runs nightly without issues on an A100. It's possible there are some troubles on older GPUs unfortunately. I don't have any Pascal GPU I can use so it's hard for me to repro |
Just to add I think people are getting this error from running pip install as that version crashes when doing
on an A100 (cuda 11.8, torch 2.0.0+cu118, triton 2.0.0) (FusedLayerNorm uses this and code from the tutorial) Not clear to me how to get nightly without compiling the code (which if I'm understanding my compilation error correctly requires an advanced version of C++) |
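Since several reports in this thread hinge on which Triton and PyTorch versions are installed, a small helper can make the environment explicit in a bug report. This is only a sketch using the standard library's `importlib.metadata`; the function names `installed_version` and `environment_report` are hypothetical and not part of Triton or PyTorch.

```python
from importlib import metadata
from typing import Optional, Sequence


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages: Sequence[str] = ("triton", "torch")) -> str:
    """Build a one-line summary suitable for pasting into a bug report."""
    parts = []
    for name in packages:
        version = installed_version(name)
        parts.append(f"{name} {version if version is not None else 'not installed'}")
    return ", ".join(parts)


if __name__ == "__main__":
    # Prints one entry per package; the actual versions depend on your environment.
    print(environment_report())
```

Pasting this line next to the GPU model makes it easier to tell a regression in a released wheel apart from a hardware limitation.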
Nightly will be back up soon. Thanks for your patience! In the meantime recompiling the code shouldn't be too difficult |
I tried the tutorials on my GTX 970, and didn't get very far. I'm testing on latest main (commit dd2d5f4). 03-matrix-multiplication.py, 06-fused-attention.py, and 08-experimental-block-pointer.py (duplicate lines omitted)
05-layer-norm.py
|
Is there a nightly wheel available somewhere? |
I modified the code as follows and it works.
    # First store doesn't accumulate
    if count == 0:
        tl.atomic_xchg(Count, 1)
    else:
        # partial_dw += tl.load(DW, mask=mask)
        # partial_db += tl.load(DB, mask=mask)
    # ignore the condition of count == 0
    partial_dw += tl.load(DW, mask=mask)
    partial_db += tl.load(DB, mask=mask)
    tl.store(DW, partial_dw, mask=mask)
    tl.store(DB, partial_db, mask=mask
Maybe this condition triggers something. |
@mikegreen7892003 That will throw an IndentationError, you either need a 'pass' in the else block or you need to comment out the else clause entirely. Also, you're missing a closing parenthesis. |
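The two syntax problems pointed out above can be checked without a GPU, using Python's built-in `compile`. The sketch below uses plain stand-in statements rather than the real Triton kernel; `BROKEN`, `FIXED`, `UNCLOSED`, and `compiles` are hypothetical names chosen for illustration.

```python
import textwrap

# An else block that contains only comments is not a valid block.
BROKEN = textwrap.dedent("""
    if count == 0:
        first = True
    else:
        # partial_dw += load(DW)
    partial_dw += 1
""")

# Adding `pass` gives the else block a real statement.
FIXED = textwrap.dedent("""
    if count == 0:
        first = True
    else:
        pass  # keeps the else block syntactically valid
    partial_dw += 1
""")

# A call with a missing closing parenthesis.
UNCLOSED = "store(DB, partial_db, mask=mask\n"


def compiles(source: str) -> bool:
    """Return True if `source` parses; IndentationError is a subclass of SyntaxError."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


if __name__ == "__main__":
    for name, src in [("BROKEN", BROKEN), ("FIXED", FIXED), ("UNCLOSED", UNCLOSED)]:
        print(name, "parses" if compiles(src) else "does not parse")
```

Only parsing is checked here; whether the modified kernel computes the right gradients is a separate question.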
@cebtenzzre I believe this is because your GPU does not support operating on float16 inputs. Try to edit the tutorial code to use Note for triton developers: instead of crashing with a low level error message for unsupported dtypes, it would be more user friendly to raise a Python-level exception earlier with a higher level error message. At the moment I get on a GTX 1080 TI:
I am not sure how to inspect which dtypes are supported by a given device though. I had a look at: https://pytorch.org/docs/stable/cuda.html but the only thing I see would be to manually map the compute capability tuple to a list of supported dtypes. |
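The manual mapping mentioned above could look roughly like the sketch below. The thresholds are assumptions for illustration (tensor cores, and with them FP16 matrix multiply, start at compute capability 7.0 with Volta; BF16 starts at 8.0 with Ampere) and are not Triton's actual support matrix. The names `assumed_dot_dtypes` and `check_dot_dtype` are hypothetical.

```python
from typing import FrozenSet, Tuple

Capability = Tuple[int, int]


def assumed_dot_dtypes(capability: Capability) -> FrozenSet[str]:
    """Map a (major, minor) compute capability to dtypes assumed usable for matmul.

    Assumption, not an official table: float32 everywhere, float16 from
    Volta (7, 0), bfloat16 from Ampere (8, 0).
    """
    dtypes = {"float32"}
    if capability >= (7, 0):
        dtypes.add("float16")
    if capability >= (8, 0):
        dtypes.add("bfloat16")
    return frozenset(dtypes)


def check_dot_dtype(dtype_name: str, capability: Capability) -> None:
    """Raise a readable Python-level error before any kernel is compiled."""
    allowed = assumed_dot_dtypes(capability)
    if dtype_name not in allowed:
        major, minor = capability
        raise TypeError(
            f"{dtype_name} matmul is assumed unsupported on compute capability "
            f"{major}.{minor}; try one of: {', '.join(sorted(allowed))}"
        )


if __name__ == "__main__":
    # A GTX 1080 Ti reports compute capability (6, 1).
    try:
        check_dot_dtype("float16", (6, 1))
    except TypeError as err:
        print(err)
```

With PyTorch installed, the capability tuple can be obtained from `torch.cuda.get_device_capability()`, so the check could run before launching a tutorial kernel.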
Well pascal is unsupported. I mean why support a $200 24G card when everyone can buy $700 3090s or $3000 V100. 7b model should be enough for everyone :P |
…ointer (triton-lang#1272) Addition of a possible pattern for MMA layout propagation when the ConvertLayoutOp is inside the loop, the layout is retrieved from the layout map instead of the ConvertLayoutOp. Addresses Issue: triton-lang#1271 --------- Signed-off-by: Maxime France-Pillois <maxime.francepillois@codeplay.com>
Hi !
I am currently trying to understand how to use Triton with tutorials. Unfortunately, I encounter two different issues:
For 03-matrix-multiplication.py and 06-fused-attention.py, I get:
The error seems to occur at the line
Since I have a GTX 1080 in my computer, I am working with the Pascal architecture. The MMA is supported by Volta and Hopper. Nevertheless, is it possible to optimize the matmul for my GTX 1080?
For 05-layer-norm.py, the error is
For this one, I don't have any clue...
Does someone have some thoughts on my issues?
Thanks in advance and regards,
Lucas.