[AMP] CUDA support for mixed precision pass #8294
Got this error when running
I suspect this has to do with the schedule not actually supporting accumulation dtypes. Can you post the rest of the trace?
Hmm, yeah, the problem has to do with what I said. Specifically in

In general it seems reasonable to have implicit type promotion to higher-bit floating point types. Furthermore, it might also be good for most binary arithmetic ops to have output_dtypes. E.g. right now there isn't a good way to represent adding two fp16 numbers into an fp32 result. Later NVIDIA GPUs support this as a more primitive operation, so maybe we should have a better representation.
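To make the representation gap concrete, here is a minimal sketch using TVM's te API (the placeholder names are illustrative, not from the pass): to get an fp32 result from two fp16 inputs today, the operands have to be up-cast before the add, since the add itself has no separate output dtype.

```python
# Sketch: fp16 inputs accumulated into an fp32 result via explicit casts.
import tvm
from tvm import te

n = te.var("n")
a = te.placeholder((n,), dtype="float16", name="a")
b = te.placeholder((n,), dtype="float16", name="b")

# The add is performed in fp32; a true "fp16 x fp16 -> fp32" primitive
# (as on newer NVIDIA GPUs) would avoid the explicit up-casts.
c = te.compute(
    (n,),
    lambda i: a[i].astype("float32") + b[i].astype("float32"),
    name="c",
)
```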
I'm just going to turn off accumulating to fp32 for now. I don't want to manually look at every single schedule ever written to check for correctness. With that turned off, all the unit tests except one pass. The one that doesn't pass is the problem described by @Lunderberg. This one seems trickier since I don't understand how CUDA codegen works at all:
Yeah, so the failing test with this error
From what I can tell, the float16 values are packed into uint32 when not in use, and are cast to float16 when used. I think there will need to be some special handling to pad out the calls to
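For anyone else trying to reproduce this, a hedged sketch (not the failing unit test itself) that surfaces the packing behavior: build a small vectorized fp16 kernel and print the CUDA source TVM generates, where the float16x2 lanes show up as packed 32-bit storage. The schedule below is illustrative and assumes a CUDA-enabled build.

```python
# Sketch: inspect how the CUDA codegen emits a vectorized float16 add.
import tvm
from tvm import te

n = 1024
a = te.placeholder((n,), dtype="float16", name="a")
b = te.placeholder((n,), dtype="float16", name="b")
c = te.compute((n,), lambda i: a[i] + b[i], name="c")

s = te.create_schedule(c.op)
xo, xi = s[c].split(c.op.axis[0], factor=2)  # two fp16 lanes per thread
bx, tx = s[c].split(xo, factor=64)
s[c].bind(bx, te.thread_axis("blockIdx.x"))
s[c].bind(tx, te.thread_axis("threadIdx.x"))
s[c].vectorize(xi)  # the two half lanes get packed into one 32-bit value

mod = tvm.build(s, [a, b, c], target="cuda")
print(mod.imported_modules[0].get_source())  # generated CUDA C
```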
Yep, not familiar with CUDA codegen either. I can't seem to trivially find the ops that cause this. Looks like another TODO. @junrushao1994, do you have any ideas?
With PR #8341 we can tune some models. Results here: https://docs.google.com/spreadsheets/d/12lgyfuHaRS-X4uG-1iQOV8oAuPpuVAbspcmkOSPRFHQ
We see good speedups, especially for BERT.
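For reference, a minimal sketch of how a Relay module is converted before tuning, assuming the ToMixedPrecision pass from #8069 (the NHWC layout conversion and the tuning itself are omitted; the helper name is made up):

```python
# Sketch: convert a Relay module to fp16 with the AMP pass before tuning.
from tvm import relay

def convert_to_fp16(mod):
    mod = relay.transform.InferType()(mod)
    mod = relay.transform.ToMixedPrecision(mixed_precision_type="float16")(mod)
    # Optional: fold constants so casts on weights already bound into the
    # module are materialized as fp16 data.
    mod = relay.transform.FoldConstant()(mod)
    return mod
```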
I finally finished collecting data on FP16 performance using Tensor Cores. Since the NHWC conv2d Tensor Core schedule requires the batch size to be a multiple of at least 8, all batch sizes are 8. The speedup over FP32 (Ansor), which is a strong baseline, is mixed. I expected better performance from Tensor Cores, but I guess our Tensor Core schedules have room for improvement (I also hit a lot of errors when tuning Tensor Core schedules, due to invalid schedules). In most cases we are much slower than TensorRT (not sure if TensorRT
All numbers are in milliseconds and were measured on an RTX 3070. All models are in the NHWC layout.
I think we can close this now.
Solve issues and make modifications to support CUDA for the mixed precision pass introduced in #8069.
Current initial issues, as described by @Lunderberg:
This issue is complete when the unit tests pass for the CUDA target.
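For context, a hedged sketch of the per-op hook the pass exposes (the import path, constant, and return convention are assumed from #8069, not verified here). This is where an op declares its conversion category, accumulation dtype, and output dtype, and where fp32 accumulation can be switched off for ops whose CUDA schedules cannot handle a separate accumulation dtype:

```python
# Sketch: override the mixed-precision conversion rule for one op.
from tvm.relay.op import register_mixed_precision_conversion
from tvm.relay.transform.mixed_precision import MIXED_PRECISION_ALWAYS

# level=11 overrides the default registration (assumed to be level 10).
@register_mixed_precision_conversion("nn.conv2d", level=11)
def conv2d_mixed_precision_rule(call_node, mixed_precision_type):
    # Returns [conversion category, accumulation dtype, output dtype].
    return [
        MIXED_PRECISION_ALWAYS,
        mixed_precision_type,  # accumulate in fp16 instead of fp32
        mixed_precision_type,
    ]
```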