[trainer] figuring out why eval with --fp16_full_eval is 25% slower #10816
Comments
Hi @stas00,
Yes, please.
I reproduced this in Colab and got a 28% slowdown, but I'm still figuring out the cause.
Usually in such situations I try to either go from the bottom up or in reverse. You can use this tracker to bracket the operation you measure.
But a totally different approach, which might get to the core of the issue much faster, is to use a python profiler.
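As a rough illustration, a profiler run over just the eval loop could look something like the sketch below; the `model` and `eval_dataloader` names are placeholders, not objects from the actual script:

```python
import cProfile
import pstats

import torch

def run_eval(model, dataloader):
    # forward passes only, no gradients - mirrors what eval does;
    # batches are assumed to be dicts of tensors already on the GPU
    model.eval()
    with torch.no_grad():
        for batch in dataloader:
            model(**batch)
    # CUDA kernels launch asynchronously, so flush them before the profiler stops
    torch.cuda.synchronize()

profiler = cProfile.Profile()
profiler.enable()
run_eval(model, eval_dataloader)  # placeholders: supply your own model/dataloader
profiler.disable()

# print the 20 most expensive calls by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```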
I have done a few measurements on 2 different cards (a 3090 and a 2080 Ti) using various evaluation batch sizes, and I haven't observed a single culprit for this problem. Instead, I'm seeing that all the main operations in the evaluation loop are slower, and the time difference depends on the batch size.

Today I discovered this thread in the PyTorch forums and repeated the test using a version of PyTorch compiled from source. Amazingly, processing is now almost twice as fast, but the difference is still there: in this case, using a batch size of 128 (1 batch) is about 13% slower, while a batch size of 16 is 27% slower. I'm not sure how to proceed. Does this ring a bell for anyone?
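For reference, a minimal sketch of how such per-batch-size fp32 vs fp16 timings could be reproduced with CUDA events; the model, vocab size and sequence length here are placeholders, not the exact setup used above:

```python
import copy

import torch

@torch.no_grad()
def time_forward(model, input_ids, n_iters=20):
    # CUDA events account for asynchronously launched kernels
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    model(input_ids)  # warmup
    torch.cuda.synchronize()
    start.record()
    for _ in range(n_iters):
        model(input_ids)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_iters  # ms per forward pass

# hypothetical comparison: the same model in fp32 and fp16 at several batch sizes;
# assumes a model that accepts input_ids directly (seq2seq models also need decoder inputs)
model_fp32 = model.cuda().eval()                 # `model` is a placeholder
model_fp16 = copy.deepcopy(model_fp32).half()
for bs in (16, 32, 64, 128):
    ids = torch.randint(0, 32000, (bs, 128), device="cuda")
    print(bs, time_forward(model_fp32, ids), time_forward(model_fp16, ids))
```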
Thank you for researching and profiling, @pcuenca! I think the next step is the new pytorch profiler. Unfortunately, at the moment I have no time to dig into it, so I hope someone will beat me to it.

re: building from source: indeed, I recently built pytorch from source, and I don't know if it's that or something else since a month has passed since the OP, but I'm getting a 2x speed improvement (rtx-3090) on training this task. Eval is only slightly faster, but is still 25% slower @ fp16.

I also adapted the cmd line to the recently changed examples:
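In case it helps whoever picks this up, wrapping a single eval step with torch.profiler could look roughly like this (again, `model` and `batch` are placeholders):

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

model.eval()  # `model` and `batch` are assumed to already live on the GPU
with torch.no_grad(), profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    with record_function("eval_step"):
        model(**batch)

# the ops dominating CUDA time are where an fp16 vs fp32 gap should show up
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```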
By running everything under the profiler, I found some checks that were slowing things down. After removing those checks, this is what I end up with:
The same with fp16:
Note that I had to dial up the number of eval examples since this measurement was quite noisy on the shared system I used. However, FP16 was faster most of the time. If someone could double-check these observations under more reliable circumstances, that would be great.
Thank you for looking into it, @dsuess! I'm trying to figure out torch.profiler to get a better understanding using native tools. Great to hear you found those checks to be slowdowns; I need to investigate these closer with torch.profiler. I also found a change that helps for fp16 (it makes no difference for fp32).

I will look closer into the 2 points you suggested, but we should also run under a more realistic configuration of at least seqlen 512, and not 12 like I had it originally; with a large seqlen things change quite a lot.
Thanks for your feedback @stas00. I finally got the time to have a closer look with the pytorch profiler. I'd summarize what I found as follows:
These conversions happen in the layer norm and before the softmax, which matches your observation. I also double-checked the layer norm with this micro benchmark, which runs ~30% slower in FP16. Judging from the issue you raised, we can't run layer norm in FP16; I'd expect the same to be true for softmax, so I am unsure if we can get rid of those conversions. We may have a chance to get more out of the matmuls, so I'll try to figure out why those kernels don't run on Tensor Cores despite being eligible. I've done all these experiments on a 3080 Ti.
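The micro benchmark itself isn't reproduced in this thread; a rough equivalent for a plain nn.LayerNorm (shapes made up) might look like this:

```python
import torch
import torch.nn as nn

def bench_layer_norm(dtype, hidden=1024, batch=128, seq=512, n_iters=100):
    # time repeated layer norm calls in the given dtype using CUDA events
    ln = nn.LayerNorm(hidden).to(device="cuda", dtype=dtype)
    x = torch.randn(batch, seq, hidden, device="cuda", dtype=dtype)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    ln(x)  # warmup
    torch.cuda.synchronize()
    start.record()
    for _ in range(n_iters):
        ln(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_iters  # ms per call

print("fp32:", bench_layer_norm(torch.float32))
print("fp16:", bench_layer_norm(torch.float16))
```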
This is fantastic work, @dsuess! Here is an additional profiling report of the same issue, but under tf32: #14608 (comment). This appears to be specific to t5 and derived models. And yes, the problem is that it uses RMSNorm, which pytorch doesn't provide, and that's why it's slow. I made a request for a fused RMSNorm kernel here: NVIDIA/apex#1271, and once that is done I will ask to upstream it into pytorch. I hope this will solve the issue.

I also tried to avoid re-casting by trying to deploy the existing fused functions: #14656, but I couldn't find a faster way using the existing pytorch python API.

Have you by chance tried any other architectures with the same benchmarks? e.g. gpt2 and bert, as they are very distinct from t5.
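For anyone following along, the re-casting being discussed comes from RMSNorm-style layer norms computing their statistics in fp32 and then casting back to the working dtype. A simplified sketch of that pattern (not the exact transformers implementation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Simplified RMSNorm: scale by the root-mean-square, no mean subtraction or bias."""

    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        # statistics are computed in fp32 for numerical stability ...
        variance = x.to(torch.float32).pow(2).mean(-1, keepdim=True)
        x = x.to(torch.float32) * torch.rsqrt(variance + self.eps)
        # ... then the result is cast back to the weight's dtype (fp16 after model.half()),
        # paying for the extra conversions on every call
        return self.weight * x.to(self.weight.dtype)
```

Under `--fp16_full_eval` those dtype conversions run for every layer norm call in every layer, which is why a fused kernel would help.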
Great benchmark of the different data types, thanks for sharing.
I've just tested the same script with some of the mbart variants and, as expected, fp16 is faster for those.
Recently the HF trainer was extended to support full fp16 eval via `--fp16_full_eval`. I'd have expected it to be either equal or faster than eval with the fp32 model, but surprisingly I have noticed a 25% slowdown when using it.

This may or may not impact deepspeed as well, which also runs eval in fp16, but we can't compare it to a baseline, since it only runs fp16.
I wonder if someone would like to research where the slowdown comes from.
I'd probably isolate the `model.half()` call, which should be a constant, and focus on the rest of the eval. I'm thinking that some component doesn't take well to fp16 variables. e.g. label smoothing was problematic and should now be fixed in #10815, but I tested w/ and w/o label smoothing and it's not adding to the slowdown.

Here are the script and the corresponding metrics.
First w/o `--fp16_full_eval`; now let's add `--fp16_full_eval`:

As you can see, w/o `--fp16_full_eval` we get ~22 samples per sec and w/ it only ~17 - that's a huge difference. I also tested with a larger sample and the gap remains constant.
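A rough way to act on the suggestion above and separate the constant `model.half()` cost from the per-batch eval cost (again with placeholder `model`/`eval_dataloader` names) could be:

```python
import time

import torch

def timed(fn):
    # wall-clock timing with CUDA syncs so queued kernels are included
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    out = fn()
    torch.cuda.synchronize()
    return out, time.perf_counter() - t0

# `model` and `eval_dataloader` are assumed to exist
_, half_time = timed(lambda: model.half())
print(f"model.half(): {half_time:.3f}s (one-off, constant cost)")

@torch.no_grad()
def eval_loop():
    model.eval()
    for batch in eval_dataloader:
        # assumes batches are dicts of tensors
        model(**{k: v.cuda() for k, v in batch.items()})

_, eval_time = timed(eval_loop)
print(f"eval loop: {eval_time:.3f}s (this is where the fp16 slowdown shows up)")
```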
The halving happens here: transformers/src/transformers/trainer.py, line 1800 in 21e86f9
Thank you!