Fix layer_normalize gradients #3001

Merged — 8 commits merged into davisking:master on Sep 1, 2024

Conversation

@arrufat (Contributor) commented Aug 28, 2024

Closes #2902

@arrufat arrufat marked this pull request as draft August 28, 2024 13:52
@arrufat (Contributor, Author) commented Aug 28, 2024

It still doesn't fix the discrepancy between GPU and CPU, but it does fix a bug in the implementation.

@davisking (Owner) commented

I'm looking at this and I'm not sure what the code is supposed to be doing. Go through the contracts and make sure they are right. For example:

    void layer_normalize (
        const double eps,
        resizable_tensor& dest,
        resizable_tensor& means,
        resizable_tensor& invstds,
        const tensor& src,
        const tensor& gamma,
        const tensor& beta
    );
    /*!
        requires
            - eps > 0
            - src.num_samples() == gamma.size() == beta.size()
            - have_same_dimensions(gamma, beta) == true
            - beta.num_samples() == beta.nr() == gamma.nc() == 1

That's saying beta and gamma are the same shape and all their dimensions are 1 except k, which would have to be src.num_samples(). By that reading, what was in the code before this PR would make sense. But running some of these, I see that gamma and beta don't have the shape that the requires clause says they do. So there is inconsistency in how these are being interpreted, i.e. is gamma.size() == src.num_samples() or is it src.k() * src.nr() * src.nc()?

That's totally at the root of the problem here. Everything starts with having contracts that are right. Then put DLIB_ASSERT statements that check all the requires clauses (something along the lines of the sketch below) so you know for sure they are not being violated. That will chase down the problem. Although you've got to decide what the arguments are first; I'm not sure what you want them to be for this layer. I would think gamma.size() == src.num_samples(), as that's probably the most typical variant of layer norm though.
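
A minimal sketch of what those checks could look like at the top of layer_normalize, assuming the requires clause stays as quoted above (the exact set of asserts is illustrative, not the ones the PR ends up adding):

    // Illustrative DLIB_ASSERTs mirroring the quoted requires clause, so a
    // shape mismatch fails loudly instead of producing silently wrong gradients.
    DLIB_ASSERT(eps > 0);
    DLIB_ASSERT(src.num_samples() == gamma.size());
    DLIB_ASSERT(src.num_samples() == beta.size());
    DLIB_ASSERT(have_same_dimensions(gamma, beta));
    DLIB_ASSERT(beta.num_samples() == 1 && beta.nr() == 1 && beta.nc() == 1);

The same checks would go into layer_normalize_gradient once the intended shapes of gamma and beta are settled.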

@davisking (Owner) commented Aug 29, 2024

Inside layer_normalize_gradient() for the cuda code, there are also local resizable_tensor variables (dvars and dmeans). Those can't live there. It's really expensive to be creating and destroying tensors; they need to live in a layer object so they aren't created and destroyed each time the network runs, but rather are allocated once. More than that, kernels run asynchronously in CUDA, so that _cuda_layer_normalize_gradient kernel launches but then those two variables are immediately freed. That's probably why the cuda version isn't working, since it's running on dangling pointers.
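
A hedged sketch of that ownership pattern (not the actual diff in this PR; the struct and method names here are made up for illustration): the scratch tensors used by the asynchronous kernel belong in an object that outlives the call and is reused across passes, rather than being created as locals inside the gradient function.

    #include <dlib/dnn.h>

    // Sketch only: scratch tensors owned by a long-lived object so the device
    // memory stays valid after the asynchronous kernel launch and is not
    // reallocated on every backward pass.
    struct layer_norm_backward_scratch
    {
        dlib::resizable_tensor dvars;
        dlib::resizable_tensor dmeans;

        void resize_like(const dlib::tensor& invstds, const dlib::tensor& means)
        {
            // copy_size() only reallocates when the shape actually changes.
            dvars.copy_size(invstds);
            dmeans.copy_size(means);
        }
    };

Creating dvars and dmeans as locals inside layer_normalize_gradient() means their device buffers are freed as soon as the function returns, while the launched kernel may still be reading and writing them.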

Sorry I never really looked at this. You write such good PRs I just kinda skimmed this one and was like "yeah another Adria PR, going to be great and looks great 👍 :D " without really reading it all.

@arrufat (Contributor, Author) commented Aug 29, 2024

Oh, right, what was I thinking. It looks like I got confused half-way through the code, where I should normalize each channel independently, but I ended up trying to normalize along k, nr, and nc. Definitely, k should not be included in the normalization.

Hopefully, I will fix that and the dangling pointer tonight if life allows.

@arrufat (Contributor, Author) commented Aug 29, 2024

I was just checking this again:
https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html

[Screenshot of the LayerNorm formula from the PyTorch docs: y = (x − E[x]) / sqrt(Var[x] + ε) * γ + β, where the mean and variance are computed over the last dimensions given by normalized_shape, and γ and β are learnable parameters of shape normalized_shape.]

It seems like it does normalize along C, H, W, and there's one beta and one gamma for each normalized element.

import torch
N, C, H, W = 20, 5, 10, 10
x = torch.randn(N, C, H, W)
layer_norm = torch.nn.LayerNorm([C, H, W])
y = layer_norm(x)
sum(p.numel() for p in layer_norm.parameters() if p.requires_grad)  # 1000 = 2 * (5 * 10 * 10) 

So, maybe the issue is just the dangling pointers? I will make sure the contracts are correct and respected, though.

EDIT: after necrobumping ConvNeXt, each LayerNorm only has 2 * C learnable parameters (beta and gamma), so the implementation here is wrong. You're right about the dimensions of beta and gamma: they should only have k parameters each.

@davisking (Owner) commented Aug 29, 2024 via email

@arrufat (Contributor, Author) commented Aug 30, 2024

I am now confident about the CPU implementation; however, the CUDA version still fails.
There's a mismatch between CPU and CUDA in the beta and gamma gradients, which are trivial, but I can't spot the mistake.
It's getting late, I'll try again later. Feel free to check, it must be something really stupid.

@arrufat (Contributor, Author) commented Aug 31, 2024

I honestly don't know what else to do.
The CPU version works correctly now, but not the CUDA version.

However, if you run test_layer_normalize with CUDA enabled, you'll see that all of the functionality is on par with the CPU version:

  • normalized output tensor
  • src_grad
  • gamma_grad
  • beta_grad
  • dmeans
  • dvars

All of them are within the error tolerance of 1e-5 or 1e-4 (along the lines of the comparison sketched below). However, test_layer with layer_norm still fails.
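
For reference, a comparison of that sort can be sketched as a standalone check. This is hypothetical code, not the actual dlib test: it assumes cuda::layer_normalize takes the same arguments as the CPU contract quoted earlier, and that gamma and beta carry one value per channel (size == src.k()).

    // Hypothetical check: run the CPU and CUDA layer_normalize on the same input
    // and print the largest elementwise difference between the outputs.
    #include <dlib/dnn.h>
    #include <iostream>

    using namespace dlib;

    int main()
    {
        resizable_tensor src(20, 5, 10, 10);       // N=20, K=5, 10x10 feature maps
        resizable_tensor gamma(1, 5), beta(1, 5);  // one scale/offset per channel
        tt::tensor_rand rnd;
        rnd.fill_gaussian(src);
        gamma = 1;
        beta = 0;

        resizable_tensor out_cpu, means_cpu, invstds_cpu;
        cpu::layer_normalize(1e-5, out_cpu, means_cpu, invstds_cpu, src, gamma, beta);

    #ifdef DLIB_USE_CUDA
        resizable_tensor out_cuda, means_cuda, invstds_cuda;
        cuda::layer_normalize(1e-5, out_cuda, means_cuda, invstds_cuda, src, gamma, beta);
        // The tolerance used in the thread for these comparisons is 1e-5 or 1e-4.
        std::cout << "max difference: " << max(abs(mat(out_cpu) - mat(out_cuda))) << "\n";
    #endif
    }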

@arrufat (Contributor, Author) commented Aug 31, 2024

Ok, fixed a race condition, now test_layer complains like this:

Average parameter gradient error is somewhat large at: 0.00713434

EDIT: after running a clean build, it's working!

@arrufat arrufat marked this pull request as ready for review August 31, 2024 14:22
@davisking (Owner) commented

Nice. I'm away from my computer. I'll look in a bit. Seems like you got it 🥳

@arrufat (Contributor, Author) commented Aug 31, 2024

It took an awful lot of time...

Review comment on dlib/cuda/cuda_dlib.cu, lines 2208 to 2225 (outdated):
// Per-(sample, channel) accumulation of the beta and gamma gradients, reduced
// across the warp and accumulated atomically into bg[k] and gg[k].
for (auto nk : grid_stride_range_y(0, ns * ks))
{
    const auto n = nk / ks;
    const auto k = nk % ks;
    const auto ps = s + (n * ks + k) * num;
    const auto pgi = gi + (n * ks + k) * num;
    float temp_bg = 0;
    float temp_gg = 0;
    for (auto i : grid_stride_range(0, num))
    {
        // x_hat is the normalized input: (x - mean) * invstd, both per sample.
        const float x_hat = (ps[i] - m[n]) * v[n];
        temp_bg += pgi[i];
        temp_gg += pgi[i] * x_hat;
    }
    warp_reduce_atomic_add(bg[k], temp_bg);
    warp_reduce_atomic_add(gg[k], temp_gg);
}
__syncthreads();
@davisking (Owner) commented:

Yeah that kind of warp reduction loop is the best way I know to do it too.
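
For context, a helper in that style typically does a shuffle-based intra-warp reduction followed by a single atomicAdd per warp. A hedged sketch (not dlib's actual warp_reduce_atomic_add, and assuming all 32 lanes of the warp are active):

    // Sketch of a warp-level "reduce, then atomically add" helper in CUDA.
    __device__ void warp_reduce_atomic_add_sketch(float& dest, float val)
    {
        // Tree reduction within the warp using register shuffles.
        for (int offset = warpSize / 2; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffff, val, offset);
        // Only the first lane of the warp touches global memory.
        if ((threadIdx.x & (warpSize - 1)) == 0)
            atomicAdd(&dest, val);
    }

This keeps the atomic traffic down to one atomicAdd per warp instead of one per thread, which is a big part of why the kernel is fast.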

@arrufat (Contributor, Author) commented Sep 1, 2024

After the previous two commits, the network went from training at 320 img/s to 2450 img/s (close to the official CUDA/cuDNN batch norm at 2560 img/s).

@davisking (Owner) commented

> After the previous two commits, the network went from training at 320 img/s to 2450 img/s (close to the official CUDA/cuDNN batch norm at 2560 img/s).

Yeah that's awesome. All the tests are passing for me too on a GPU machine. Passing for you too now? Anything else you want to change before I merge it? :D

@arrufat (Contributor, Author) commented Sep 1, 2024

Nothing else to add, I think it's done now. FINALLY.

And yes, tests are passing now :D

@davisking (Owner) commented

Yeah nice, thanks for all the good work. Looks perfect :D

davisking merged commit 253098e into davisking:master on Sep 1, 2024
10 checks passed
Successfully merging this pull request may close these issues.

[Bug]: Wrong gradients in the CUDA implementation of Layer Norm