[CONV] fix naive conv kernel for large tensors #3434

Open · wants to merge 5 commits into base: develop
Conversation

@bghimireamd (Contributor) commented Dec 12, 2024:

For larger tensors I was seeing:

gdims[0] : 38,654,705,664
globalWorkSizeX : 4,294,967,295 (max allowed by uint32_t)
MaxGridDimX : 2,147,483,647

gdims[0] was exceeding MaxGridDimX and globalWorkSizeX for the driver command below.

./bin/MIOpenDriver convbfp16 -n 589824 -c 256 -H 4 -w 34 -k 256 -y 1 -x 3 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1

This PR reduces gdims[0] so that it stays within MaxGridDimX.
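
As a hedged illustration of the numbers above (a standalone example, not MIOpen code), the reported gdims[0] fits neither the uint32_t launch parameter nor MaxGridDimX:

```cpp
// Illustration of the reported launch numbers (standalone example, not MIOpen code).
#include <cstdint>
#include <iostream>

int main()
{
    const std::uint64_t gdim_x       = 38'654'705'664ULL; // gdims[0] reported above
    const std::uint64_t max_uint32   = UINT32_MAX;        // 4,294,967,295
    const std::uint64_t max_grid_dim = 2'147'483'647ULL;  // MaxGridDimX reported above

    std::cout << "exceeds uint32_t    : " << (gdim_x > max_uint32)   << '\n'; // 1
    std::cout << "exceeds MaxGridDimX : " << (gdim_x > max_grid_dim) << '\n'; // 1
}
```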

@BradPepersAMD (Collaborator) left a comment:

can we add a unit test for this?

    }
    else
    {
        grid_size = (all_workload + block_size - 1) / block_size;
A Contributor commented:

See #2748
It is an integer Ceil() function.
It's just a reminder that the problem still exists.

Comment on lines +376 to +384
    size_t all_workload = static_cast<size_t>(group) * n * ho;
    if(all_workload <= block_size)
    {
        grid_size = all_workload;
    }
    else
    {
        grid_size = (all_workload + block_size - 1) / block_size;
    }
A Contributor commented:

I don't feel like the problem is solved here; actually, I see a few more problems.
Now it divides the total workload by 256 - technically speaking, the overflow is just 256 times further away. Quite far, but still there.
And since the total workload is divided by 256, the GPU is 256 times underloaded. That can be a huge performance drop for a wide range of legitimate tensor sizes, and even though it's a naive algorithm, we use it everywhere in the tests to compute reference data.
The last concern is the kernel itself - it should be aware of the fact that the number of groups can be capped, and it should contain an extra loop to handle that.
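
As a hedged sketch of that last point (a hypothetical HIP kernel, not the actual MIOpen naive-conv source): if the host caps the number of work-groups, a grid-stride loop lets the kernel still cover the whole workload, so the cap affects only occupancy, not correctness.

```cpp
// Hypothetical grid-stride variant of a naive-conv-style kernel.
#include <hip/hip_runtime.h>
#include <cstddef>

extern "C" __global__ void naive_conv_fwd_capped(const float* __restrict__ in,
                                                 const float* __restrict__ wei,
                                                 float* __restrict__ out,
                                                 size_t total_slices) // = group * n * ho
{
    // Each work-group processes the slice it was launched for, then strides
    // forward by gridDim.x until every (group, n, ho) slice has been handled,
    // so a capped grid still covers the whole tensor.
    for(size_t slice = blockIdx.x; slice < total_slices; slice += gridDim.x)
    {
        // ... decode (group, n, ho) from `slice` and compute that output row,
        //     exactly as the existing per-work-group body would ...
        (void)in; (void)wei; (void)out;
    }
}
```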

@bghimireamd (Contributor, Author) replied:

I will update the kernel itself to handle the capped number of groups.

A Contributor commented:

I'm not sure whether a new kernel should be implemented, whether the old one can be changed, or whether the old one already has this support and nothing needs to change at all - that should be checked first.

The underloaded-GPU problem should be fixed too.
Imagine all_workload is 256 and we have a grid size of 256; when it is 257, the grid size suddenly becomes 2.
We have more work but fewer workers.
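
One hedged way to reconcile the overflow fix with GPU utilization (illustrative only, not the PR's change): keep one work-group per workload element until the device limit is reached, cap only beyond that, and rely on a stride loop in the kernel to cover the remainder.

```cpp
// Sketch of a capped-but-not-underloaded grid-size computation (assumed helper, not MIOpen API).
#include <algorithm>
#include <cstddef>

std::size_t ComputeGridSize(std::size_t all_workload, std::size_t max_grid_dim_x)
{
    // One work-group per logical slice while it fits; cap at the device limit otherwise.
    // With this formula, all_workload == 256 gives 256 groups and 257 gives 257 groups,
    // instead of dropping to 2 as in the ceil-divide approach discussed above.
    return std::min(all_workload, max_grid_dim_x);
}
```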

@bghimireamd (Contributor, Author) commented Dec 16, 2024:

Maybe we don't need to modify the kernel. We could launch the same kernel in a loop, adjusting the chunk size and buffer offsets as needed. That would handle the uint32_t limitation in hipExtModuleLaunchKernel, which currently overflows when we pass a global work size of gridX = (589824 * 256) * 256.
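
A minimal sketch of that chunked-launch idea (the helper names here are hypothetical placeholders, not MIOpen or HIP API): split the workload so that each launch's global work size fits the 32-bit parameter, and pass the running offset to the unmodified kernel.

```cpp
// Hypothetical host-side chunking sketch; LaunchNaiveConvChunk is a made-up placeholder
// for "launch the existing kernel once, starting at a given logical offset".
#include <algorithm>
#include <cstddef>
#include <cstdint>

void LaunchNaiveConvChunk(std::size_t group_offset, std::size_t num_groups, std::size_t block_size)
{
    // Placeholder: the real body would offset the buffer pointers / pass `group_offset`
    // to the kernel and call the HIP launch API with num_groups * block_size work-items.
    (void)group_offset; (void)num_groups; (void)block_size;
}

void LaunchInChunks(std::size_t all_workload, std::size_t block_size)
{
    // Largest work-group count whose global work size still fits the 32-bit launch field.
    const std::size_t max_groups_per_launch = UINT32_MAX / block_size;

    for(std::size_t done = 0; done < all_workload; done += max_groups_per_launch)
    {
        const std::size_t groups = std::min(max_groups_per_launch, all_workload - done);
        LaunchNaiveConvChunk(/*group_offset=*/done, /*num_groups=*/groups, block_size);
    }
}
```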

A Collaborator commented:

I was looking to see how we handle this issue in other locations (since it seems like it would be a global constraint).

Looks like the batched_transpose solver also has a version of this issue (and seems somewhat likely we have this issue throughout MIOpen).

For HIP this is a general constraint across any kernel launch I think:

> What are the maximum limits of kernel launch parameters?
>
> Product of block.x, block.y, and block.z should be less than 1024. Please note, HIP does not support kernel launch with total work items defined in dimension with size gridDim x blockDim >= 2^32, so gridDim.x * blockDim.x, gridDim.y * blockDim.y and gridDim.z * blockDim.z are always less than 2^32.

I think we might need to come up with a general solution for this, and make sure it's implemented broadly.
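
A rough sketch of what such a shared guard could look like (a hypothetical helper, not an existing MIOpen or HIP utility), checking the documented limits in one place so an out-of-range configuration fails explicitly instead of overflowing a 32-bit launch parameter:

```cpp
// Hypothetical launch-dimension guard based on the limits quoted above.
#include <cstddef>
#include <cstdint>

struct LaunchDims
{
    std::size_t grid_x, grid_y, grid_z;    // work-groups per dimension
    std::size_t block_x, block_y, block_z; // work-items per work-group
};

bool LaunchDimsAreValid(const LaunchDims& d)
{
    // Total work-items per work-group must not exceed 1024.
    if(d.block_x * d.block_y * d.block_z > 1024)
        return false;
    // gridDim * blockDim in each dimension must stay below 2^32.
    const std::uint64_t limit = std::uint64_t{1} << 32;
    return (d.grid_x * d.block_x < limit) &&
           (d.grid_y * d.block_y < limit) &&
           (d.grid_z * d.block_z < limit);
}
```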

A Contributor commented:

> I was looking to see how we handle this issue in other locations (since it seems like it would be a global constraint).

We don't, kind of. There are some places where the kernel is aware of the work-group count limit, and sometimes the number of work-groups is capped at some value like 4096. That's mostly it.

> I think we might need to come up with a general solution for this, and make sure it's implemented broadly.

I'm not sure that can be easily implemented. The main reason is that the number of work-groups heavily depends on the algorithm and, most importantly, on the kernel itself, and sometimes it even comes from heuristics.
Putting some hard limit in the library will not solve the problem, and it can even do harm: previously the runtime explicitly failed to launch the kernel, whereas now the launch would be silently capped and produce a wrong result, which is much harder to notice, especially when you run production code without any verification.

@CAHEK7 (Contributor) commented Dec 13, 2024:

> can we add a unit test for this?

It's barely possible, at least not without affecting current development and CI; at the very least the test should be added as a special "huge tensor" test - a specifically tailored case for specific machines (a sketch of such a guard follows the list below).

  1. We may have memory problems on CI machines - not every GPU can hold such huge tensors (probably not relevant for MI200+ cards).
  2. We will have a problem with tensor initialization on the host - the problem probably does not exist for convolutions, since Artem replaced CPU-based tensor init with GPU-based init (and probably only for some particular datatypes), but all the other algorithms do not use GPU initialization.
  3. We will have problems with verification - even if we initialize everything on the GPU and compute everything on the GPU, we still have to copy the results and the reference data back to the CPU and run slow CPU-based verification.
  4. It's almost like 3 - since this is the naive algorithm, we use the CPU for its verification, and CPU convolution is quite slow for huge tensors.
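
Regarding problem 1 above, a hedged sketch of the kind of memory guard such a dedicated "huge tensor" case could use (GoogleTest style; the test name and the byte estimate are assumptions, not existing MIOpen test code):

```cpp
// Hypothetical guard for a special "huge tensor" test case.
#include <gtest/gtest.h>
#include <hip/hip_runtime.h>
#include <cstddef>

TEST(ConvHugeTensor, NaiveFwdLargeGrid)
{
    std::size_t free_bytes = 0, total_bytes = 0;
    ASSERT_EQ(hipMemGetInfo(&free_bytes, &total_bytes), hipSuccess);

    // Rough requirement for the failing config (input + weights + output); assumed ballpark.
    const std::size_t required_bytes = 110ULL * 1024 * 1024 * 1024;
    if(free_bytes < required_bytes)
        GTEST_SKIP() << "Not enough device memory for the huge-tensor case";

    // ... run the convbfp16 case from the PR description and verify as discussed below ...
}
```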

@bghimireamd (Contributor, Author) commented:

> > can we add a unit test for this?
>
> It's barely possible, at least not without affecting current development and CI; at the very least the test should be added as a special "huge tensor" test - a specifically tailored case for specific machines.
>
> 1. We may have memory problems on CI machines - not every GPU can hold such huge tensors (probably not relevant for MI200+ cards).
> 2. We will have a problem with tensor initialization on the host - the problem probably does not exist for convolutions, since Artem replaced CPU-based tensor init with GPU-based init (and probably only for some particular datatypes), but all the other algorithms do not use GPU initialization.
> 3. We will have problems with verification - even if we initialize everything on the GPU and compute everything on the GPU, we still have to copy the results and the reference data back to the CPU and run slow CPU-based verification.
> 4. It's almost like 3 - since this is the naive algorithm, we use the CPU for its verification, and CPU convolution is quite slow for huge tensors.

We do have https://github.com/ROCm/MIOpen/blob/develop/test/gpu_reference_kernel.cpp

@CAHEK7 (Contributor) commented Dec 13, 2024:

> We do have https://github.com/ROCm/MIOpen/blob/develop/test/gpu_reference_kernel.cpp

Yes, but that is naive, single-threaded, ultra-slow CPU verification for the naive GPU algorithm. That test is not about "huge" tensors; it has exactly the problems I described.

@bghimireamd (Contributor, Author) commented:

> > can we add a unit test for this?
>
> It's barely possible, at least not without affecting current development and CI; at the very least the test should be added as a special "huge tensor" test - a specifically tailored case for specific machines.
>
> 1. We may have memory problems on CI machines - not every GPU can hold such huge tensors (probably not relevant for MI200+ cards).
> 2. We will have a problem with tensor initialization on the host - the problem probably does not exist for convolutions, since Artem replaced CPU-based tensor init with GPU-based init (and probably only for some particular datatypes), but all the other algorithms do not use GPU initialization.
> 3. We will have problems with verification - even if we initialize everything on the GPU and compute everything on the GPU, we still have to copy the results and the reference data back to the CPU and run slow CPU-based verification.
> 4. It's almost like 3 - since this is the naive algorithm, we use the CPU for its verification, and CPU convolution is quite slow for huge tensors.

Yes, we do need to do the slow CPU run. I can make the test a nightly run.

@CAHEK7 (Contributor) commented Dec 17, 2024:

> Yes, we do need to do the slow CPU run. I can make the test a nightly run.

I'm not sure that we do. It depends on how we treat the reference data.

For example, when two algorithms reach a consensus during a manual run, we can assume that they produce the same data.
Following that logic, we can take these steps (a comparison sketch follows below):

  1. Manually run naive GPU vs. naive CPU once and make sure both produce the same output.
  2. Add the same case to CI, but compare the naive GPU and a non-naive GPU algorithm - assuming the naive GPU produced a correct result in step 1, if the naive and non-naive GPU results match, everything is fine.

The only way this can break is if we simultaneously break the naive and non-naive implementations in exactly the same way - in that case both algorithms produce the same wrong result and the test still passes. In any other case, the test will indicate that either the naive or the non-naive version is broken, and that's enough to start checking everything manually.
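
A minimal sketch of the step-2 comparison (illustrative only; a real test would reuse the library's existing verification helpers, and the tolerance here is an assumption):

```cpp
// Hypothetical host-side consensus check: compare naive-GPU output against a
// non-naive GPU solver's output, both already copied back to the host.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

bool OutputsAgree(const std::vector<float>& naive_gpu,
                  const std::vector<float>& fast_gpu,
                  float tolerance = 1e-3f) // assumed tolerance
{
    if(naive_gpu.size() != fast_gpu.size())
        return false;
    for(std::size_t i = 0; i < naive_gpu.size(); ++i)
    {
        const float diff = std::fabs(naive_gpu[i] - fast_gpu[i]);
        const float mag  = std::max({std::fabs(naive_gpu[i]), std::fabs(fast_gpu[i]), 1.0f});
        if(diff > tolerance * mag) // relative check with an absolute floor of `tolerance`
            return false;
    }
    return true;
}
```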
