[CONV] fix naive conv kernel for large tensors #3434
base: develop
Conversation
can we add a unit test for this?
}
else
{
    grid_size = (all_workload + block_size - 1) / block_size;
See #2748. It is an integer Ceil() function.
It's just a reminder that the problem still exists.
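For reference, a minimal sketch of such an integer ceiling-division helper (the actual Ceil() discussed in #2748 may differ in name and signature):

#include <cstddef>

// Hypothetical helper; the real Ceil() from #2748 may look different.
constexpr std::size_t CeilDiv(std::size_t numerator, std::size_t denominator)
{
    return (numerator + denominator - 1) / denominator;
}

// grid_size = (all_workload + block_size - 1) / block_size
// is equivalent to
// grid_size = CeilDiv(all_workload, block_size);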
size_t all_workload = static_cast<size_t>(group) * n * ho;
if(all_workload <= block_size)
{
    grid_size = all_workload;
}
else
{
    grid_size = (all_workload + block_size - 1) / block_size;
}
I don't feel like the problem is solved here; actually, I see a few more problems.
Now it divides the total workload by 256 - technically the overflow point is just pushed 256 times further away. Quite far, but still there.
And since it divides the total workload by 256, the GPU is underloaded by a factor of 256. That can be a huge performance drop for a wide range of legitimate tensor sizes, and even though it's a naive algorithm, we use it everywhere in the tests to compute reference data.
The last concern is the kernel itself - it should be aware of the fact that the number of groups can be capped, and it should contain an extra loop to handle that.
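For illustration only, a minimal grid-stride-loop sketch of the pattern being described (not MIOpen's actual naive kernel; the name and signature are made up): the grid can be capped to any size and every work item is still covered.

#include <hip/hip_runtime.h>

// Hypothetical kernel: each thread walks the whole index space with a stride of
// gridDim.x * blockDim.x, so a capped grid still visits every work item.
__global__ void naive_conv_like_kernel(const float* in, float* out, size_t total_work)
{
    const size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
    for(size_t idx = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
        idx < total_work;
        idx += stride)
    {
        out[idx] = in[idx]; // a real kernel would compute one output element here
    }
}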
I will implement the kernel itself to handle the capped number of groups.
I'm not sure whether a new kernel should be implemented, whether the old one can be changed, or whether the old one already has this support and we don't need to change anything - that should be checked first.
The underloaded-GPU problem should be fixed too.
Let's imagine all_workload is 256 and we have a grid size of 256; when it is 257, the grid size suddenly becomes 2.
We have more work but fewer workers.
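To make the discontinuity concrete, a small sketch using the same branching as the diff above (block_size assumed to be 256):

#include <cstddef>

constexpr std::size_t block_size = 256;

std::size_t grid_size_for(std::size_t all_workload)
{
    // Same branching as the diff above.
    return (all_workload <= block_size) ? all_workload
                                        : (all_workload + block_size - 1) / block_size;
}

// grid_size_for(256) == 256
// grid_size_for(257) == 2   -> more total work, but a much smaller grid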
Maybe we don't need to modify the kernel. We could loop over the same kernel, adjusting the chunk size and buffer offsets as needed. This would handle the uint32_t limitation in hipExtModuleLaunchKernel, which currently overflows when we pass a global work size of gridX = (589824 * 256) * 256.
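A hedged sketch of that host-side chunking idea (the names and launch wrapper below are hypothetical, not MIOpen code): split the flat index space into chunks that each fit the 32-bit launch limits and pass each chunk's base offset to the kernel.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for the real launch; in MIOpen this would go through
// hipExtModuleLaunchKernel, whose work sizes are 32-bit.
void launch_chunk(std::size_t chunk_offset, std::uint32_t chunk_work_items)
{
    std::printf("launch: offset=%zu items=%u\n", chunk_offset, chunk_work_items);
}

// Split a flat index space into chunks below an assumed conservative per-launch
// cap; the kernel would add `chunk_offset` to its global id to find its work item.
void launch_in_chunks(std::size_t total_work_items)
{
    constexpr std::size_t max_chunk = std::size_t{1} << 30; // assumption, well under 2^32

    for(std::size_t offset = 0; offset < total_work_items; offset += max_chunk)
    {
        const std::uint32_t this_chunk =
            static_cast<std::uint32_t>(std::min(total_work_items - offset, max_chunk));
        launch_chunk(offset, this_chunk);
    }
}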
I was looking to see how we handle this issue in other locations (since it seems like it would be a global constraint).
Looks like the batched_transpose solver also has a version of this issue (and it seems somewhat likely we have this issue throughout MIOpen).
For HIP this is a general constraint across any kernel launch I think:
What are the maximum limits of kernel launch parameters?
Product of block.x, block.y, and block.z should be less than 1024. Please note, HIP does not support kernel launch with total work items defined in dimension with size gridDim x blockDim >= 2^32, so gridDim.x * blockDim.x, gridDim.y * blockDim.y and gridDim.z * blockDim.z are always less than 2^32.
I think we might need to come up with a general solution for this, and make sure it's implemented broadly.
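For reference, the limits quoted above can be queried at runtime via hipGetDeviceProperties; a minimal sketch:

#include <hip/hip_runtime.h>
#include <cstdio>

int main()
{
    hipDeviceProp_t props{};
    if(hipGetDeviceProperties(&props, 0) != hipSuccess)
        return 1;

    // maxThreadsPerBlock and maxGridSize[0..2] are the per-block and per-dimension
    // grid limits the HIP FAQ refers to; any solver that derives grid dimensions
    // from tensor sizes would have to respect them.
    std::printf("maxThreadsPerBlock: %d\n", props.maxThreadsPerBlock);
    std::printf("maxGridSize: %d x %d x %d\n",
                props.maxGridSize[0], props.maxGridSize[1], props.maxGridSize[2]);
    return 0;
}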
I was looking to see how we handle this issue in other locations (since it seems like it would be a global constraint).
We don't, kind of. There are some places where the kernel is aware of the workgroup-count limit, and sometimes the number of workgroups is capped at some value like 4096. That's mostly it.
I think we might need to come up with a general solution for this, and make sure it's implemented broadly.
I'm not sure that can be easily implemented. The main reason is that the number of workgroups heavily depends on the algorithm and, most importantly, on the kernel itself, and sometimes it even comes from heuristics.
Putting some hard limit in the library will not resolve the problem, and it can even do harm: previously the runtime explicitly failed to launch the kernel, but now the launch will be silently capped and produce a wrong result, which is much harder to notice, especially when you run production code without any verification.
It's barely possible, at least without affecting current development and CI, and at the very least the test should be added as a special "huge tensor" test - a specifically tailored case for specific machines.
We do have https://github.com/ROCm/MIOpen/blob/develop/test/gpu_reference_kernel.cpp
Yes, that's a naive, single-threaded, ultra-slow CPU verification for the naive GPU algorithm. That test is not about "huge" tensors; it has exactly the problems I described.
Yes, we do need to do the slow CPU run. I can make the test a nightly run.
I'm not sure that we do. It depends on how we treat the reference data. For example, when two algorithms reach a consensus during a manual run, we can assume that they produce the same data.
The only case where this can break is when we simultaneously break the naive and non-naive implementations in exactly the same way - then both algorithms produce the same wrong result and the test passes. In all other cases, the test will indicate that either the naive or the non-naive version is broken, and that's enough to start manually checking everything.
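A rough sketch of that consensus check (a hypothetical helper, not an existing MIOpen test utility): compare the naive and non-naive results directly and only fall back to the slow CPU reference when they disagree.

#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical tolerance check between two GPU implementations of the same op.
bool results_agree(const std::vector<float>& naive,
                   const std::vector<float>& fast,
                   float tol = 1e-3f)
{
    if(naive.size() != fast.size())
        return false;
    for(std::size_t i = 0; i < naive.size(); ++i)
    {
        const float denom = std::max(std::fabs(naive[i]), 1.0f);
        if(std::fabs(naive[i] - fast[i]) / denom > tol)
            return false; // disagreement -> run the slow CPU reference to arbitrate
    }
    return true;
}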
For larger tensors I was seeing:
gdims[0] : 38,654,705,664
globalWorkSizeX : 4,294,967,295 (max allowed by uint32_t)
MaxGridDimX : 2,147,483,647
gdims[0] was exceeding MaxGridDimX / globalWorkSizeX for the driver command below:
./bin/MIOpenDriver convbfp16 -n 589824 -c 256 -H 4 -w 34 -k 256 -y 1 -x 3 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
In this PR I reduce gdims[0] so that it stays within MaxGridDimX.
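For completeness, a quick compile-time check of the numbers quoted above (a sketch; the constants are just the figures from this thread):

#include <cstdint>
#include <limits>

// Figures quoted above for the convbfp16 driver command.
constexpr std::uint64_t gdims0         = 38654705664ULL; // (589824 * 256) * 256
constexpr std::uint64_t uint32_max     = std::numeric_limits<std::uint32_t>::max(); // 4,294,967,295
constexpr std::uint64_t max_grid_dim_x = 2147483647ULL;

static_assert(gdims0 > uint32_max, "gdims[0] does not fit a 32-bit global work size");
static_assert(gdims0 > max_grid_dim_x, "gdims[0] also exceeds MaxGridDimX");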