Implemented normalization to power raising method for DataGrid. #143
Conversation
- I think it is great we have this functionality. Probably it can be optimized in the future, if it is worth it.
- Maybe it would be nice to have some performance comparison to other kernels of similar size, e.g. the time of the reduction vs. the time of an FFT.
- There is a test, so I guess it works :)
// In the end, the value at index 0 is the total sum for this work group. Save this to
// global memory.
if (iL == 0) {
    array_out[iGr] = shMem[0];
The output is then used as an input for the next level of GPU reduction? I mean, is normalizeSumReduce() called iteratively to reduce to 1/256, 1/256^2, 1/256^3, ... etc.?
barrier(CLK_LOCAL_MEM_FENCE);
for (int s = reduceGroupSize / 2; s > 0; s /= 2) {
    if (iL < s) {
        shMem[iL] += shMem[iL + s];
Wouldn't it be better to first save the sum to a __private variable, and only at the end save the result to __local (shared) memory so that other processors can see it? __private memory is still much faster than __local.
This memory definitely needs to be visible to all workers in the group. As far as I understand, the __private variables are only visible to the current worker.
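As an aside, a tiny hypothetical kernel (not from this PR) that illustrates the visibility rule:

__kernel void visibilityDemo(__global float *out)
{
    float mine = (float)get_local_id(0);  // __private by default: visible to this worker only
    __local float shared_val;             // __local: visible to the whole work group

    if (get_local_id(0) == 0) {
        shared_val = mine;                // worker 0 publishes its value
    }
    barrier(CLK_LOCAL_MEM_FENCE);         // the others must wait before reading it

    // Every worker in the group can read shared_val, but no worker can
    // read another worker's copy of `mine`.
    out[get_global_id(0)] = shared_val;
}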
Yes, if you reduce just 2 items per worker and then synchronize, then you need it to be shared.
What I was thinking is that maybe a single processor can read multiple items (>2) and save them into a private temporary variable, then write this variable into shared memory; this also means less frequent synchronization.
(I'm writing from mobile, I'll try to sketch some code when I get to a computer.)
Unless I misunderstand, this is exactly what is happening in the first loop, where the current worker first sequentially reads several items from global memory, sums them together, and then puts the result into shared memory for the parallel part. Finding some good balance between the two would take more testing.
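For concreteness, here is a minimal OpenCL sketch of the combined pattern being discussed: a sequential __private accumulation followed by the shared-memory tree reduction quoted above. The names shMem, iL, and iGr mirror the excerpts in this thread; the strided first loop and the fixed group size of 256 are assumptions for illustration, not necessarily what the PR does.

__kernel void normalizeSumReduce(__global const float *array_in,
                                 __global float *array_out,
                                 const int n)
{
    __local float shMem[256];   // assumed work-group size of 256
    const int iL = get_local_id(0);
    const int iGr = get_group_id(0);
    const int stride = (int)get_global_size(0);

    // Sequential part: each worker strides over the global array and
    // accumulates several items into a __private variable first.
    float acc = 0.0f;
    for (int i = (int)get_global_id(0); i < n; i += stride) {
        acc += array_in[i];
    }

    // One write per worker into __local (shared) memory.
    shMem[iL] = acc;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Parallel part: tree reduction in shared memory, as in the excerpt.
    for (int s = (int)get_local_size(0) / 2; s > 0; s /= 2) {
        if (iL < s) {
            shMem[iL] += shMem[iL + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // In the end, worker 0 holds the per-group partial sum; save it to
    // global memory.
    if (iL == 0) {
        array_out[iGr] = shMem[0];
    }
}

The more items each worker folds into its private accumulator before touching __local memory, the fewer barrier synchronizations are needed, which is exactly the trade-off being weighed here.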
# First do sums of the input array within each work group...
cl_program.normalizeSumReduce(queue, global_size, local_size, array_in, array_out, n)
# ... then sum the results of the first kernel call
cl_program.sumSingleGroup(queue, local_size, local_size, array_out, n_groups)
Aha OK, I see, there are 2 passes...
Yes, I purposefully set the number of groups in the first pass to be at most the size of the work group, so that the second pass can be done by a single work group.
Just on a quick test using that 350^3 array in the test script, the time for the sum reduction on my Intel integrated GPU is ~4 ms, and the time taken to compute the exponent is ~20 ms. I think the FFTs are typically in the range of some tens of ms to some hundreds of ms, depending on the scan size. So overall, I would say that the sum reduction here is pretty trivial, and probably not worth optimizing for now.
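To make the two-pass structure concrete, here is a hedged sketch of what the second-pass kernel could look like, given that the first pass leaves one partial sum per group and n_groups does not exceed the work-group size. The signature follows the host call quoted above; the body is an assumption for illustration, not the actual code in this PR.

__kernel void sumSingleGroup(__global float *array_out, const int n_groups)
{
    __local float shMem[256];   // assumed work-group size, >= n_groups
    const int iL = get_local_id(0);

    // Each worker loads one per-group partial sum; extra workers load 0.
    shMem[iL] = (iL < n_groups) ? array_out[iL] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Same shared-memory tree reduction as in the first pass.
    for (int s = (int)get_local_size(0) / 2; s > 0; s /= 2) {
        if (iL < s) {
            shMem[iL] += shMem[iL + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // The grand total for the whole array ends up in array_out[0].
    if (iL == 0) {
        array_out[0] = shMem[0];
    }
}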
Fixes #136