Implemented normalization to power raising method for DataGrid. #143

Merged 1 commit into main from density-exponent-fix on Apr 13, 2023

Conversation

NikoOinonen (Collaborator):

Fixes #136

ProkopHapala (Collaborator) left a comment:

  • I think it is great that we have this functionality. Probably it can be optimized in the future, if it is worth it.

  • Maybe it would be nice to have some performance comparison to other kernels of similar size, e.g. time of reduction vs. time of FFT.

  • There is a test, so I guess it works :)

// In the end, the value at index 0 is the total sum for this work group. Save this to
// global memory.
if (iL == 0) {
    array_out[iGr] = shMem[0];

ProkopHapala (Collaborator):

Is the output then used as an input for the next level of GPU reduction? I mean, is normalizeSumReduce() called iteratively to reduce by 1/256, 1/256^2, 1/256^3, ... etc.?

barrier(CLK_LOCAL_MEM_FENCE);
for (int s = reduceGroupSize / 2; s > 0; s /= 2) {
    if (iL < s) {
        shMem[iL] += shMem[iL + s];

ProkopHapala (Collaborator):

Would it not be better to first accumulate the sum in a __private variable, and only at the end save the result to __local (shared) memory so that the other processors can see it? __private memory is still much faster than __local.

NikoOinonen (Collaborator, author):

This memory definitely needs to be visible to all workers in the group. As far as I understand, the __private variables are only visible to the current worker.

ProkopHapala (Collaborator), Apr 13, 2023:

Yes, if you reduce just 2 items per step and then synchronize, then you need it to be shared.

What I was thinking is that maybe a single processor could read multiple items (>2) and accumulate them in a private temporary variable, and only then write this variable into shared memory; this also means less frequent synchronization.

(I am writing from mobile; I will try to sketch some code when I get to a computer.)

NikoOinonen (Collaborator, author):

Unless I misunderstand, this is exactly what is happening in the first loop, where the current worker first sequentially reads several items from global memory, sums them together, and then puts the result into shared memory for the parallel part. Finding a good balance between the two would take more testing.
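
For reference, here is a minimal sketch of that two-stage pattern (sequential accumulation in a __private register, followed by a tree reduction in __local memory). The kernel name and signature are illustrative only, not the actual kernel in this PR:

```c
// Sketch only: names and signature are illustrative, not taken from the PR.
__kernel void sumReduceSketch(__global const float *array_in,
                              __global       float *array_out,
                              const int n,
                              __local        float *shMem) {
    const int iL     = get_local_id(0);    // index within the work group
    const int iG     = get_global_id(0);   // global index of this work item
    const int iGr    = get_group_id(0);    // index of this work group
    const int stride = get_global_size(0); // total number of work items

    // Sequential part: each work item strides over the input and accumulates
    // several global elements in a fast __private register.
    float sum = 0.0f;
    for (int i = iG; i < n; i += stride) {
        sum += array_in[i];
    }
    shMem[iL] = sum;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Parallel part: tree reduction in __local memory, halving the number of
    // active work items at every step.
    for (int s = get_local_size(0) / 2; s > 0; s /= 2) {
        if (iL < s) {
            shMem[iL] += shMem[iL + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // The value at index 0 is now the partial sum of this work group.
    if (iL == 0) {
        array_out[iGr] = shMem[0];
    }
}
```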

# First do sums of the input array within each work group...
cl_program.normalizeSumReduce(queue, global_size, local_size, array_in, array_out, n)
# ... then sum the results of the first kernel call
cl_program.sumSingleGroup(queue, local_size, local_size, array_out, n_groups)

ProkopHapala (Collaborator):

Aha OK, I see, there are 2 passes...

NikoOinonen (Collaborator, author):

Yes, I purposefully set the number of groups in the first pass to be at most the size of the work group, so that the second pass can be done by a single work group.
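
As a rough illustration of that sizing choice (the numbers and variable names below are assumptions, not values taken from the PR): with a work-group size of 256, capping the first pass at 256 work groups yields at most 256 partial sums, which a single 256-item work group can reduce in the second pass.

```c
#include <stdio.h>

// Sketch only: illustrative host-side sizing for the two-pass reduction.
int main(void) {
    const size_t local_size  = 256;                   // work-group size
    const size_t n_groups    = local_size;            // cap: at most one group's worth of partial sums
    const size_t global_size = n_groups * local_size; // first-pass work items stride over the whole input

    printf("pass 1: %zu work items -> %zu partial sums\n", global_size, n_groups);
    printf("pass 2: %zu work items -> 1 total sum\n", local_size);
    return 0;
}
```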

NikoOinonen (Collaborator, author):

> Maybe it would be nice to have some performance comparison to other kernels of similar size, e.g. time of reduction vs. time of FFT.

Just from a quick test using the 350^3 array in the test script, the time for the sum reduction on my Intel integrated GPU is ~4 ms, and the time taken to compute the exponent is ~20 ms. I think the FFTs are typically in the range of some tens of ms to some hundreds of ms, depending on the scan size. So overall, I would say that the sum reduction here is pretty trivial and probably not worth optimizing for now.

NikoOinonen merged commit f3d3adc into main on Apr 13, 2023.
NikoOinonen deleted the density-exponent-fix branch on April 13, 2023 at 06:35.
Successfully merging this pull request may close these issues.

Proper handling of negative values in FDBM electron densities in OpenCL