Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPH pixelization routine is slow #2682

Closed
Xarthisius opened this issue Jun 24, 2020 · 7 comments
Closed

SPH pixelization routine is slow #2682

Xarthisius opened this issue Jun 24, 2020 · 7 comments
Assignees
Labels
triage Triage needed

Comments

@Xarthisius
Copy link
Member

Bug report

Bug summary

One of the cookbook recipes takes a lot of time on testing box:

$ time python doc/source/cookbook/image_resolution.py 

real	16m6.577s
user	16m5.294s
sys	0m4.467s

I think that the pixelize_sph_kernel_projection could be easily parallelized to circumvent this.

@Xarthisius Xarthisius self-assigned this Jun 24, 2020
@triage-new-issues triage-new-issues bot added the triage Triage needed label Jun 24, 2020
@matthewturk
Copy link
Member

I poked a little bit at it and was able to get a minor speedup at the cost of some additional complexity (in the neighborhood of 20% maybe?) by doing this diff:

diff --git a/yt/utilities/lib/pixelization_routines.pyx b/yt/utilities/lib/pixelization_routines.pyx
index 147064c28..cfc1d79e2 100644
--- a/yt/utilities/lib/pixelization_routines.pyx
+++ b/yt/utilities/lib/pixelization_routines.pyx
@@ -887,7 +887,9 @@ cdef class SPHKernelInterpolationTable:
     cdef public object kernel_name
     cdef kernel_func kernel
     cdef np.float64_t[::1] table
+    cdef np.float64_t[::1] dtable
     cdef np.float64_t[::1] q2_vals
+    cdef np.float64_t[::1] q2_over_range
     cdef np.float64_t q2_range, iq2_range
 
     def __init__(self, kernel_name):
@@ -926,15 +928,22 @@ cdef class SPHKernelInterpolationTable:
         cdef int i
         self.table = cvarray(format="d", shape=(TABLE_NVALS,),
                              itemsize=sizeof(np.float64_t))
+        self.dtable = cvarray(format="d", shape=(TABLE_NVALS,),
+                              itemsize=sizeof(np.float64_t))
         self.q2_vals = cvarray(format="d", shape=(TABLE_NVALS,),
                              itemsize=sizeof(np.float64_t))
+        self.q2_over_range = cvarray(format="d", shape=(TABLE_NVALS,),
+                             itemsize=sizeof(np.float64_t))
         # We run from 0 to 1 here over R
         for i in range(TABLE_NVALS):
             self.q2_vals[i] = i * 1.0/(TABLE_NVALS-1)
             self.table[i] = self.integrate_q2(self.q2_vals[i])
-
         self.q2_range = self.q2_vals[TABLE_NVALS-1] - self.q2_vals[0]
         self.iq2_range = (TABLE_NVALS-1)/self.q2_range
+        for i in range(TABLE_NVALS - 1):
+            # we prohibit accessing the final value
+            self.dtable[i] = self.table[i+1] - self.table[i]
+            self.q2_over_range[i] = self.q2_vals[i] * self.iq2_range
 
     @cython.initializedcheck(False)
     @cython.boundscheck(False)
@@ -942,13 +951,15 @@ cdef class SPHKernelInterpolationTable:
     @cython.cdivision(True)
     cdef inline np.float64_t interpolate(self, np.float64_t q2) nogil:
         cdef int index
+        cdef np.float64_t qf
         cdef np.float64_t F_interpolate
-        index = <int>((q2 - self.q2_vals[0])*(self.iq2_range))
+        qf = q2 * self.iq2_range
+        index = <int>qf
+        # q2_vals[0] is by construction 0
         if index >= TABLE_NVALS:
             return 0.0
         F_interpolate = self.table[index] + (
-                (self.table[index+1] - self.table[index])
-               *(q2 - self.q2_vals[index])*self.iq2_range)
+            self.dtable[index] * (qf - self.q2_over_range[index]))
         return F_interpolate
 
     def interpolate_array(self, np.float64_t[:] q2_vals):

but I'm not sure it's worth the additional complexity, and the speedup from parallelization is better anyway. This is one of those things that would be really well-suited to GPU computation.

@Xarthisius
Copy link
Member Author

Just noting that this method became serial instead of parallel with afc2b0c and it also shows there are other places where parallelization should be restored.

@matthewturk
Copy link
Member

@AshKelly do you know if the cython issue in question has been fixed?

@AshKelly
Copy link
Member

Looks like the issue is still open - cython/cython#2316

But I will checkout this repo https://github.com/ngoldbaum/cython-issue and confirm if it has been patched.

@Xarthisius
Copy link
Member Author

Oh, looks like I didn't know OpenMP 4.5's reduction is a thing (not surprisingly since last time I used OpenMP it was at 2.0 :P) and I think my implementation works around it.

@AshKelly
Copy link
Member

yeah, local buffers completely avoid it. We toyed with the idea at the time, but Nathan didn't like the extra memory overhead and I think we hoped for a relatively fast fix from the cython side. I think it's complicated fixing from their side since they didn't specify a minimum version of OpenMP / gcc.

Xarthisius added a commit to Xarthisius/yt that referenced this issue Jun 24, 2020
@Xarthisius
Copy link
Member Author

As per Matt's suggestion, it's also viable to parallelize the inner loops over pixels. That makes writes to buff atomic (no additional buffers needed) and yields similar speed up in the case presented here.

Xarthisius added a commit to Xarthisius/yt that referenced this issue Jun 24, 2020
Xarthisius added a commit to Xarthisius/yt that referenced this issue Jun 24, 2020
Xarthisius added a commit to Xarthisius/yt that referenced this issue Jun 24, 2020
Xarthisius added a commit to Xarthisius/yt that referenced this issue Jun 24, 2020
Xarthisius added a commit to Xarthisius/yt that referenced this issue Jun 25, 2020
Xarthisius added a commit to Xarthisius/yt that referenced this issue Jun 30, 2020
Xarthisius added a commit to Xarthisius/yt that referenced this issue Jul 20, 2020
Xarthisius added a commit to Xarthisius/yt that referenced this issue Jul 22, 2020
Xarthisius added a commit to Xarthisius/yt that referenced this issue Jul 23, 2020
Xarthisius added a commit to Xarthisius/yt that referenced this issue Sep 17, 2020
neutrinoceros added a commit that referenced this issue Sep 24, 2020
Parallelize pixelize_sph_kernel_projection. Fixes #2682
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
triage Triage needed
Projects
None yet
Development

No branches or pull requests

3 participants