SPH pixelization routine is slow #2682

Xarthisius · 2020-06-24T17:46:36Z

Bug report

Bug summary

One of the cookbook recipes takes a lot of time on testing box:

$ time python doc/source/cookbook/image_resolution.py 

real	16m6.577s
user	16m5.294s
sys	0m4.467s

I think that the pixelize_sph_kernel_projection could be easily parallelized to circumvent this.

The text was updated successfully, but these errors were encountered:

matthewturk · 2020-06-24T18:49:14Z

I poked a little bit at it and was able to get a minor speedup at the cost of some additional complexity (in the neighborhood of 20% maybe?) by doing this diff:

diff --git a/yt/utilities/lib/pixelization_routines.pyx b/yt/utilities/lib/pixelization_routines.pyx
index 147064c28..cfc1d79e2 100644
--- a/yt/utilities/lib/pixelization_routines.pyx
+++ b/yt/utilities/lib/pixelization_routines.pyx
@@ -887,7 +887,9 @@ cdef class SPHKernelInterpolationTable:
     cdef public object kernel_name
     cdef kernel_func kernel
     cdef np.float64_t[::1] table
+    cdef np.float64_t[::1] dtable
     cdef np.float64_t[::1] q2_vals
+    cdef np.float64_t[::1] q2_over_range
     cdef np.float64_t q2_range, iq2_range
 
     def __init__(self, kernel_name):
@@ -926,15 +928,22 @@ cdef class SPHKernelInterpolationTable:
         cdef int i
         self.table = cvarray(format="d", shape=(TABLE_NVALS,),
                              itemsize=sizeof(np.float64_t))
+        self.dtable = cvarray(format="d", shape=(TABLE_NVALS,),
+                              itemsize=sizeof(np.float64_t))
         self.q2_vals = cvarray(format="d", shape=(TABLE_NVALS,),
                              itemsize=sizeof(np.float64_t))
+        self.q2_over_range = cvarray(format="d", shape=(TABLE_NVALS,),
+                             itemsize=sizeof(np.float64_t))
         # We run from 0 to 1 here over R
         for i in range(TABLE_NVALS):
             self.q2_vals[i] = i * 1.0/(TABLE_NVALS-1)
             self.table[i] = self.integrate_q2(self.q2_vals[i])
-
         self.q2_range = self.q2_vals[TABLE_NVALS-1] - self.q2_vals[0]
         self.iq2_range = (TABLE_NVALS-1)/self.q2_range
+        for i in range(TABLE_NVALS - 1):
+            # we prohibit accessing the final value
+            self.dtable[i] = self.table[i+1] - self.table[i]
+            self.q2_over_range[i] = self.q2_vals[i] * self.iq2_range
 
     @cython.initializedcheck(False)
     @cython.boundscheck(False)
@@ -942,13 +951,15 @@ cdef class SPHKernelInterpolationTable:
     @cython.cdivision(True)
     cdef inline np.float64_t interpolate(self, np.float64_t q2) nogil:
         cdef int index
+        cdef np.float64_t qf
         cdef np.float64_t F_interpolate
-        index = <int>((q2 - self.q2_vals[0])*(self.iq2_range))
+        qf = q2 * self.iq2_range
+        index = <int>qf
+        # q2_vals[0] is by construction 0
         if index >= TABLE_NVALS:
             return 0.0
         F_interpolate = self.table[index] + (
-                (self.table[index+1] - self.table[index])
-               *(q2 - self.q2_vals[index])*self.iq2_range)
+            self.dtable[index] * (qf - self.q2_over_range[index]))
         return F_interpolate
 
     def interpolate_array(self, np.float64_t[:] q2_vals):

but I'm not sure it's worth the additional complexity, and the speedup from parallelization is better anyway. This is one of those things that would be really well-suited to GPU computation.

Xarthisius · 2020-06-24T18:54:31Z

Just noting that this method became serial instead of parallel with afc2b0c and it also shows there are other places where parallelization should be restored.

matthewturk · 2020-06-24T18:57:09Z

@AshKelly do you know if the cython issue in question has been fixed?

AshKelly · 2020-06-24T19:11:09Z

Looks like the issue is still open - cython/cython#2316

But I will checkout this repo https://github.com/ngoldbaum/cython-issue and confirm if it has been patched.

Xarthisius · 2020-06-24T19:13:59Z

Oh, looks like I didn't know OpenMP 4.5's reduction is a thing (not surprisingly since last time I used OpenMP it was at 2.0 :P) and I think my implementation works around it.

AshKelly · 2020-06-24T19:16:39Z

yeah, local buffers completely avoid it. We toyed with the idea at the time, but Nathan didn't like the extra memory overhead and I think we hoped for a relatively fast fix from the cython side. I think it's complicated fixing from their side since they didn't specify a minimum version of OpenMP / gcc.

Xarthisius · 2020-06-24T19:38:35Z

As per Matt's suggestion, it's also viable to parallelize the inner loops over pixels. That makes writes to buff atomic (no additional buffers needed) and yields similar speed up in the case presented here.

Parallelize pixelize_sph_kernel_projection. Fixes #2682

Xarthisius self-assigned this Jun 24, 2020

triage-new-issues bot added the triage Triage needed label Jun 24, 2020

Xarthisius added a commit to Xarthisius/yt that referenced this issue Jun 24, 2020

Parallelize pixelize_sph_kernel_projection. Fixes yt-project#2682

95ba7b7

Xarthisius added a commit to Xarthisius/yt that referenced this issue Jun 24, 2020

Parallelize pixelize_sph_kernel_* methods. Fixes yt-project#2682

55006c3

Xarthisius added a commit to Xarthisius/yt that referenced this issue Jun 24, 2020

Parallelize pixelize_sph_kernel_* methods. Fixes yt-project#2682

e0146b6

Xarthisius added a commit to Xarthisius/yt that referenced this issue Jun 24, 2020

Parallelize pixelize_sph_kernel_* methods. Fixes yt-project#2682

55af94c

Xarthisius added a commit to Xarthisius/yt that referenced this issue Jun 24, 2020

Parallelize pixelize_sph_kernel_* methods. Fixes yt-project#2682

2febb94

Xarthisius added a commit to Xarthisius/yt that referenced this issue Jun 24, 2020

Parallelize pixelize_sph_kernel_* methods. Fixes yt-project#2682

6c2a48f

Xarthisius added a commit to Xarthisius/yt that referenced this issue Jun 25, 2020

Parallelize pixelize_sph_kernel_* methods. Fixes yt-project#2682

31397fd

Xarthisius added a commit to Xarthisius/yt that referenced this issue Jun 30, 2020

Parallelize pixelize_sph_kernel_* methods. Fixes yt-project#2682

d19846d

Xarthisius added a commit to Xarthisius/yt that referenced this issue Jul 20, 2020

Parallelize pixelize_sph_kernel_* methods. Fixes yt-project#2682

844222c

Xarthisius added a commit to Xarthisius/yt that referenced this issue Jul 22, 2020

Parallelize pixelize_sph_kernel_* methods. Fixes yt-project#2682

c88cc4c

Xarthisius added a commit to Xarthisius/yt that referenced this issue Jul 23, 2020

Parallelize pixelize_sph_kernel_* methods. Fixes yt-project#2682

3e21909

Xarthisius added a commit to Xarthisius/yt that referenced this issue Sep 17, 2020

Parallelize pixelize_sph_kernel_* methods. Fixes yt-project#2682

50adcbc

neutrinoceros closed this as completed in 9a3e858 Sep 24, 2020

neutrinoceros added a commit that referenced this issue Sep 24, 2020

Merge pull request #2683 from Xarthisius/2682_parallel_sph_pixelization

4d9ab24

Parallelize pixelize_sph_kernel_projection. Fixes #2682

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SPH pixelization routine is slow #2682

SPH pixelization routine is slow #2682

Xarthisius commented Jun 24, 2020

matthewturk commented Jun 24, 2020

Xarthisius commented Jun 24, 2020

matthewturk commented Jun 24, 2020

AshKelly commented Jun 24, 2020

Xarthisius commented Jun 24, 2020

AshKelly commented Jun 24, 2020

Xarthisius commented Jun 24, 2020

SPH pixelization routine is slow #2682

SPH pixelization routine is slow #2682

Comments

Xarthisius commented Jun 24, 2020

Bug report

matthewturk commented Jun 24, 2020

Xarthisius commented Jun 24, 2020

matthewturk commented Jun 24, 2020

AshKelly commented Jun 24, 2020

Xarthisius commented Jun 24, 2020

AshKelly commented Jun 24, 2020

Xarthisius commented Jun 24, 2020