Performance improvement of _launch (code block 2: packing CArray) #280

ybsh · 2020-03-05T07:23:19Z

A subproblem of #153.
This issue focuses on improving one code block of _launch: code block 2, which packs the CArray.

ybsh · 2020-03-05T07:44:58Z

I'm trying to pin down the bottleneck lines by placing probes (time.perf_counter()) densely in the code block.
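
For readers unfamiliar with this kind of probing, here is a minimal sketch of the idea, not the actual CLPy code. The function `pack`, the `elapsed` dictionary, and the probe labels are hypothetical names made up for illustration; the real block is Cython and writes into `arrayInfo`. A probe is placed between consecutive statements and the differences are accumulated over many calls.

```python
import time
from collections import defaultdict

# Accumulated elapsed time per probe label (hypothetical helper, not CLPy code).
elapsed = defaultdict(float)


def pack(shape, strides, itemsize):
    """Toy stand-in for the packing block, with a probe around each step."""
    t0 = time.perf_counter()
    ndim = len(strides)
    t1 = time.perf_counter()
    elapsed["ndim"] += t1 - t0

    packed = [0] * (2 * ndim)
    for d in range(ndim):
        t0 = time.perf_counter()
        if strides[d] % itemsize != 0:
            raise ValueError("stride is not a multiple of the item size")
        t1 = time.perf_counter()
        packed[d] = shape[d]
        t2 = time.perf_counter()
        packed[d + ndim] = strides[d]
        t3 = time.perf_counter()
        # Accumulate after the last probe so bookkeeping is not measured.
        elapsed["stride check"] += t1 - t0
        elapsed["shape copy"] += t2 - t1
        elapsed["stride copy"] += t3 - t2
    return packed


for _ in range(1000):
    result = pack((28, 28), (112, 4), 4)

for label, seconds in elapsed.items():
    print(f"{label}: {seconds:.6f} s")
```

Note that each `time.perf_counter()` call has its own cost, so with probes this dense the measured times include some probe overhead and are best read as relative figures.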

ybsh · 2020-03-05T09:26:17Z

I created a new branch 280-for-profile-launch-cb2 off 153-for-profile.
I added probes as follows:
ba580c301fca

I ran train_mnist.py (100 iterations).
Total execution time of this code block (ndarray_time): 0.258964 s

            ndim = len(a.strides) # 0.0254 s
            for d in range(ndim):
                if a.strides[d] % a.itemsize != 0:   # if block:  0.028543 s
                    raise ValueError("Stride of dim {0} = {1},"
                                     " but item size is {2}"
                                     .format(d, a.strides[d], a.itemsize))
                arrayInfo.shape_and_index[d] = a.shape[d]      #  0.019907 s
                arrayInfo.shape_and_index[d + ndim] = a.strides[d] # 0.018830 s
            arrayInfo.offset = a.data.cl_mem_offset() # 0.033951 s
            arrayInfo.size = a.size # 0.011860 s

ybsh · 2020-03-05T09:41:49Z

I ran it 3 more times; the five per-line execution times do not differ much between runs (the differences are at most about +/-10%).

Probed line	1st trial (s)	2nd trial (s)	3rd trial (s)
if block	0.030271	0.027252	0.028729
shape_and_index[d]	0.021084	0.019101	0.020559
shape_and_index[d + ndim]	0.019596	0.018031	0.019892
offset	0.032501	0.031892	0.032937
size	0.012183	0.011812	0.012556
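
As a quick check of the "about +/-10%" statement, the spread can be computed from the numbers above. This is only a sketch: the row labels are an assumption (the rows are taken to follow the order of the probed lines in the listing, from the if block down to size), and "deviation" is taken here to mean the distance from the mean of the four runs, which is one possible reading.

```python
# Timings in seconds copied from the runs above: the first value in each row
# is from the initial run, the next three are the repeated trials.
# Row labels are an assumption based on the order of the probed lines.
timings = {
    "if block": [0.028543, 0.030271, 0.027252, 0.028729],
    "shape_and_index[d]": [0.019907, 0.021084, 0.019101, 0.020559],
    "shape_and_index[d + ndim]": [0.018830, 0.019596, 0.018031, 0.019892],
    "offset": [0.033951, 0.032501, 0.031892, 0.032937],
    "size": [0.011860, 0.012183, 0.011812, 0.012556],
}


def max_relative_deviation(values):
    """Largest |value - mean| / mean over the runs of one probed line."""
    mean = sum(values) / len(values)
    return max(abs(v - mean) / mean for v in values)


spread = {label: max_relative_deviation(v) for label, v in timings.items()}
for label, dev in spread.items():
    print(f"{label}: max deviation from mean = {dev:.1%}")
```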

ybsh · 2020-03-05T09:54:00Z

arrayInfo.offset = a.data.cl_mem_offset() takes the longest.
Its definition is here:
https://github.com/fixstars/clpy/blob/clpy/clpy/backend/memory.pyx/#L457-L461

y1r · 2020-03-11T05:13:42Z

I've looked into this issue, and I found that reducing the overhead of this code block is difficult.

As @ybsh reported, the elapsed times of the individual lines are all of the same order (11 to 34 ms), so there is no single hotspot.

I tried some optimizations, but none of them worked:

for d, (shape, stride) in enumerate(zip(a.shape, a.strides)):
- It may reduce the number of reference-count increments/decrements on the Python object a
- performance: no change
copy [*a.shape, *a.strides] to an array.array, extending the length of the array.array instance
- performance: became slower
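
To make the two attempts concrete, here is a pure-Python sketch of the variants next to the original indexed loop. It is an illustration only: `a` is a hypothetical stand-in object, the output is a plain list instead of `arrayInfo.shape_and_index`, and the real code is Cython, so the timings printed here say nothing about the timings inside CLPy.

```python
import array
import timeit
from types import SimpleNamespace

# Hypothetical stand-in for a CLPy ndarray; only the attributes used here.
a = SimpleNamespace(shape=(64, 1, 28, 28),
                    strides=(3136, 3136, 112, 4),
                    itemsize=4)


def pack_indexed(a):
    """Original style: index a.shape and a.strides on every iteration."""
    ndim = len(a.strides)
    out = [0] * (2 * ndim)
    for d in range(ndim):
        if a.strides[d] % a.itemsize != 0:
            raise ValueError("stride is not a multiple of the item size")
        out[d] = a.shape[d]
        out[d + ndim] = a.strides[d]
    return out


def pack_zipped(a):
    """First attempt: fetch a.shape and a.strides once and iterate with zip."""
    itemsize = a.itemsize
    ndim = len(a.strides)
    out = [0] * (2 * ndim)
    for d, (shape, stride) in enumerate(zip(a.shape, a.strides)):
        if stride % itemsize != 0:
            raise ValueError("stride is not a multiple of the item size")
        out[d] = shape
        out[d + ndim] = stride
    return out


def pack_array(a):
    """Second attempt: build one array.array from shape followed by strides."""
    itemsize = a.itemsize
    if any(stride % itemsize != 0 for stride in a.strides):
        raise ValueError("stride is not a multiple of the item size")
    return array.array("q", [*a.shape, *a.strides])


# All three variants produce the same packed values.
assert pack_indexed(a) == pack_zipped(a) == pack_array(a).tolist()

for fn in (pack_indexed, pack_zipped, pack_array):
    seconds = timeit.timeit(lambda: fn(a), number=10000)
    print(f"{fn.__name__}: {seconds:.4f} s for 10000 calls")
```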

I suggest changing the arrayInfo structure (but I have no idea how to go about it).

LWisteria · 2020-03-11T05:29:15Z

What about CuPy? It also stores an ndarray in a CArray.

y1r mentioned this issue Apr 8, 2020

Optimize _launch ndarray case by type declaration #285

Merged
