Performance improvement of _launch (code block 2: packing CArray) #280

Open
ybsh opened this issue Mar 5, 2020 · 6 comments

@ybsh
Collaborator

ybsh commented Mar 5, 2020

A subproblem of #153.
This issue focuses on improving code block 2 of _launch (packing CArray), the code block mentioned there.

@ybsh
Collaborator Author

ybsh commented Mar 5, 2020

I'm trying to pin down the bottleneck lines by placing probes (time.perf_counter()) densely in the code block.
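
For readers who want to reproduce this kind of measurement, here is a minimal sketch of dense probing with time.perf_counter(). It is not the code from the branch: the Probe helper, the probe_totals dictionary, and the pack() stand-in are all hypothetical names, and pack() only mimics the shape/stride packing with a plain Python list.

```python
import time
from collections import defaultdict

# Accumulated seconds per probe label (hypothetical helper, not part of ClPy).
probe_totals = defaultdict(float)


class Probe:
    """Context manager that adds the elapsed wall time to probe_totals[label]."""

    def __init__(self, label):
        self.label = label

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        probe_totals[self.label] += time.perf_counter() - self.start
        return False  # never swallow exceptions


def pack(shape, strides, itemsize):
    """Stand-in for the packing block: returns [shape..., strides...]."""
    with Probe("ndim"):
        ndim = len(strides)
    out = [0] * (2 * ndim)
    for d in range(ndim):
        with Probe("stride check"):
            if strides[d] % itemsize != 0:
                raise ValueError("Stride of dim {0} = {1},"
                                 " but item size is {2}"
                                 .format(d, strides[d], itemsize))
        with Probe("shape"):
            out[d] = shape[d]
        with Probe("strides"):
            out[d + ndim] = strides[d]
    return out


# Call the block many times so the per-line totals become measurable.
for _ in range(100):
    packed = pack((28, 28), (112, 4), 4)

for label, seconds in probe_totals.items():
    print("{0}: {1:.6f} s".format(label, seconds))
```

Note that each probe adds its own overhead, so with statements this short the accumulated totals are likely to overstate the real cost of each line.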

@ybsh
Collaborator Author

ybsh commented Mar 5, 2020

I created a new branch 280-for-profile-launch-cb2 off 153-for-profile.
I added probes as follows:
ba580c301fca

I ran train_mnist.py (100 iterations).
Total execution time of this code block (ndarray_time): 0.258964 s

            ndim = len(a.strides)                                   # 0.0254 s
            for d in range(ndim):
                if a.strides[d] % a.itemsize != 0:                  # whole if block: 0.028543 s
                    raise ValueError("Stride of dim {0} = {1},"
                                     " but item size is {2}"
                                     .format(d, a.strides[d], a.itemsize))
                arrayInfo.shape_and_index[d] = a.shape[d]           # 0.019907 s
                arrayInfo.shape_and_index[d + ndim] = a.strides[d]  # 0.018830 s
            arrayInfo.offset = a.data.cl_mem_offset()               # 0.033951 s
            arrayInfo.size = a.size                                 # 0.011860 s

@ybsh
Collaborator Author

ybsh commented Mar 5, 2020

I ran it 3 more times. The execution times of the five probed statements do not differ much between runs (the differences are at most about +/-10%).

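Statement (time in s)       1st trial   2nd trial   3rd trial
if block (stride check)     0.030271    0.027252    0.028729
shape_and_index[d]          0.021084    0.019101    0.020559
shape_and_index[d + ndim]   0.019596    0.018031    0.019892
arrayInfo.offset            0.032501    0.031892    0.032937
arrayInfo.size              0.012183    0.011812    0.012556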
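
As a quick check of the "+/-10%" claim, the sketch below computes how far each trial deviates from the first measurement. It assumes that the five rows correspond, in order, to the five per-statement timings annotated in the code above (if block, shape, strides, offset, size); that mapping is inferred from the magnitudes and is not stated in the thread. The variable names are illustrative only.

```python
# First-run timings (s) taken from the code comments in the earlier comment.
baseline = {
    "if block": 0.028543,
    "shape": 0.019907,
    "strides": 0.018830,
    "offset": 0.033951,
    "size": 0.011860,
}

# Three follow-up trials (s); the row-to-statement mapping is an assumption.
trials = {
    "if block": [0.030271, 0.027252, 0.028729],
    "shape": [0.021084, 0.019101, 0.020559],
    "strides": [0.019596, 0.018031, 0.019892],
    "offset": [0.032501, 0.031892, 0.032937],
    "size": [0.012183, 0.011812, 0.012556],
}

# Largest relative deviation from the first run, per statement.
max_rel_dev = {
    name: max(abs(t - baseline[name]) / baseline[name] for t in times)
    for name, times in trials.items()
}
worst = max(max_rel_dev.values())

for name, dev in max_rel_dev.items():
    print("{0}: max deviation {1:.1%}".format(name, dev))
```

Under that assumption, every statement stays within the stated 10% band relative to the first run.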

@ybsh
Collaborator Author

ybsh commented Mar 5, 2020

arrayInfo.offset = a.data.cl_mem_offset() takes the longest.
Its definition is here:
https://github.com/fixstars/clpy/blob/clpy/clpy/backend/memory.pyx/#L457-L461

@y1r
Collaborator

y1r commented Mar 11, 2020

I looked into this issue and found that reducing the overhead of this code block is difficult.

As @ybsh reported, the elapsed time of each line is almost the same (about 11 to 34 ms), so there is no hotspot.

I tried some optimizations, but none of them helped:

  • Iterate with for d, (shape, stride) in enumerate(zip(a.shape, a.strides)):
    • This might reduce the reference-count increments/decrements on the Python object a.
    • Performance: no change.
  • Copy [*a.shape, *a.strides] into an array.array, expanding the length of the array.array instance.
    • Performance: slower.
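
The first variant can be compared against the original indexed loop with a small micro-benchmark. The sketch below is plain Python, not the Cython code in ClPy, so its timings will not match the numbers in this thread; FakeArray, pack_indexed, and pack_zipped are hypothetical names used only for illustration.

```python
import timeit


class FakeArray:
    """Plain-Python stand-in for an ndarray (hypothetical, for illustration)."""

    def __init__(self, shape, strides, itemsize):
        self.shape = shape
        self.strides = strides
        self.itemsize = itemsize


def pack_indexed(a):
    """Original style: index a.shape and a.strides on every iteration."""
    ndim = len(a.strides)
    out = [0] * (2 * ndim)
    for d in range(ndim):
        out[d] = a.shape[d]
        out[d + ndim] = a.strides[d]
    return out


def pack_zipped(a):
    """Variant: look up a.shape and a.strides once, then iterate with zip."""
    ndim = len(a.strides)
    out = [0] * (2 * ndim)
    for d, (shape, stride) in enumerate(zip(a.shape, a.strides)):
        out[d] = shape
        out[d + ndim] = stride
    return out


# A 4-D float32-like array: strides are in bytes, itemsize is 4.
a = FakeArray((64, 1, 28, 28), (3136, 3136, 112, 4), 4)

t_indexed = timeit.timeit(lambda: pack_indexed(a), number=10000)
t_zipped = timeit.timeit(lambda: pack_zipped(a), number=10000)
print("indexed: {0:.6f} s, zipped: {1:.6f} s".format(t_indexed, t_zipped))
```

Both functions produce the same packed list, so any difference in the printed times comes from the iteration style alone. For arrays with only a few dimensions the gap may well be lost in measurement noise.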

I suggest changing the arrayInfo structure, although I don't have a concrete idea of how to do it yet.

@LWisteria
Member

How does CuPy handle this? CuPy also stores an ndarray in a CArray.
