
Performance in batch mode #21

Open · VolkerH opened this issue Mar 14, 2019 · 5 comments

Comments

VolkerH (Owner) commented Mar 14, 2019

For this particular dataset, the naive approach takes about 1 s per image, including read/write. I can see the GPU utilization going up and down as well.

[screenshot]

In my command-line batch tool, processing the same dataset is almost a factor of 4 slower:

[screenshot]

That code does an additional affine transform and maximum intensity projection (MIP), but that should not make a significant difference. Maybe the overhead is due to passing around partially evaluated functions.
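For reference, a minimal sketch of the naive per-volume loop described above (read_volume, deconvolve and write_volume are hypothetical placeholders standing in for the actual I/O and GPU calls, not functions from this repository):

# Minimal sketch of the naive batch loop; read_volume, deconvolve and
# write_volume are hypothetical placeholders, not functions from this repo.
import time
from pathlib import Path

def run_naive_batch(input_dir, output_dir):
    for in_file in sorted(Path(input_dir).glob("*.tif")):
        start = time.perf_counter()
        volume = read_volume(in_file)                           # disk read
        result = deconvolve(volume)                             # GPU work, blocks until done
        write_volume(Path(output_dir) / in_file.name, result)   # disk write
        print(f"{in_file.name}: {time.perf_counter() - start:.2f} s")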

VolkerH (Owner, Author) commented Mar 14, 2019

Partial function evaluation does indeed incur a performance penalty:
https://stackoverflow.com/questions/17388438/python-functools-partial-efficiency
Still, the slowdown is almost a factor of 4, which is difficult to explain by function call overhead alone.
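A rough micro-benchmark (a sketch, not code from this repository) shows the per-call penalty of functools.partial is tiny compared to ~1 s of GPU work per volume, which supports that conclusion:

# Micro-benchmark sketch: direct call vs. functools.partial.
import timeit
from functools import partial

def process(volume, niter=10):
    return volume  # stand-in for the real work

p = partial(process, niter=10)

print("direct :", timeit.timeit(lambda: process(0, niter=10), number=1_000_000))
print("partial:", timeit.timeit(lambda: p(0), number=1_000_000))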

VolkerH (Owner, Author) commented Mar 15, 2019

Profiling with cProfile. This is for deconvolving 100 volumes with 10 iterations each. The actual deconvolution takes about 0.75 s per frame. The next largest contributor is a pyopencl call (probably related to deskew/rotate) taking nearly 0.4 s. numpy's astype takes up a considerable amount of processing time, as do reading and writing. Some of this could probably be parallelized.

Fri Mar 15 11:57:54 2019    process_stats

         2275854 function calls (2263132 primitive calls) in 347.478 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      100   75.846    0.758   75.846    0.758 {built-in method _pywrap_tensorflow_internal.TF_SessionRun_wrapper}
      101   39.725    0.393   39.725    0.393 {built-in method pyopencl._cl._enqueue_read_buffer}
      227   36.042    0.159   36.042    0.159 {built-in method numpy.core.multiarray.concatenate}
      922   34.942    0.038   34.942    0.038 {method 'astype' of 'numpy.ndarray' objects}
      101   32.304    0.320   32.305    0.320 /home/vhil0002/anaconda3/envs/newllsm/lib/python3.6/site-packages/pyopencl/__init__.py:872(image_init)
    10252   20.804    0.002   20.804    0.002 {method 'readinto' of '_io.BufferedReader' objects}
      101   20.215    0.200   20.215    0.200 {method 'tofile' of 'numpy.ndarray' objects}
      217   19.790    0.091   19.790    0.091 {built-in method numpy.core.multiarray.copyto}
      101   13.144    0.130   13.144    0.130 {method 'clip' of 'numpy.ndarray' objects}
      101   12.960    0.128   12.960    0.128 {built-in method pyopencl._cl.enqueue_nd_range_kernel}
      100    8.832    0.088  343.422    3.434 /home/vhil0002/anaconda3/envs/newllsm/lib/python3.6/site-packages/l

I have run similar profiling for the gputools Richardson-Lucy (RL) deconvolution. There, most of the time is spent in np.astype.
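For reference, stats like the above can be produced along these lines (batch_process is a hypothetical stand-in for the actual batch entry point):

# Sketch: profile a batch run, dump the stats to a file and print the
# functions sorted by internal time, as in the table above.
import cProfile
import pstats

cProfile.run("batch_process()", "process_stats")
pstats.Stats("process_stats").sort_stats("tottime").print_stats(15)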

VolkerH (Owner, Author) commented Mar 15, 2019

Some more comments about the gputools deconvolution (separate from flowdec). There are some obvious improvements to be made, e.g. an FFT plan is calculated in the gputools implementation but never used. Also, the PSF is pre-processed and sent to the GPU each time the deconvolution is called. Separating the deconvolution into an init step and a run step would allow the PSF to be processed once and left on the GPU (see the sketch below).
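A sketch of that init/run split, using numpy FFTs on the CPU as a stand-in for the gputools/OpenCL code (so this is not the actual implementation):

# Sketch of the init/run split: the PSF is pre-processed once in __init__
# and reused for every volume in run(). numpy FFTs stand in for the GPU code.
import numpy as np

class RLDeconvolver:
    def __init__(self, psf, shape, n_iter=10):
        # init step (once per PSF): pad, normalise and precompute the OTF
        self.n_iter = n_iter
        padded = np.zeros(shape, dtype=np.float32)
        padded[tuple(slice(0, s) for s in psf.shape)] = psf / psf.sum()
        padded = np.roll(padded, [-(s // 2) for s in psf.shape], axis=(0, 1, 2))
        self.otf = np.fft.rfftn(padded)
        self.otf_conj = np.conj(self.otf)

    def run(self, volume):
        # run step (once per volume): only the image data is new
        estimate = np.full(volume.shape, volume.mean(), dtype=np.float32)
        for _ in range(self.n_iter):
            blurred = np.fft.irfftn(np.fft.rfftn(estimate) * self.otf, volume.shape)
            ratio = volume / (blurred + 1e-6)
            correction = np.fft.irfftn(np.fft.rfftn(ratio) * self.otf_conj, volume.shape)
            estimate = estimate * correction
        return estimate

With this structure, constructing RLDeconvolver(psf, vol.shape) is paid once, and only run(vol) happens per volume.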

TODO: check whether flowdec sends the PSF to the GPU each time (I believe it does). Maybe that can be optimized as well.

VolkerH (Owner, Author) commented Mar 15, 2019

Rewrote maweigert's gputools-based deconvolution to reuse the FFT plan, the pre-processed PSF (which remains in GPU RAM) and the temporary GPU buffers, and removed an unnecessary duplicate .astype(np.complex64).
See https://github.com/VolkerH/Lattice_Lightsheet_Deskew_Deconv/blob/benchmarking/lls_dd/deconv_gputools_rewrite.py

Major speed improvement. The actual deconvolution is now much faster than the overall time per iteration of the batch loop, so reading and writing from disk will have to happen in separate threads (see the sketch below).
[screenshot]
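A sketch of how the disk I/O could be overlapped with the GPU work (read_volume and write_volume are hypothetical placeholders, deconv is an object with a run() method as in the rewrite above, and files is a list of pathlib.Path objects):

# Sketch: overlap disk I/O with GPU work by reading ahead and writing back
# on worker threads while the main thread drives the deconvolution.
from concurrent.futures import ThreadPoolExecutor

def run_batch(files, deconv, out_dir):
    if not files:
        return
    with ThreadPoolExecutor(max_workers=2) as pool:
        next_read = pool.submit(read_volume, files[0])   # prefetch first volume
        pending_write = None
        for i, f in enumerate(files):
            volume = next_read.result()                  # wait for the prefetch
            if i + 1 < len(files):
                next_read = pool.submit(read_volume, files[i + 1])  # read ahead
            result = deconv.run(volume)                  # GPU work
            if pending_write is not None:
                pending_write.result()                   # finish the previous write
            pending_write = pool.submit(write_volume, out_dir / f.name, result)
        if pending_write is not None:
            pending_write.result()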

VolkerH (Owner, Author) commented Mar 15, 2019

cProfile stats for the above three runs:
gputools rewrite: [screenshot of cProfile stats]
gputools: [screenshot of cProfile stats]
flowdec: [screenshot of cProfile stats]
