Performance in batch mode #21
Partial function evaluation indeed incurs a performance penalty.
Profiling with cProfile, deconvolving 100 volumes with 10 iterations each: the actual deconvolution takes about 0.75 s per frame. The next largest contributor is a pyopencl call (probably related to deskew/rotate) at nearly 0.4 s. np.astype takes up a considerable amount of processing time, as do reading and writing. Some of this could probably be parallelized.
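For reference, a minimal sketch of how such a profile can be collected and inspected; `my_batch_tool` and `deconvolve_batch` are placeholder names, not the actual entry points:

```python
import cProfile
import pstats

# Hypothetical batch entry point; replace with the actual deconvolution loop.
from my_batch_tool import deconvolve_batch

profiler = cProfile.Profile()
profiler.enable()
deconvolve_batch("input_dir", "output_dir", iterations=10)
profiler.disable()

# Show the 20 most expensive calls by cumulative time.
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(20)
```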
I have run similar profiling for the gputools RL deconvolution. There, most of the time is spent in …
Some more comments about gputools deconvolve (separate from flowdec). There are some obvious improvements that can be made; e.g., an FFT plan is calculated in the gputools implementation but never used. Also, the PSF is pre-processed and sent to the GPU each time the deconvolution is called. Separating the deconvolution into an init step and a run step would allow the PSF to be processed once and left on the GPU, as sketched below. TODO: check whether flowdec sends the PSF to the GPU each time (I believe it does). Maybe that can be optimized as well.
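A minimal sketch of the proposed init/run split, using plain NumPy FFTs as a stand-in for the GPU implementation; the class name and details are illustrative, not the actual gputools or flowdec API. The point is only that all PSF processing happens once, up front:

```python
import numpy as np

class RLDeconvolver:
    """Illustrative init/run split: the PSF is processed once, then reused."""

    def __init__(self, psf, shape):
        # One-time PSF processing: pad, normalize, center at the origin,
        # and precompute the OTF. On a GPU this would stay in device memory.
        padded = np.zeros(shape, dtype=np.float32)
        padded[tuple(slice(0, s) for s in psf.shape)] = psf / psf.sum()
        padded = np.roll(padded, [-s // 2 for s in psf.shape],
                         axis=tuple(range(padded.ndim)))
        self._otf = np.fft.rfftn(padded)
        self._otf_conj = np.conj(self._otf)  # mirrored PSF for the correlation step
        self._shape = shape

    def run(self, image, iterations=10):
        # Only per-volume work happens here; no PSF re-processing.
        estimate = np.full(self._shape, image.mean(), dtype=np.float32)
        for _ in range(iterations):
            blurred = np.fft.irfftn(np.fft.rfftn(estimate) * self._otf, s=self._shape)
            ratio = image / np.maximum(blurred, 1e-6)
            estimate *= np.fft.irfftn(np.fft.rfftn(ratio) * self._otf_conj, s=self._shape)
        return estimate
```

With such a split, a batch loop constructs the deconvolver once and calls `run` per volume, so the PSF processing (and, on a GPU, the device buffers) is amortized over the whole batch:

```python
deconv = RLDeconvolver(psf, shape=volumes[0].shape)  # one-time PSF processing
results = [deconv.run(v, iterations=10) for v in volumes]
```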
Rewrote maweigert's gputools-based deconvolution to reuse the FFT plan, the processed PSF (which remains in GPU RAM), and the temporary GPU buffers, and removed an unnecessary duplicate. Major speed improvement: the actual deconvolution is now much faster than the overall time per iteration, so reading/writing from disk will have to happen in separate threads (see the sketch below).
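A sketch of what reading and writing in separate threads could look like, overlapping disk I/O with GPU compute; `tifffile` for I/O and the `deconv.run` interface are assumptions carried over from the sketch above:

```python
from concurrent.futures import ThreadPoolExecutor

import tifffile  # assumption: volumes are stored as TIFF stacks

def process_batch(in_paths, out_paths, deconv, iterations=10):
    """Overlap disk I/O with compute using reader/writer threads."""
    with ThreadPoolExecutor(max_workers=1) as reader, \
         ThreadPoolExecutor(max_workers=1) as writer:
        # Prefetch the first volume before entering the loop.
        next_read = reader.submit(tifffile.imread, in_paths[0])
        pending_writes = []
        for i, out_path in enumerate(out_paths):
            volume = next_read.result()
            if i + 1 < len(in_paths):
                # Start reading the next volume while the GPU is busy.
                next_read = reader.submit(tifffile.imread, in_paths[i + 1])
            result = deconv.run(volume, iterations=iterations)
            # Write asynchronously; the main thread moves on immediately.
            pending_writes.append(writer.submit(tifffile.imwrite, out_path, result))
        for fut in pending_writes:
            fut.result()  # surface any write errors
```

Single-worker pools keep reads and writes ordered while still letting each overlap with compute; more workers would only help if the disk is not already the bottleneck.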
For this particular dataset, the naive approach takes about 1 s per image including read/write. I can see the GPU utilization going up and down as well.
In my command-line batch tool, processing that same dataset is almost a factor of 4 slower. That code does an additional affine transform and MIP, but those should not make a significant difference. Maybe the overhead is due to passing around partially evaluated functions.
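A quick way to probe that hypothesis is a `timeit` micro-benchmark comparing a direct call with the same call routed through `functools.partial`; the `step` function here is a placeholder, not the real pipeline:

```python
import timeit
from functools import partial

def step(volume, iterations=10, pad_mode="reflect"):
    # Placeholder for one pipeline step; real work would happen here.
    return volume

wrapped = partial(step, iterations=10, pad_mode="reflect")

n = 1_000_000
print("direct :", timeit.timeit(lambda: step(0, iterations=10, pad_mode="reflect"), number=n))
print("partial:", timeit.timeit(lambda: wrapped(0), number=n))
```

Per-call `partial` overhead is typically well under a microsecond, so if a large gap persists on seconds-long frames, the cause may be work hidden inside the partially applied functions (e.g. setup redone per volume) rather than the call indirection itself.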