
Transpose + np.copy() + (fancy indexing and/or ndarray.copy()) causes major slowdowns #322

Merged
merged 1 commit into main on Mar 6, 2023

Conversation

@talonchandler (Collaborator)

Fixes waveorder #102.

TL;DR: transposes and copies are slowing down the recOrder -> waveorder interface.

The recOrder -> waveorder adapter functions use transposes to bridge the gap between recOrder's CZYX order and waveorder's CXYZ order. Transposes of >=3-dimensional arrays can result in non-contiguous views of the original arrays, and np.copy preserves that layout, so the copies are not C-contiguous either. This means that waveorder often receives non-contiguous arrays.
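
To make the contiguity point concrete, here is a quick check of the flags (illustrative only, not code from this PR):

import numpy as np

A = np.random.random((4, 8, 256, 256))   # C-contiguous CZYX-style array
At = np.transpose(A, (0, 2, 3, 1))       # CXYZ-style view; no data is moved

print(A.flags['C_CONTIGUOUS'])           # True
print(At.flags['C_CONTIGUOUS'])          # False: the view strides across memory

At_copy = np.copy(At)                    # order='K' default preserves At's layout
print(At_copy.flags['C_CONTIGUOUS'])     # False: the copy is not C-contiguous either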

numpy functions give the correct answers when you pass non-contiguous arrays, but they are often much slower. Fancy indexing and .copy() operations are particularly slow on non-contiguous arrays, since they need to seek across a large region of memory.

waveorder_reconstructor.Polarization_recon receives non-contiguous arrays and uses fancy indexing (e.g. sa_wrapped[ret_wrapped < 0] += np.pi / 2) and .copy() operations (Recon_para[0] = ret_wrapped.copy()). The .copy() operations are particularly slow when they receive non-contiguous arrays they weren't written to expect.

Confusingly, np.copy() has a different default than ndarray.copy(): np.copy defaults to order='K' and matches the source's memory layout, while ndarray.copy defaults to C order (and is the variant recommended by numpy). Here's a minimal working example illustrating the difference in behavior (and the core source of the slowdown that this PR fixes):

Minimal example script:
import numpy as np
import time

for i in [3, 4, 5]:
    Z = 2**i
    print(f"Slices = {Z}")

    A = np.random.random((4, Z, 2048, 2048))  # C-contiguous array
    At = np.transpose(A, (0, 2, 3, 1))        # non-contiguous view
    At_copy = np.copy(At)                     # order='K' default: layout matches At

    t = time.time()
    copy1 = A[0].copy()                       # contiguous source: fast
    print(f'A[0].copy(): \t\t{time.time() - t:.2f}')

    t = time.time()
    copy2 = np.copy(At[0])                    # order='K': streams memory in layout order
    print(f'np.copy(At[0]): \t{time.time() - t:.2f}')

    t = time.time()
    copy3 = At[0].copy()                      # order='C': strided gather, slow
    print(f'At[0].copy(): \t\t{time.time() - t:.2f}')

Results in:

Slices = 8
A[0].copy(): 		0.04
np.copy(At[0]): 	0.04
At[0].copy(): 		0.07
Slices = 16
A[0].copy(): 		0.07
np.copy(At[0]): 	0.07
At[0].copy(): 		2.38
Slices = 32
A[0].copy(): 		0.14
np.copy(At[0]): 	0.14
At[0].copy(): 		19.19

The slow case is At[0].copy(): its C-order default forces a reordering copy that gathers elements scattered across the transposed buffer, while np.copy's order='K' default can stream through memory in layout order. This PR is a first step towards unravelling these issues. The Polarization_recon function doesn't need any transposes (even though recOrder applies them), so we can get the correct results quickly by removing the transposes and their inverses.
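
Schematically, the change is of this shape (illustrative pseudocode, not the exact diff; stokes_czyx and the transpose axes are assumptions for illustration):

# Before (schematic): transpose CZYX -> CXYZ for waveorder, reconstruct,
# then transpose the result back to CZYX
stokes_cxyz = np.transpose(stokes_czyx, (0, 2, 3, 1))   # non-contiguous view
birefringence_cxyz = reconstructor.Polarization_recon(stokes_cxyz)
birefringence = np.transpose(birefringence_cxyz, (0, 3, 1, 2))

# After (schematic): Polarization_recon gives the same answers without the
# transposes, so pass the CZYX array straight through
birefringence = reconstructor.Polarization_recon(stokes_czyx)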

I suspect we'll find other similar speedups across the recOrder -> waveorder interface, and I will start my search in commonly used areas with transposes and copies. The longer-term solution is to change waveorder to use the CZYX order so that no transposes are necessary.

@codecov-commenter commented Feb 28, 2023

Codecov Report

Merging #322 (1fb6ee5) into main (5d56b29) will increase coverage by 0.00%.
The diff coverage is 0.00%.


@@          Coverage Diff          @@
##            main    #322   +/-   ##
=====================================
  Coverage   4.22%   4.23%           
=====================================
  Files         23      23           
  Lines       5154    5150    -4     
=====================================
  Hits         218     218           
+ Misses      4936    4932    -4     
Impacted Files Coverage Δ
recOrder/compute/reconstructions.py 0.00% <0.00%> (ø)


@deprecated-napari-hub-preview-bot

Preview page for your plugin is ready here:
https://preview.napari-hub.org/mehta-lab/recOrder/322
Created: 2023-02-28T21:57:41.324888

@ziw-liu (Contributor) left a comment

Great find! I've always been a bit confused by the many np.copy() calls around the recOrder and waveorder codebases. In this case a deep copy does seem redundant.

Another potential issue is precision: float64 is used liberally, but for uint16 raw data the extra precision of doubles is not always worth the halved effective bandwidth, especially for L3-cache- and RAM-intensive ops like FFTs.
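
As a sketch of the suggestion (not code from this PR; the shape is a hypothetical stand-in), casting the raw uint16 data to float32 once up front halves the working set relative to float64:

import numpy as np

# Stand-in for raw uint16 acquisition data (hypothetical shape)
raw = np.zeros((5, 16, 512, 512), dtype=np.uint16)
data32 = raw.astype(np.float32)  # 4 bytes per element
data64 = raw.astype(np.float64)  # 8 bytes per element: twice the FFT working set
print(data64.nbytes / data32.nbytes)  # 2.0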

@mattersoflight (Member)

@talonchandler Great find!

  • Is it possible that Polarization_recon updates stokes, which may explain why a deep copy of the stokes volume is passed? Can you check that stokes remains unchanged after Polarization_recon returns?
  • How much speedup do you observe as a function of the number of z-slices?
  • I agree with @ziw-liu that single-precision floats are sufficient for our reconstructions.

@edyoshikun (Contributor) left a comment

@talonchandler I am unable to get the speedups after replacing birefringence = reconstruct_qlipp_birefringence(stokes, reconstructor) with birefringence = reconstructor.Polarization_recon(stokes).

I'm still getting relatively similar timings with increasing z, even though the new code looks good and makes sense. Am I missing something?

I was changing pol_data = reader.get_zarr(0)[t, 1:6, z_start:z_end] to different intervals to test different numbers of z-slices.

Here is some test code:

#%%
from recOrder.io.utils import load_bg
from recOrder.compute.reconstructions import (
    initialize_reconstructor,
    reconstruct_qlipp_stokes,
    reconstruct_qlipp_birefringence,
    reconstruct_phase3D,
)
from datetime import datetime
import numpy as np
import os
import sys
from iohub.reader import ImageReader
# debugging / profiling
import time
import cProfile as profile
import pstats

# %%
# Load the Data
data_root = "/hpc/projects/comp_micro/projects/zebrafish-infection/2023_02_02_hummingbird_casper_GFPmacs"
dataset = "intoto_casper_short_1_2023_02_08_110049.zarr"
dataset_folder = os.path.join(data_root, dataset)
print('data:' + dataset_folder)
# Background folder name
bg_root = "/hpc/projects/comp_micro/rawdata/hummingbird/Ed/2023_02_02_zebrafish_casper"
bg_folder = os.path.join(bg_root, "BG_3")
# %%
#Setup Readers
reader = ImageReader(dataset_folder)
t = 0
pos = 0
pol_data = reader.get_zarr(0)[t, 1:6]
C, Z, Y, X  = pol_data.shape
print(pol_data.shape)

# fluor_data = data[6:]
bg_data = load_bg(bg_folder, height=Y, width=X)

# %%
# Get first position
print("Start Reconstructor")
reconstructor_args = {
    "image_dim": (Y, X),
    "mag": 16,  # magnification   is 200/165
    "pixel_size_um": 3.45,  # pixel size in um
    "wavelength_nm": 532,
    "swing": 0.08,
    "calibration_scheme": "5-State",  # "4-State" or "5-State"
    "NA_obj": 0.55,  # numerical aperture of objective
    "NA_illu": 0.45,  # numerical aperture of condenser
    "n_obj_media": 1.0,  # refractive index of objective immersion media
    "bg_correction": "None",  # BG correction method: "None", "local_fit", "global"
}

reconstructor = initialize_reconstructor(
    pipeline="birefringence", **reconstructor_args
)
    
# Reconstruct data Stokes w/ background correction
print("Begin stokes calculation")
start_time = time.time()
# prof = profile.Profile()
# prof.enable()
# bg_stokes = reconstruct_qlipp_stokes(bg_data,reconstructor)
# stokes = reconstruct_qlipp_stokes(pol_data, reconstructor,bg_stokes)
stokes = reconstruct_qlipp_stokes(pol_data, reconstructor, bg_stokes=None)
# prof.disable()
print(f'Stokes elapsed time:{time.time() - start_time}')
print(f"Shape of background corrected data Stokes: {np.shape(stokes)}")
# Reconstruct Birefringence from Stokes
print("Begin birefringence calculation")
bire_start_time = time.time()
birefringence = reconstructor.Polarization_recon(stokes)
birefringence[0] = (
    birefringence[0] / (2 * np.pi) * reconstructor_args["wavelength_nm"]
)
print(f'Bire elapsed time:{time.time() - bire_start_time}')
print(f'Total elapsed time:{time.time() - start_time}')
# %%
# print profiling output
# stats = pstats.Stats(prof).strip_dirs().sort_stats("cumtime")
# stats.print_stats(10)  # top 10 rows

@talonchandler (Collaborator, Author) commented Mar 2, 2023

Hi Ed,

I just ran a comparison of
Old: birefringence = reconstruct_qlipp_birefringence(stokes, reconstructor) [before this PR]
New: birefringence = reconstructor.Polarization_recon(stokes)

Results:

10 slices:
Stokes: 25 s
Birefringence (new): 1.7 s
Birefringence (old): 3.7 s [~2x]

20 slices:
Stokes: 55 s
Birefringence (new): 3 s
Birefringence (old): 20 s [~6x]

40 slices:
Stokes: 111 s
Birefringence (new): 7 s
Birefringence (old): 111 s [~16x]

As the number of slices grows, the speed improvement increases. (Note that the "Birefringence (old)" numbers match your results in mehta-lab/waveorder#102).

Let me know if you can't reproduce this result.

@edyoshikun (Contributor) left a comment

Thanks @talonchandler. Yes, I can confirm there is a boost in speed and that this fixes the slowdown I was experiencing. Great debugging!!

One thing to note, which should perhaps become its own issue: using the scatter-gather method here and just throwing silicon (64 cores) at it, we can do 40 slices (stokes + birefringence) in 9 s, with each process taking 0.8 s for stokes and 0.4 s for birefringence. (A rough sketch of the pattern follows.)
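
A rough sketch of that scatter-gather pattern (the recon_slice worker is a hypothetical stand-in for the real per-slice computation; the actual recOrder helper may differ):

import numpy as np
from multiprocessing import Pool

def recon_slice(stokes_zslice):
    # Hypothetical per-slice worker: stands in for the real per-slice
    # stokes + birefringence computation.
    return stokes_zslice

def scatter_gather(stokes_czyx, n_workers=4):
    # Scatter z-slices across worker processes, then gather and restack.
    zslices = [stokes_czyx[:, z] for z in range(stokes_czyx.shape[1])]
    with Pool(n_workers) as pool:
        results = pool.map(recon_slice, zslices)
    return np.stack(results, axis=1)

if __name__ == "__main__":
    out = scatter_gather(np.zeros((4, 8, 64, 64)))
    print(out.shape)  # (4, 8, 64, 64)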

@talonchandler (Collaborator, Author) commented Mar 2, 2023

@mattersoflight

Is it possible that Polarization_recon updates stokes, which may explain why a deep copy of the stokes volume is passed? Can you check that stokes remains unchanged after Polarization_recon returns?

I just confirmed that Polarization_recon does not modify its input, and the output is stored in a separate block of memory. Thanks for bringing up this possibility.
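
For reference, the check can be as simple as this sketch (assuming stokes and reconstructor as in the script above):

import numpy as np

stokes_before = stokes.copy()                       # snapshot the input
birefringence = reconstructor.Polarization_recon(stokes)
assert np.array_equal(stokes, stokes_before)        # input unchanged
assert not np.shares_memory(stokes, birefringence)  # output in separate memory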

How much speedup do you observe as a function of the number of z-slices?

For this specific function's input with size Z x 2000 x 2000 I'm seeing:
2x for Z = 10
6x for Z = 20
16x for Z = 40
17x for Z = 80
The improvement in this PR saturates at ~17-fold and moves the bottleneck to the application of A_inv (see @edyoshikun's comments and his new issue #323).

I agree with @ziw-liu that single-precision floats are sufficient for our reconstructions.

I agree. I'll open a separate issue.

@mattersoflight (Member) left a comment

a stroke of genius in dealing with stokes parameters.

@talonchandler (Collaborator, Author)

Just did a final test on hummingbird... everything's working well here. Merging.

@talonchandler talonchandler added this pull request to the merge queue Mar 6, 2023
Merged via the queue into main with commit 6742be9 Mar 6, 2023
@ziw-liu ziw-liu deleted the fix-transpose-copy-slowdown branch March 14, 2023 04:23
Successfully merging this pull request may close these issues: Slow birefringence reconstruction for z-stacks