
Transpose + np.copy() + (fancy indexing and/or ndarray.copy()) causes major slowdowns #322

Merged
merged 1 commit into main on Mar 6, 2023

Conversation

@talonchandler (Collaborator)

Fixes waveorder #102.

TL;DR: transposes and copies are slowing down the recOrder -> waveorder interface.

The recOrder -> waveorder adapter functions use transposes to bridge the gap between recOrder's CZYX order and waveorder's CXYZ order. Transposes of >=3-dimensional arrays can result in non-contiguous views of the original arrays, and np.copy preserves that layout, so the copies are not C-contiguous either. This means that waveorder often receives non-contiguous arrays.
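
To make the contiguity point concrete, here is a quick check of the flags (illustrative only, not code from this PR):

import numpy as np

A = np.random.random((4, 8, 256, 256))   # C-contiguous CZYX-style array
At = np.transpose(A, (0, 2, 3, 1))       # CXYZ-style view; no data is moved

print(A.flags['C_CONTIGUOUS'])           # True
print(At.flags['C_CONTIGUOUS'])          # False: the view strides across memory

At_copy = np.copy(At)                    # order='K' default preserves At's layout
print(At_copy.flags['C_CONTIGUOUS'])     # False: the copy is not C-contiguous either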

numpy functions give the correct answers when you pass non-contiguous arrays, but they are often much slower. Fancy indexing and .copy() operations are particularly slow on non-contiguous arrays, since they need to seek across a large region of memory.

waveorder_reconstructor.Polarization_recon receives non-contiguous arrays and uses fancy indexing (e.g. sa_wrapped[ret_wrapped < 0] += np.pi / 2) and .copy() operations (Recon_para[0] = ret_wrapped.copy()). The .copy() operations are particularly slow when they receive non-contiguous arrays they weren't written to expect.

Confusingly, np.copy() has a different default than ndarray.copy(): np.copy defaults to order='K' and matches the source's memory layout, while ndarray.copy defaults to C order (and is the variant recommended by numpy). Here's a minimal working example illustrating the difference in behavior (and the core source of the slowdown that this PR fixes):

Minimal example script:
import numpy as np
import time

for i in [3, 4, 5]:
    Z = 2**i
    print(f"Slices = {Z}")

    A = np.random.random((4, Z, 2048, 2048))  # C-contiguous array
    At = np.transpose(A, (0, 2, 3, 1))        # non-contiguous view
    At_copy = np.copy(At)                     # order='K' default: layout matches At

    t = time.time()
    copy1 = A[0].copy()                       # contiguous source: fast
    print(f'A[0].copy(): \t\t{time.time() - t:.2f}')

    t = time.time()
    copy2 = np.copy(At[0])                    # order='K': streams memory in layout order
    print(f'np.copy(At[0]): \t{time.time() - t:.2f}')

    t = time.time()
    copy3 = At[0].copy()                      # order='C': strided gather, slow
    print(f'At[0].copy(): \t\t{time.time() - t:.2f}')

Results in:

Slices = 8
A[0].copy(): 		0.04
np.copy(At[0]): 	0.04
At[0].copy(): 		0.07
Slices = 16
A[0].copy(): 		0.07
np.copy(At[0]): 	0.07
At[0].copy(): 		2.38
Slices = 32
A[0].copy(): 		0.14
np.copy(At[0]): 	0.14
At[0].copy(): 		19.19

The slow case is At[0].copy(): its C-order default forces a reordering copy that gathers elements scattered across the transposed buffer, while np.copy's order='K' default can stream through memory in layout order. This PR is a first step towards unravelling these issues. The Polarization_recon function doesn't need any transposes (even though recOrder applies them), so we can get the correct results quickly by removing the transposes and their inverses.
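
Schematically, the change is of this shape (illustrative pseudocode, not the exact diff; stokes_czyx and the transpose axes are assumptions for illustration):

# Before (schematic): transpose CZYX -> CXYZ for waveorder, reconstruct,
# then transpose the result back to CZYX
stokes_cxyz = np.transpose(stokes_czyx, (0, 2, 3, 1))   # non-contiguous view
birefringence_cxyz = reconstructor.Polarization_recon(stokes_cxyz)
birefringence = np.transpose(birefringence_cxyz, (0, 3, 1, 2))

# After (schematic): Polarization_recon gives the same answers without the
# transposes, so pass the CZYX array straight through
birefringence = reconstructor.Polarization_recon(stokes_czyx)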

I suspect we'll find other similar speedups across the recOrder -> waveorder interface, and I will start my search in commonly used areas with transposes and copies. The longer-term solution is to change waveorder to use the CZYX order so that no transposes are necessary.

@codecov-commenter commented Feb 28, 2023

Codecov Report

Merging #322 (1fb6ee5) into main (5d56b29) will increase coverage by 0.00%.
The diff coverage is 0.00%.


@@          Coverage Diff          @@
##            main    #322   +/-   ##
=====================================
  Coverage   4.22%   4.23%           
=====================================
  Files         23      23           
  Lines       5154    5150    -4     
=====================================
  Hits         218     218           
+ Misses      4936    4932    -4     
Impacted Files Coverage Δ
recOrder/compute/reconstructions.py 0.00% <0.00%> (ø)


@deprecated-napari-hub-preview-bot

Preview page for your plugin is ready here:
https://preview.napari-hub.org/mehta-lab/recOrder/322
Created: 2023-02-28T21:57:41.324888

@ziw-liu (Contributor) left a comment

Great find! I've always been a bit confused by the many np.copy() calls around the recOrder and waveorder codebases. In this case a deep copy does seem redundant.

Another potential issue is precision: float64 is used liberally, but for uint16 raw data the extra precision of doubles is not always worth the halved effective bandwidth, especially for L3-cache- and RAM-intensive ops like FFTs.
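
As a sketch of the suggestion (not code from this PR; the shape is a hypothetical stand-in), casting the raw uint16 data to float32 once up front halves the working set relative to float64:

import numpy as np

# Stand-in for raw uint16 acquisition data (hypothetical shape)
raw = np.zeros((5, 16, 512, 512), dtype=np.uint16)
data32 = raw.astype(np.float32)  # 4 bytes per element
data64 = raw.astype(np.float64)  # 8 bytes per element: twice the FFT working set
print(data64.nbytes / data32.nbytes)  # 2.0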

@mattersoflight (Member)

@talonchandler Great find!

  • Is it possible that Polarization_recon updates stokes, which may explain why a deep copy of the stokes volume is passed? Can you check that stokes remains unchanged after Polarization_recon returns?
  • How much speedup do you observe as a function of the number of z-slices?
  • I agree with @ziw-liu that single-precision floats are sufficient for our reconstructions.

@edyoshikun (Contributor) left a comment

@talonchandler I am unable to get the speedups after replacing birefringence = reconstruct_qlipp_birefringence(stokes, reconstructor) with birefringence = reconstructor.Polarization_recon(stokes).

I'm still getting relatively similar timings with increasing z, even though the new code looks good and makes sense. Am I missing something?

I was changing pol_data = reader.get_zarr(0)[t, 1:6, z_start:z_end] to different intervals to test different numbers of z-slices.

Here is some test code:

#%%
from recOrder.io.utils import load_bg
from recOrder.compute.reconstructions import (
    initialize_reconstructor,
    reconstruct_qlipp_stokes,
    reconstruct_qlipp_birefringence,
    reconstruct_phase3D,
)
from datetime import datetime
import numpy as np
import os
import sys
from iohub.reader import ImageReader
# debugging / profiling
import time
import cProfile as profile
import pstats

# %%
# Load the Data
data_root = "/hpc/projects/comp_micro/projects/zebrafish-infection/2023_02_02_hummingbird_casper_GFPmacs"
dataset = "intoto_casper_short_1_2023_02_08_110049.zarr"
dataset_folder = os.path.join(data_root, dataset)
print('data:' + dataset_folder)
# Background folder name
bg_root = "/hpc/projects/comp_micro/rawdata/hummingbird/Ed/2023_02_02_zebrafish_casper"
bg_folder = os.path.join(bg_root, "BG_3")
# %%
#Setup Readers
reader = ImageReader(dataset_folder)
t = 0
pos = 0
pol_data = reader.get_zarr(0)[t, 1:6]
C, Z, Y, X  = pol_data.shape
print(pol_data.shape)

# fluor_data = data[6:]
bg_data = load_bg(bg_folder, height=Y, width=X)

# %%
# Get first position
print("Start Reconstructor")
reconstructor_args = {
    "image_dim": (Y, X),
    "mag": 16,  # magnification   is 200/165
    "pixel_size_um": 3.45,  # pixel size in um
    "wavelength_nm": 532,
    "swing": 0.08,
    "calibration_scheme": "5-State",  # "4-State" or "5-State"
    "NA_obj": 0.55,  # numerical aperture of objective
    "NA_illu": 0.45,  # numerical aperture of condenser
    "n_obj_media": 1.0,  # refractive index of objective immersion media
    "bg_correction": "None",  # BG correction method: "None", "local_fit", "global"
}

reconstructor = initialize_reconstructor(
    pipeline="birefringence", **reconstructor_args
)
    
# Reconstruct data Stokes w/ background correction
print("Begin stokes calculation")
start_time = time.time()
# prof = profile.Profile()
# prof.enable()
# bg_stokes = reconstruct_qlipp_stokes(bg_data,reconstructor)
# stokes = reconstruct_qlipp_stokes(pol_data, reconstructor,bg_stokes)
stokes = reconstruct_qlipp_stokes(pol_data, reconstructor, bg_stokes=None)
# prof.disable()
print(f'Stokes elapsed time:{time.time() - start_time}')
print(f"Shape of background corrected data Stokes: {np.shape(stokes)}")
# Reconstruct Birefringence from Stokes
print("Begin birefringence calculation")
bire_start_time = time.time()
birefringence = reconstructor.Polarization_recon(stokes)
birefringence[0] = (
    birefringence[0] / (2 * np.pi) * reconstructor_args["wavelength_nm"]
)
print(f'Bire elapsed time:{time.time() - bire_start_time}')
print(f'Total elapsed time:{time.time() - start_time}')
# %%
# print profiling output
# stats = pstats.Stats(prof).strip_dirs().sort_stats("cumtime")
# stats.print_stats(10)  # top 10 rows

@talonchandler (Collaborator, Author) commented Mar 2, 2023

Hi Ed,

I just ran a comparison of
Old: birefringence = reconstruct_qlipp_birefringence(stokes, reconstructor) [before this PR]
New: birefringence = reconstructor.Polarization_recon(stokes)

Results:

10 slices:
Stokes: 25 s
Birefringence (new): 1.7 s
Birefringence (old): 3.7 s [~2x]

20 slices:
Stokes: 55 s
Birefringence (new): 3 s
Birefringence (old): 20 s [~6x]

40 slices:
Stokes: 111 s
Birefringence (new): 7 s
Birefringence (old): 111 s [~16x]

As the number of slices grows, the speed improvement increases. (Note that the "Birefringence (old)" numbers match your results in mehta-lab/waveorder#102).

Let me know if you can't reproduce this result.

@edyoshikun (Contributor) left a comment

Thanks @talonchandler. Yes, I can confirm there is a boost in speed and that this fixes the slowdown I was experiencing. Great debugging!!

One thing to note, which should perhaps become its own issue: using the scatter-gather method here and just throwing silicon (64 cores) at it, we can do 40 slices (stokes + birefringence) in 9 s, with each process taking 0.8 s for stokes and 0.4 s for birefringence. (A rough sketch of the pattern follows.)
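
A rough sketch of that scatter-gather pattern (the recon_slice worker is a hypothetical stand-in for the real per-slice computation; the actual recOrder helper may differ):

import numpy as np
from multiprocessing import Pool

def recon_slice(stokes_zslice):
    # Hypothetical per-slice worker: stands in for the real per-slice
    # stokes + birefringence computation.
    return stokes_zslice

def scatter_gather(stokes_czyx, n_workers=4):
    # Scatter z-slices across worker processes, then gather and restack.
    zslices = [stokes_czyx[:, z] for z in range(stokes_czyx.shape[1])]
    with Pool(n_workers) as pool:
        results = pool.map(recon_slice, zslices)
    return np.stack(results, axis=1)

if __name__ == "__main__":
    out = scatter_gather(np.zeros((4, 8, 64, 64)))
    print(out.shape)  # (4, 8, 64, 64)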

@talonchandler (Collaborator, Author) commented Mar 2, 2023

@mattersoflight

Is it possible that Polarization_recon updates stokes, which may explain why a deep copy of the stokes volume is passed? Can you check that stokes remains unchanged after Polarization_recon returns?

I just confirmed that Polarization_recon does not modify its input, and the output is stored in a separate block of memory. Thanks for bringing up this possibility.
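
For reference, the check can be as simple as this sketch (assuming stokes and reconstructor as in the script above):

import numpy as np

stokes_before = stokes.copy()                       # snapshot the input
birefringence = reconstructor.Polarization_recon(stokes)
assert np.array_equal(stokes, stokes_before)        # input unchanged
assert not np.shares_memory(stokes, birefringence)  # output in separate memory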

How much speedup do you observe as a function of the number of z-slices?

For this specific function's input with size Z x 2000 x 2000 I'm seeing:
2x for Z = 10
6x for Z = 20
16x for Z = 40
17x for Z = 80
The improvement in this PR saturates at ~17-fold and moves the bottleneck to the application of A_inv (see @edyoshikun's comments and his new issue #323).

I agree with @ziw-liu that single-precision floats are sufficient for our reconstructions.

I agree. I'll open a separate issue.

@mattersoflight (Member) left a comment

a stroke of genius in dealing with stokes parameters.

@talonchandler (Collaborator, Author)

Just did a final test on hummingbird... everything's working well here. Merging.

@talonchandler talonchandler added this pull request to the merge queue Mar 6, 2023
Merged via the queue into main with commit 6742be9 Mar 6, 2023
@ziw-liu ziw-liu deleted the fix-transpose-copy-slowdown branch March 14, 2023 04:23
Successfully merging this pull request may close these issues: Slow birefringence reconstruction for z-stacks