Imread performance: reduced overhead of pim.open calls when reading from image sequence #182

Merged
merged 1 commit into from
Feb 19, 2021

m-albert
Collaborator

As reported in #181, dask_image.imread performs poorly when reading from many input files. #121 and #161 are related.

One problem I noticed in the current implementation: when dask_image.imread.imread is called with a file pattern such as im_*.tif, pims.open is called on the entire file pattern for each tile that is loaded. This leads to many unnecessary instantiations of pims.ImageSequenceND, which produces a large overhead.
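To make the overhead concrete, here is a toy model that counts how many filenames get resolved under each strategy. It does not use pims at all: `CountingOpener`, `load_tiles_via_pattern`, `load_tiles_via_filenames` and `compare` are hypothetical names, and the sketch assumes that the cost of opening a pattern grows with the number of files it matches.

```python
import glob
import os
import tempfile


class CountingOpener:
    """Hypothetical stand-in for a sequence opener that records its work."""

    def __init__(self):
        self.calls = 0          # number of open() calls
        self.files_touched = 0  # filenames resolved, summed over all calls

    def open(self, pattern):
        paths = sorted(glob.glob(pattern))
        self.calls += 1
        self.files_touched += len(paths)
        return paths


def load_tiles_via_pattern(opener, pattern, n_tiles):
    """Reported behaviour: every tile re-opens the whole pattern."""
    return [opener.open(pattern)[i] for i in range(n_tiles)]


def load_tiles_via_filenames(opener, pattern):
    """Resolve the pattern once, then open only the file each tile needs."""
    filenames = sorted(glob.glob(pattern))
    return [opener.open(glob.escape(fn))[0] for fn in filenames]


def compare(n_files):
    """Return (files touched per-pattern, files touched per-file, same result)."""
    with tempfile.TemporaryDirectory() as tmp:
        # Empty placeholder files are enough, since only names are counted.
        for i in range(n_files):
            open(os.path.join(tmp, "im_%05d.tif" % i), "wb").close()
        pattern = os.path.join(glob.escape(tmp), "im_*.tif")
        old, new = CountingOpener(), CountingOpener()
        tiles_old = load_tiles_via_pattern(old, pattern, n_files)
        tiles_new = load_tiles_via_filenames(new, pattern)
        return old.files_touched, new.files_touched, tiles_old == tiles_new
```

In this model, `compare(20)` returns `(400, 20, True)`: the per-pattern strategy touches N × N filenames for N single-frame files, while the per-file strategy touches N.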

In this proposed fix, I use glob to match filenames and frames so that pims.open is called only on the files that are actually being loaded.

files. Solved by resolving input filenames using glob and encoding them
into a dask array used as input to the `map_blocks` reading step.
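A minimal sketch of that idea, under stated assumptions: it is not the code from this PR, `lazy_stack` and `_read_block` are hypothetical names, and `np.load` on `.npy` files stands in for the `pims.open` call so that the example has no image-format dependencies. It also assumes one frame per file.

```python
import glob

import dask.array as da
import numpy as np


def _read_block(filename_block):
    """Load the single file named in this block and add a frame axis."""
    # map_blocks passes a NumPy array holding this chunk's filenames;
    # with chunks=1 that is exactly one name.
    filename = str(filename_block[0])
    # np.load is a stand-in reader (assumption); a real reader such as
    # pims.open would be called on this one file instead.
    return np.load(filename)[np.newaxis, ...]


def lazy_stack(pattern, frame_shape, dtype):
    """Build a lazy (n_files, *frame_shape) array from a glob pattern."""
    filenames = sorted(glob.glob(pattern))
    if not filenames:
        raise ValueError(f"no files match {pattern!r}")
    frame_shape = tuple(frame_shape)
    # One filename per chunk, so each task sees only the file it must read.
    names = da.from_array(np.array(filenames), chunks=1)
    return names.map_blocks(
        _read_block,
        dtype=dtype,
        # Each output block is one frame: shape (1, *frame_shape).
        chunks=(1,) + frame_shape,
        new_axis=list(range(1, len(frame_shape) + 1)),
        # Explicit meta keeps dask from calling the reader on dummy input.
        meta=np.empty((0,) * (len(frame_shape) + 1), dtype=dtype),
    )
```

The pattern is resolved once when the graph is built, and each task then receives a single filename, so no task needs to know about the rest of the sequence.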
@m-albert
Collaborator Author

Here are some performance tests (inspired by #181):

Data prep:

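import glob
import os

import numpy as np
import skimage.io
import dask_image.imread
from dask import delayed
import dask.array as da

Nz, Ny, Nx = 1000, 100, 100
# uint16 matches the dtype declared for the dask arrays in the timings
im = np.random.randint(0, 1000, (Nz, Ny, Nx), dtype=np.uint16)

# write one TIFF per z-slice; make sure the output directory exists
os.makedirs('data', exist_ok=True)
for z in range(Nz):
    skimage.io.imsave('data/im_%05d.tif' % z, im[z])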

Timings:

%%time
all_images = sorted(glob.glob("data/im_*.tif"))
imgs = np.array([skimage.io.imread(image) for image in all_images])

Serial read without dask: 420ms

%%time
lazy_imread = delayed(skimage.io.imread)  # lazy reader
lazy_arrays = [lazy_imread(image) for image in all_images]
dask_arrays = [
    da.from_delayed(delayed_reader, shape=(Ny, Nx), dtype=np.uint16)
    for delayed_reader in lazy_arrays
]
using_dask = da.stack(dask_arrays, axis=0).compute()

Using dask delayed: 1.05s

%%time
# dask_image.imread on the file pattern, the call this PR optimizes
using_dask_image = dask_image.imread.imread("data/im_*.tif").compute()

Master: 16s
This PR: 1.04s

@m-albert changed the title from "Imread: reduced overhead of pim.open calls when reading from many files" to "Imread: reduced overhead of pim.open calls when reading from image sequence" on Jan 20, 2021
@m-albert changed the title from "Imread: reduced overhead of pim.open calls when reading from image sequence" to "Imread performance: reduced overhead of pim.open calls when reading from image sequence" on Jan 20, 2021
Base automatically changed from master to main February 2, 2021 01:18
@GenevieveBuckley
Collaborator

Genuinely sorry, not sure how this managed to fall off my radar. Adding it back to the to-do list now!

@GenevieveBuckley
Collaborator

Looks good to me too.

At some point we should consider adding performance tests, but that's a conversation for another day.

@m-albert
Collaborator Author

@GenevieveBuckley Great, thanks for reviewing and merging!
