dask_image imread performance issue #181
Hi @jmontoyam, you're completely right. I had a look at your example and found that the problem lies in `dask_image/imread/_utils.py`, lines 12 to 14 (at commit 695bc43).
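For context, the referenced helper reads each chunk by opening the file with pims, roughly along these lines (a paraphrase for illustration, not the exact source):

```python
import numpy as np
import pims

def _read_frame(fn, i):
    # The image sequence is opened anew for every chunk, so many small
    # reads each pay the full cost of constructing a pims reader.
    with pims.open(fn) as imgs:
        return np.asanyarray(imgs[i])
```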
I guess for the application you're working on, your "pure dask" code does the trick. As you also commented, it's okay that it's a bit slower than the sequential non-dask version, since its primary purpose is to provide lazy data chunks.
Thank you very much @m-albert! I am happy to contribute to this amazing project for the first time, even if it is just a tiny bug report.
@jmontoyam That's a good question.
Sorry, clicked the wrong button!
I've just merged #182 from @m-albert. Since that seems to fix @jmontoyam's problem, I'll close this issue now. Feel free to reopen if I'm wrong about that.
Reopening this issue, based on more recent discussions about the performance of imread. From @m-albert in #194 (comment):
Here's a performance comparison:

```python
import skimage.io
import pims

def pimread(fn):
    with pims.open(fn) as imgs:
        return np.asarray(imgs[0])
```

```python
%%time
for i in range(10):
    dask_image.imread.imread(os.path.join(folder, 'im_*.tif')).compute(scheduler='threads')
# 8.7 s for Nz, Ny, Nx = 1000, 100, 100
```

```python
%%time
for i in range(10):
    dask.array.image.imread(os.path.join(folder, 'im_*.tif'), imread=pimread).compute(scheduler='threads')
# 6.8 s for Nz, Ny, Nx = 1000, 100, 100
```

```python
%%time
for i in range(10):
    dask.array.image.imread(os.path.join(folder, 'im_*.tif'), imread=skimage.io.imread).compute(scheduler='threads')
# 5.1 s for Nz, Ny, Nx = 1000, 100, 100
```

So reading entire files is fastest with `skimage.io.imread`. For array creation, since `pims` reads shape and dtype from file metadata without loading full frames, `dask_image.imread.imread` is faster when the individual files are large:

```python
%%time
dask_image.imread.imread(os.path.join(folder, 'im_*.tif'))
# 470 ms for Nz, Ny, Nx = 2, 10000, 10000
# 250 ms for Nz, Ny, Nx = 10000, 10, 10
```

```python
%%time
dask.array.image.imread(os.path.join(folder, 'im_*.tif'))
# 1.05 s for Nz, Ny, Nx = 2, 10000, 10000
# 50 ms for Nz, Ny, Nx = 10000, 10, 10
```

From @jakirkham in #194 (comment):
That's a good idea. Given the good performance of `skimage.io.imread`, we could use it when it's available and fall back to `pims` otherwise:

```python
try:
    from skimage.io import imread as imread_func
except (AttributeError, ImportError):
    def imread_func(fn):
        return np.array(pims.open(fn))
```
Yeah, we are already thinking about having some scikit-image style APIs, so adding scikit-image as a dependency makes sense.
cc @jni (in case you have thoughts here 🙂)
Doesn't scikit-image use `imageio` under the hood?

One thing I wish we had was a lightweight, minimal-dependency way to install just an image reader to get stuff into a dask array. I'm not sure if there's a really clean way to handle this.
Does it? I got the impression it was more complicated than that. Feel free to contradict me if that's not accurate (guessing you or Juan would know better).

Yeah, we've discussed with imageio before whether they could provide the shape and dtype without loading the full image (imageio/imageio#362). This would actually let it take the place of PIMS for Dask Array construction and handle the data loading. Unfortunately, that hasn't been solved yet.
One thing worth noting here is that PIMS does seem to be fairly clever about loading a range of pages from a TIFF, which can be handy if the TIFFs are massive; loading a smaller range of pages can result in more manageable chunks for Dask (see the sketch below). While I have seen this use case in the wild, not everyone does this; I'm guessing it is more a function of the acquisition software people are using. Not saying we should tailor things to that use case, but I had forgotten this detail until we started discussing this recently, so I wanted to share that context.
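For illustration, a minimal sketch of reading just a page range through pims (the filename and page range are hypothetical):

```python
import numpy as np
import pims

# Read only pages 100-199 of a large multi-page TIFF; pims fetches
# frames lazily, so the rest of the file is never decoded.
with pims.open('big_stack.tif') as imgs:
    chunk = np.stack([np.asarray(imgs[i]) for i in range(100, 200)])
```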
I think you're probably right.
That's a really important point. It also seems like a good moment to mention Nick's earlier question about loading movie files efficiently (#134), which is another important use case to keep in mind. For what it's worth, I'm collaborating with a group who are working with […]
Thanks for reminding me about Nick's use case. If we do care about optimizing both use cases, we could try to detect when chunks are smaller than the original files and handle those with PIMS, and otherwise just load the whole thing into memory with scikit-image (or whatever else we decide to use here). This could get a little tricky, but I think it's doable. I'm not sure whether this would be easier to do during graph construction or at image load time, so whichever is easier would probably be the way to go initially; we could change things later if there's a performance benefit.
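A minimal sketch of that dispatch idea (`read_chunk` is a hypothetical helper, not dask-image's actual implementation), assuming we know how many frames each file holds:

```python
import numpy as np
import pims
from skimage.io import imread as skimage_imread

def read_chunk(fn, start, stop, frames_in_file):
    """Read frames [start, stop) from fn, picking the cheaper reader."""
    if stop - start < frames_in_file:
        # The chunk covers only part of the file: let pims load just
        # those pages instead of decoding the whole file.
        with pims.open(fn) as imgs:
            return np.stack([np.asarray(imgs[i]) for i in range(start, stop)])
    # The chunk covers the whole file: a single full read is faster.
    return np.asarray(skimage_imread(fn))
```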
I do have opinions here! (1) skimage.io will eventually become a thin wrapper around imageio. None of this is particularly helpful re dask-image's present choice, except to say that maybe some or all of the effort in this discussion should go towards those issues rather than towards adding yet another way of wrapping wrappers around IO libraries.

Re tifffile: for reading TIFFs it always boils down to tifffile (whether you're using imageio or skimage.io), so if you want to do lazy loading of big TIFFs I suggest implementing it on top of tifffile directly. It certainly has that capability; there's no need for PIMS here.
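For example, a sketch of lazy TIFF loading built directly on tifffile (assuming reasonably recent tifffile and zarr versions; the filename is hypothetical):

```python
import dask.array as da
import tifffile
import zarr

# tifffile can expose a TIFF as a zarr store without reading pixel
# data up front; dask then wraps it as a lazy array.
store = tifffile.imread('big.tif', aszarr=True)
arr = da.from_zarr(zarr.open(store, mode='r'))
```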
@GenevieveBuckley One can also envision a dask plugin/wrapper that loads the image straight into a dask array and avoids a copy. I'm thinking about doing this for pytorch and tensorflow at some point down the line (loading directly into pinned memory), because my current lab does a lot of deep learning with images.
This is also something that can be done with imageio 🤩. We maintain an ffmpeg wrapper (imageio-ffmpeg). Thoughts and feature requests are (of course) appreciated.
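For reference, a small sketch of reading video frames through imageio's ffmpeg backend (the filename and the `process` handler are hypothetical; requires the imageio-ffmpeg package):

```python
import imageio

# Frames are yielded one at a time as numpy arrays, so a long movie
# never has to fit in memory all at once.
reader = imageio.get_reader('movie.mp4')
for frame in reader:
    process(frame)  # hypothetical per-frame handler
```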
@jakirkham It will 👼 as soon as I get imageio/imageio#574 merged and find the time to write the wrapper for skimage.
Metadata for images is a never-ending story xD I think the reason there is no clear standard for it yet in imageio is that every format has its own set of metadata, so it is non-trivial to find common ground that we can guarantee to provide for all formats. For me specifically, the user side is a bit of a black box, because I've not really seen use cases yet where people actively consume metadata; then again, I've only recently joined this particular corner of the internet, so there is a lot I may not know (yet).
@jakirkham are there any disadvantages you see in @FirefoxMetzger's comment?
A quick update/question (to bring this thread back from the dead): I have started looking into adding […]
Also, if I should not necromance this thread and instead start a new one to discuss this, please let me know and I'll do that :D
Sorry, but I don't really follow. This doesn't seem relevant to Dask AFAICT. It isn't NumPy. Dask Arrays are lazy and do not themselves support the Python buffer protocol. Individual Dask chunks would be created by asking ImageIO to open a file. Generally Dask Arrays expect NumPy or NumPy-like chunks. Also, Dask defers to other libraries to create the memory they need to consume.

From the Dask perspective, the really important thing here is that Dask needs to first construct a task graph containing accurate metadata, namely the shape and dtype of each chunk. Ideally we want to avoid loading images (slow) to get that metadata; reading the headers of files would be ideal. At least from Dask's perspective, it doesn't really need to know more than shape and dtype.
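To make that concrete, a sketch of building a lazy array from header metadata alone (`read_header` and `lazy_imread` are hypothetical names standing in for any reader that can parse shape and dtype without decoding pixels):

```python
import dask.array as da
from dask import delayed
import skimage.io

def lazy_imread(filenames, read_header):
    # Only headers are touched here; pixel data is read at compute time.
    arrays = []
    for fn in filenames:
        shape, dtype = read_header(fn)
        arrays.append(
            da.from_delayed(delayed(skimage.io.imread)(fn),
                            shape=shape, dtype=dtype)
        )
    return da.stack(arrays)
```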
Oh, in that case, sorry for the noise. I was under the impression that images are loaded centrally and that chunks are then distributed as needed. Since this is not the case, my mental model is off and this is indeed not too relevant.
I think we can do that. It is possible for the major plugins, which cover all commonly used image formats, so that should get us 90% of the way there. I'll look into this once my current round of PRs is merged 👍
No worries. Dask would run the loading step on workers, which would use the data directly. In general, Dask tries to minimize communication, both in its usage model and when distributing tasks to workers, as communication can get pretty expensive. While I can imagine use cases that might benefit, unfortunately I don't think Dask is one of them.

That would be incredibly useful! Thank you 😄
More discussion at dask/dask#8385
Dear dask_image community,
I am a new dask_image user. Maybe, due to my beginner level, I am doing something wrong, but I noticed that reading a collection of images using dask_image is much slower than using single-threaded skimage. I have installed the latest dask_image version available on PyPI (dask_image version 0.4.0).

In the following example, I am reading 398 images, all of them with the same dimensions (64x10240, uint16). Taking into account the dimensions and number of images, I would expect dask_image to be slightly slower than single-threaded skimage (due to the small dask overhead involved in opening this modest number of tiny images), but instead dask_image is much slower (around 24x). I then implemented the image-reading function in pure dask, and its performance is much better than dask_image's. Below are the benchmark results (all of the following code snippets load the same data successfully):
```python
import glob
import numpy as np
import skimage.io
import dask_image.imread
from dask import delayed
import dask.array as da
```
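A sketch of the kind of code being compared (the exact snippets aren't shown above; the file pattern is hypothetical):

```python
filenames = sorted(glob.glob('images/*.tif'))

# Single-threaded skimage baseline: read every file eagerly.
baseline = np.stack([skimage.io.imread(fn) for fn in filenames])

# dask_image: build a lazy array from the same glob pattern, then compute.
stack = dask_image.imread.imread('images/*.tif').compute()
```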
Benchmark results:

- Single-threaded skimage baseline: 510 ms
- dask_image: 12.1 s
- Pure dask (see the sketch after this list): 1.09 s
- dask_image, synchronous scheduler: 3 s
- dask_image, processes scheduler: 6.63 s
- dask_image, threads scheduler: 12 s
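A sketch of the kind of "pure dask" reader described above, wrapping skimage.io.imread in dask.delayed (a hedged reconstruction; the file pattern is hypothetical):

```python
filenames = sorted(glob.glob('images/*.tif'))

# Read one file eagerly to learn the per-image shape and dtype.
sample = skimage.io.imread(filenames[0])

# Wrap each read in dask.delayed and stack into one lazy array.
lazy_arrays = [
    da.from_delayed(delayed(skimage.io.imread)(fn),
                    shape=sample.shape, dtype=sample.dtype)
    for fn in filenames
]
stack = da.stack(lazy_arrays).compute()
```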
Environment:
Thank you very much for all your help ;)