For some schedulers, setting PIMS image reader's .class_priority is ineffective in controlling dask-image.imread()
#262
Comments
I see how that would be useful. Have you tried using the dask.array.image.imread function (regular dask, not the one in dask-image)? It allows you to pass in your preferred reader function directly, which seems even easier than fiddling with the priority levels.

    from dask.array.image import imread
    import pims
    data = imread('path/to/files/*.tif', imread=pims.ImageIOReader)

(Having two different imread functions in two different places kinda violates the Python zen "There should be one-- and preferably only one --obvious way to do it", which I don't like. You can read some more discussion about that in #229 if you like.)

Let us know if that fixes your issue.
Hi @GenevieveBuckley, thank you! Until now, I had been unaware of The API of Unfortunately though, as it is now,

    from dask.array.image import imread
    import pims
    video = imread('path/to/video.mp4', imread=pims.ImageIOReader)

Click to see error messages:
|
|
Sure.
|
After updating the python package However, I noticed that Though obviously I can "squeeze" the first dimension out, I did not expect it in the first place since |
Hm, yes. It looks like |
This does work, from what I can see. It's just difficult to tell, because This is how I checked:
|
To summarize, this thread brings up two points:
Is there anything else I've missed, or you're still having trouble with? |
Thanks for your reply. Sorry, it seems my original post was not clear. What I meant was that I was aware that the following code snippet does work for the single-threaded and multi-threaded schedulers, but not for multi-process schedulers, and probably not for distributed-memory schedulers either.

    pims.ImageIOReader.class_priority = 100  # we set this very high in order to force dask's imread() to use this reader [via pims.open()]
    rgb_frames = dask_image.imread.imread('/path/to/video/file.mpg')  # uses ImageIOReader
    rgb_frames.compute(scheduler='single-threaded')  # works
    rgb_frames.compute(scheduler='threading')  # works
    rgb_frames.compute(scheduler='processes')  # does not work
Yeah, the single-threaded and threaded schedulers share the same process memory. So if the priority is set in that process, that is sufficient: all workers view that same memory.

With the process scheduler, different processes have their own memory space and it isn't shared. Setting information in one does not necessarily get communicated to another, so one would need to do this during process startup. This is handled here.

The simplest solution is to just provide your own

An alternative solution would be to add some kind of

Distributed would likely have the same issue for the same reason. However, there are a lot more options there. For example, preload scripts would work.

If you are planning on doing process-based execution, I would suggest just using Distributed. It has a centralized scheduler, the ability to work with
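To make the "do this during process startup" point concrete, here is a minimal sketch that uses only the standard library. `FakeReader` is a hypothetical stand-in for a PIMS reader class, and the function names are made up for illustration; none of this is Dask or PIMS code. It uses the "spawn" start method, which as far as I know is also the default for Dask's process scheduler, so workers start from a clean interpreter.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor


class FakeReader:
    """Hypothetical stand-in for a PIMS reader class (not a real PIMS name)."""

    class_priority = 10


def raise_priority():
    """Pool initializer: runs once in each worker before it takes any task."""
    FakeReader.class_priority = 100


def read_priority():
    """Task: report the priority as seen by whichever process runs it."""
    return FakeReader.class_priority


def priority_seen_by_worker(initializer=None):
    """Start one freshly spawned worker and ask it for the priority."""
    # "spawn" starts workers from a clean interpreter, so nothing that was
    # set at runtime in the parent process is inherited.
    context = multiprocessing.get_context("spawn")
    with ProcessPoolExecutor(
        max_workers=1, mp_context=context, initializer=initializer
    ) as pool:
        return pool.submit(read_priority).result()


def compare():
    """Return (priority without initializer, priority with initializer)."""
    FakeReader.class_priority = 100  # changes the parent process only
    return priority_seen_by_worker(), priority_seen_by_worker(raise_priority)
```

Called from a script under the usual `if __name__ == "__main__":` guard, `compare()` should return `(10, 100)`: the worker without an initializer never sees the parent's change, while the worker with one applies the change itself at startup.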
Many thanks @jakirkham!

I followed your first suggestion since that was the easiest one to understand (as you guessed 😄). And it works (see the following code snippet)!

    import dask_image.imread  # import the submodule explicitly
    import pims


    def initialize_worker_process():
        """
        Initialize a worker process before running any tasks in it.
        """
        # If Numpy is already imported, presumably its random state was
        # inherited from the parent => re-seed it.
        import sys

        np = sys.modules.get("numpy")
        if np is not None:
            np.random.seed()

        # We increase the priority of ImageIOReader in order to force dask's
        # imread() to use this reader [via pims.open()]
        pims.ImageIOReader.class_priority = 100


    def get_pool_with_reader_priority_set(num_workers=None):
        import os
        from dask import config
        from dask.system import CPU_COUNT
        from dask.multiprocessing import get_context
        from concurrent.futures import ProcessPoolExecutor

        num_workers = num_workers or config.get("num_workers", None) or CPU_COUNT
        if os.environ.get("PYTHONHASHSEED") in (None, "0"):
            # This number is arbitrary; it was chosen to commemorate
            # https://github.com/dask/dask/issues/6640.
            os.environ["PYTHONHASHSEED"] = "6640"
        context = get_context()
        return ProcessPoolExecutor(
            num_workers, mp_context=context, initializer=initialize_worker_process
        )


    rgb_frames = dask_image.imread.imread('/path/to/video/file.mpg')
    rgb_frames.compute(scheduler='processes', pool=get_pool_with_reader_priority_set())  # uses ImageIOReader

I suppose a PR that helps the end-user avoid getting his/her hands dirty with the innards of multi-process scheduler technology would be a good idea. But before that, perhaps I should try
Would the following idea sit well with you? The idea is to add a new keyword argument, say so that we can later replace the following call currently within its body:

    pool = ProcessPoolExecutor(
        num_workers, mp_context=context, initializer=initialize_worker_process
    )

with:

    pool = ProcessPoolExecutor(
        num_workers, mp_context=context, initializer=initializer or initialize_worker_process
    )

This would enable the end-user to pass his/her own process initializer function to
That seems like a reasonable starting point. There may be a few things to firm up, but it is probably easier to discuss these in a PR. Would suggest sending a draft PR to Dask and we can go from there 🙂 |
Initializer customization was added in PR dask/dask#9087, which should be in the next Dask release.
cc: @jmdelahanty
Hi dask-image developers!
Normally an end-user may control which reader pims.open() uses to load images by simply increasing the .class_priority attribute of their preferred pims reader prior to calling pims.open(). See this link.

Since dask-image.imread() uses pims.open(), it would be great if it could mirror such functionality too.

And indeed this functionality does work for dask-image.imread() in single-machine schedulers, like "threading" and "sync". But I do not know of a way to make all processes, in a multi-process scheduler, for example, aware of the preferred reader's increased .class_priority. Any help here would be greatly appreciated.

Alternatively, it might be an idea to modify dask-image.imread() to receive a "reader" keyword argument which indicates the end-user's preferred PIMS reader.
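For readers unfamiliar with the priority mechanism, the following is a simplified sketch of priority-based reader selection. The reader classes, their extensions and `pick_reader` are all hypothetical, and the real pims.open() does more than this (as far as I understand, it tries candidates in priority order and moves on when one fails). The point is only that selection depends on a class attribute, so changing it changes the choice within the current process.

```python
class VideoReaderA:
    """Hypothetical reader; the name, priority and extensions are made up."""

    class_priority = 10

    @classmethod
    def class_exts(cls):
        return {"mp4", "mpg"}


class VideoReaderB:
    """Second hypothetical reader with a lower default priority."""

    class_priority = 8

    @classmethod
    def class_exts(cls):
        return {"mp4", "mpg", "avi"}


def pick_reader(filename, readers):
    """Return the highest-priority reader that claims the file extension."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    candidates = [r for r in readers if ext in r.class_exts()]
    if not candidates:
        raise ValueError(f"no reader registered for {filename!r}")
    return max(candidates, key=lambda r: r.class_priority)


readers = [VideoReaderA, VideoReaderB]
first_choice = pick_reader("clip.mp4", readers)   # VideoReaderA wins by default
VideoReaderB.class_priority = 100                  # bump the preferred reader
second_choice = pick_reader("clip.mp4", readers)  # now VideoReaderB wins
```

Because the bump is an ordinary attribute assignment, it only affects the process in which it runs, which is why worker processes need to repeat it.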