Locks and chunked storage #5083
I don't know much about the store function (you're right that @jakirkham is maybe the right person there). But on the lock side, you should be able to give the …
Is the chunk structure of your dask array the same as the chunk structure of your zarr array? If so, shouldn't each chunk only be written exactly once? Why would you need to prevent multiple tasks from writing to the same chunk if there is only ever one task per chunk? Having the chunk structures aligned generally seems like a good idea to me. But if they're not, what about https://zarr.readthedocs.io/en/stable/tutorial.html#parallel-computing-and-synchronization? I bet it would be only a few lines to write something equivalent to a …
Ideally, yes; typically, no. My dask arrays are really big (TB), so small chunks would be very inefficient for dask. But for storage (e.g., zarr), small chunks are ideal because we want to visualize the data with a chunk-aware visualization tool (neuroglancer). So I aim for a one-to-many relationship between dask chunks and zarr chunks, e.g. 512 x 512 x 512 on the dask side and 64 x 64 x 64 on the zarr side for uint8 data. This is all fine until I do something in dask that makes my chunks shrink (e.g., downsampling to build a multiresolution pyramid). If I downsample my 512-sized chunk by half 4 times, I end up with a 32-sized chunk, and that cannot be safely written to my 64-sized zarr chunks. Rechunking is a natural solution, but this moves data between workers and I would prefer to avoid this if possible. So I'm looking at locking as another approach. Regarding the synchronizers in the …
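A quick sketch of that mismatch (the shapes, number of downscaling steps, and block-averaging downscale below are illustrative placeholders, not the actual pipeline):

```python
import dask.array as da
import numpy as np

# Dask chunks start at 512^3, while the zarr store uses 64^3 chunks.
arr = da.zeros((4096, 4096, 4096), dtype="uint8", chunks=(512, 512, 512))

# Downsample by 2 along each axis, four times (block averaging as a stand-in
# for whatever downscaling function is actually used).
for _ in range(4):
    arr = da.coarsen(np.mean, arr, {0: 2, 1: 2, 2: 2}).astype("uint8")

print(arr.chunksize)  # (32, 32, 32): smaller than the 64^3 zarr chunks,
                      # so writes are no longer chunk-aligned.
```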
You might want to try this again with the latest distributed version now that #4967 has been released. Some of these "dask is moving data around a lot" situations should happen a lot less.
Sorry, I was suggesting writing your own class with the same interface as …:

```python
import distributed


class DistributedSynchronizer:
    """Provides synchronization using locks from distributed.

    Parameters
    ----------
    prefix : string
        Prefix the names of all dask.distributed locks with this.
    """

    def __init__(self, prefix: str = ""):
        self.prefix = prefix

    def __getitem__(self, item):
        return distributed.Lock(self.prefix + str(item))
```

and then using it the same way as zarr's built-in synchronizers:

```python
>>> synchronizer = DistributedSynchronizer()
>>> z = zarr.open_array('data/example', mode='w', shape=(10000, 10000),
...                     chunks=(1000, 1000), dtype='i4',
...                     synchronizer=synchronizer)
```
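If that works, wiring it into the store step might look roughly like the following sketch; it assumes the intent is to disable dask's own coarse lock and let zarr's per-chunk synchronizer handle contention (array shapes and paths are made up, and `DistributedSynchronizer` is the class sketched above):

```python
import dask.array as da
import zarr

# Uses the DistributedSynchronizer class defined above; requires a running
# distributed client so the Lock objects have a scheduler to talk to.
synchronizer = DistributedSynchronizer()
z = zarr.open_array('data/example', mode='w', shape=(10000, 10000),
                    chunks=(1000, 1000), dtype='i4',
                    synchronizer=synchronizer)

arr = da.random.random((10000, 10000), chunks=(2500, 2500)).astype('i4')

# lock=False: zarr's synchronizer now guards each storage chunk, so dask does
# not also need a single global lock around the whole target.
da.store(arr, z, lock=False)
```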
Yeah, big +1 on this. I think that the recent memory scheduling things may have had a large positive impact on rechunking.
I would have expected a vanilla lock to work:

```python
with dask.distributed.Lock(chunk_id):
    # write chunk
```
The vanilla lock works in the special case where each worker needs to lock just one chunk. But in the more general case, when a worker attempts to write a region that spans multiple chunks, that worker must iteratively lock every chunk file spanned by the region, which introduces a potential deadlock due to the iteration. I suspect `MultiLock` addresses this (but I could be wrong).
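For reference, a minimal sketch of how `distributed.MultiLock` could cover the multi-chunk case, assuming it requests all of the named locks in a single operation rather than one at a time (the `chunks_spanned_by` helper is hypothetical, standing in for whatever maps a region to zarr chunk keys):

```python
from distributed import MultiLock


def write_region(region_data, zarr_array, region_slices):
    # Hypothetical helper that lists the zarr chunk keys touched by the
    # region, e.g. ["0.0.0", "0.0.1", ...].
    chunk_keys = chunks_spanned_by(zarr_array, region_slices)

    # Acquiring every lock through one MultiLock avoids the deadlock that can
    # occur when two workers grab overlapping chunk locks in different orders.
    with MultiLock(names=chunk_keys):
        zarr_array[region_slices] = region_data
```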
Ah, got it. That makes sense.
And I should add that I really like the idea of using the existing synchronizer API in zarr. I will look into implementing that.
I do agree with @gjoseph92 though that rechunking seems maybe better. I would be curious to learn more about what is pushing you away from that path.
I think the code I posted handles these issues, since I believe that's what the zarr synchronization API is all about. It looks like you just have to tell zarr how to get the lock instance to use for a particular chunk of the dataset, and it manages the rest.
Ideally, I would probably try to …
I hope this illustrates why I'm trying to avoid rechunking.

I'm working with task graphs that look like this. A big (500 MB) piece of data enters at the bottom, then gets downscaled and saved recursively. The input chunks need to be big because I have a *lot* of them. I'm working on datasets that exceed 10 TB, so chunks need to be large to avoid painful task graph creation times.

[image: https://user-images.githubusercontent.com/3805136/126364246-e2eab385-a9e6-403d-b67a-342004f23e90.png]

Each downscaling step reduces the chunk size, which is bad when the dask chunk size dips below my zarr chunk size. Now my workers can't do chunk-aligned writes. If I solve this problem with rechunking, the above task graph transforms to this:

[image: https://user-images.githubusercontent.com/3805136/126364465-5298c077-3722-4a2f-975e-c763bbbcdf62.png]

As you can see, rechunking introduces data dependencies between the end leaves of the task graph and breaks the nice parallelism we had before. I suspect this is the cause of some memory struggles my workers are experiencing (and these issues persisted and/or got worse on the latest version of dask).

I suspect that locking zarr chunks lets me keep the first graph. Although locking zarr chunks would break parallelism, it should do so in a way that only increases runtime and not memory use. Hence my interest.
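For concreteness, a sketch of the rechunking variant that produces the second graph (the shapes and block-averaging downscale are placeholders, not the real pipeline):

```python
import dask.array as da
import numpy as np

ZARR_CHUNK = 64
level = da.zeros((4096, 4096, 4096), dtype="uint8", chunks=(512, 512, 512))

pyramid = [level]
for _ in range(4):
    level = da.coarsen(np.mean, level, {0: 2, 1: 2, 2: 2}).astype("uint8")
    if level.chunksize[0] < ZARR_CHUNK:
        # The second graph: rechunk back up to a multiple of the zarr chunk
        # size so writes are chunk-aligned. This rechunk is what introduces
        # the cross-worker dependencies between the leaves of the graph.
        level = level.rechunk((ZARR_CHUNK,) * 3)
    pyramid.append(level)
```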
You're right that rechunking will introduce transfers which might cause memory issues. Recent versions of dask (2021-07 or later) greatly minimize the effect of these issues.
I found that my memory issues got worse, not better, after …
Have you tried …
If it's not too much trouble, I'd love to see a before/after performance report for …

It's true that it's a bit silly to move the data between workers just for the purpose of concatenating and storing it. Having workers synchronize writing the data they already have is simpler in many ways. I think we just all want this rechunking to perform better, so we're eager to understand why it's not working 😄
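For what it's worth, a minimal sketch of capturing such a report with `distributed.performance_report` (the computation inside the block is just a stand-in for the real pyramid write):

```python
from dask.distributed import Client, performance_report
import dask.array as da

client = Client()  # or connect to the existing cluster

# Everything computed inside this block is profiled and written to the
# given HTML file, which can then be shared for comparison across versions.
with performance_report(filename="rechunk-store-report.html"):
    x = da.random.random((20000, 20000), chunks=(2000, 2000))
    x.rechunk((5000, 5000)).sum().compute()
```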
It's not too much trouble; working on it now (however, for some reason the performance report doesn't render correctly for me on 2021.07.0).
xref: #5097
Performance report for …

In lieu of the report, I can tell you that the computation run on …
Dang. The performance report for 06 looks pretty good to me, so I'd really like to see what happened with 07. Are these both with or without a final rechunk? Does that spilling problem happen just at the end, or throughout? A few screenshots of the dashboard might even help until we figure out the performance report issue.
Spilling typically happens towards the middle of the computation. Also, I should note that sometimes 2021.07.0 gets lucky and completes the computation. So I strongly suspect that there's something stochastic going on with task ordering, and that the changes in the latest version of distributed bias it towards a sub-optimal task order.

Each layer of rechunking groups the input tasks into clusters, and the clusters get wider (i.e., depend on more input tasks) as the rechunking requires more data convergence. So the ideal task ordering would be to schedule tasks such that these "rechunking clusters" are completed sequentially. Somehow the latest version of distributed is not finding this ideal task ordering, whereas older versions did OK.

I will try to make a reproducer that doesn't depend on my own code / data. And I will try …
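If it helps narrow things down, one way to inspect the static ordering side of this (as opposed to the scheduler's dynamic worker assignment) is to color a small stand-in graph by priority, something like this sketch (assumes graphviz is installed; the toy array and downscale below are placeholders):

```python
import dask.array as da
import numpy as np

x = da.zeros((512, 512), dtype="uint8", chunks=(64, 64))
small = da.coarsen(np.mean, x, {0: 2, 1: 2}).astype("uint8").rechunk((64, 64))

# Color nodes by the static priority that dask.order assigns. If this ordering
# looks sensible, the regression is more likely in worker selection than in
# the priorities themselves.
small.visualize(color="order", filename="order.svg")
```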
Thank you for your engagement here, Davis. This helps a lot.
Reproducer: https://gist.github.com/d-v-b/f4cf44e42f4e9d9cfa2d109d2ad26500

Performance reports: …
Brief update: @d-v-b I've run your reproducer locally and I'm seeing the same thing. I was also able to get non-broken performance reports: 2021.06.2, 2021.07.0. Downscaling to …

[videos: 2021.06.2.mp4, 2021.07.0.mp4]

In the new version, you see two instances where a full vertical bar of tasks is red (in memory). This didn't happen before. This is probably what's blowing out the memory on the cluster, since the whole dataset is briefly fully in memory.

Colored by worker assignment

I hacked up the task visualization to color tasks by worker (instead of status): 641f117. Note that tasks are already vertically ordered by priority.

[videos: 2021.06.2-workercolor-larger.mp4, 2021.07.0-workercolor-larger.mp4]

We can see that the worker assignment pattern is very different at the start (exactly as we'd expect). I'm not sure yet why this is happening. Because there's so much data shuffling involved, this is probably not a workload that benefits from running sibling tasks on the same worker. The question is why the task ordering feels different, when the changes in 2021.07.0 don't affect scheduling order at all, just worker selection.

The old worker selection assigned tasks such that it was almost guaranteed that two root tasks with a common downstream task would be on different workers, leading to lots of transfers. My vague feeling is that maybe those transfers were acting as a sort of backpressure, which slowed down the …
@d-v-b as I mentioned, this seems to be completely fixed by #5113: 222 seconds with that PR vs 219 seconds on 2021.06.2 on my machine. The scheduling pattern even looks different; with 2021.07.0 it would blast through all the …

I'm a little surprised by this, since #5113 should be purely an optimization; it doesn't actually change the logic. Why did the scheduler being under load from those unnecessary …

Maybe the scheduler was too busy to tell workers to release keys, because either incoming or outgoing comms (or both) were blocked up by all this CPU time on the event loop? From the dashboard, I saw that all workers on 2021.07.0 had lots of tasks assigned (~800 per worker). So even with an absentee scheduler, they never ran out of things to do and kept piling up results, but rarely got the message to release these completed keys from memory?
While looking into #5083 I happened to notice that the dashboard felt very sluggish. I profiled with py-spy and discovered that the scheduler was spending 20% of runtime calculating `sum(map(len, group._dependencies)) < 5`! A quick print statement showed some task groups depended on 25,728 other groups (each of size 1). We can easily skip those. I originally had this conditional in #4967 but we removed it for simplicity: #4967 (comment); turns out it was relevant after all!
Great detective work @gjoseph92, thanks for tracking this down. I still need to test the timing / cluster load with the zarr synchronizer API + distributed lock idea, so maybe we keep this open until I have a chance to do that.
OK, this can be closed. I tested chunk locking via the zarr synchronizer API vs rechunking in one of my typical workloads. I observed that the computation time was about the same for both, but there was much less transfer time with chunk locking, as expected. I do wonder if the docstring for …
Thanks Davis! 😄 Maybe we should open a new Dask issue with the documentation suggestions?
`dask.array.store` takes an optional argument, `lock`, which (to my understanding) avoids write contention by forcing workers to request access to the write target before writing. But for chunked storage like zarr arrays, write contention happens at the level of individual chunks, not the entire array. So perhaps a lock for chunked writes should have the granularity of the chunk structure of the storage target, thereby allowing the scheduler to control access to individual chunks. Does this make sense?

The context for this question is my attempt to optimize storing multiple enormous dask arrays in zarr containers with a minimum amount of rechunking, which makes the locking mechanism attractive, as long as the lock happens at the right place.
cc @jakirkham in case you have ideas about this.
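For reference, a rough sketch contrasting the current coarse-grained lock with the chunk-level granularity being asked about (paths, shapes, and the lock name are made up; assumes a running distributed client):

```python
import dask.array as da
import zarr
from distributed import Lock

arr = da.random.random((10000, 10000), chunks=(1000, 1000))
z = zarr.open_array("data/example.zarr", mode="w", shape=arr.shape,
                    chunks=(500, 500), dtype=arr.dtype)

# Today: one lock guarding the entire target, serializing every chunk write.
da.store(arr, z, lock=Lock("example-array"))

# The idea in this issue: a lock per *zarr chunk*, so only writes that touch
# the same storage chunk have to wait on each other.
```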