
Rewrite of P2P control flow #7268

Merged · 26 commits · Nov 10, 2022

Conversation

@fjetter (Member) commented Nov 8, 2022

This builds on #7195, which I was not able to finish because I ran into concurrency issues caused by leaky tests. In particular, the global memory_limits on the MultiFile and MultiComm classes caused problems.

I went ahead and rewrote major parts of the concurrency model:

  • MultiComm and MultiFile are now called CommShardsBuffer and DiskShardsBuffer, respectively.
  • Both the disk and comm buffers inherit from a common base class ShardsBuffer that implements the concurrency control and limiting for both classes.
  • Both buffers operate on the event loop, so there is no longer any need for threading synchronization primitives.
  • To enable this, Shuffle needed to be rewritten slightly and now uses another offload (specifically in Shuffle.add_partitions). We're still doing all the compute on a thread, but it's a different thread now; the actual worker thread is mostly idle.
  • There is no need for queues to control concurrency anymore. Instead, the buffers start N coroutines that run until the buffer is closed.
  • Memory limiting is synchronized with a new ResourceLimiter primitive (see the sketch after this list). This is a fairly simple class that lets one acquire as much of a given resource as one wants, but allows us to wait until the acquired amount drops back below a specified level. This represents our buffer usage pretty well: blocking on acquire is not applicable since the data is already in memory, but after acquiring we want to wait for memory usage to calm down again before accepting more. This is effectively the same backpressure protocol as on main. It also allows for more flexible control over the buffer sizes (e.g. the comm and disk buffers could share the same limit if we wanted to) and for much safer testing by no longer leaking global state.
  • Last but not least, concurrency control is mostly implemented using Conditions instead of polling. That may subjectively be a bit harder to read than the polling that was there before, but it allows for significantly faster test runtimes.
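
To make the ResourceLimiter semantics concrete, here is a minimal sketch (illustrative only; the method names and signatures below are assumptions, not the exact API introduced in this PR):

```python
import asyncio


class ResourceLimiter:
    """Sketch of the backpressure primitive described above.

    ``acquire`` never blocks (the memory is already allocated by the time we
    account for it), but ``wait_for_available`` lets a producer pause until
    enough has been released to drop back below the configured limit.
    """

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._acquired = 0
        self._condition = asyncio.Condition()

    def acquire(self, nbytes: int) -> None:
        # Account for the memory unconditionally; never block here.
        self._acquired += nbytes

    async def release(self, nbytes: int) -> None:
        # Called by the background flush coroutines once shards have been
        # written to disk or sent over the comm.
        async with self._condition:
            self._acquired -= nbytes
            self._condition.notify_all()

    async def wait_for_available(self) -> None:
        # Backpressure: wait until usage has dropped below the limit again.
        async with self._condition:
            await self._condition.wait_for(lambda: self._acquired < self._limit)


async def put_shards(limiter: ResourceLimiter, nbytes: int) -> None:
    # Producer-side protocol (hypothetical helper): account first, hand the
    # shards to the buffer's background coroutines, then wait before
    # accepting more input.
    limiter.acquire(nbytes)
    # ... append shards to the buffer here ...
    await limiter.wait_for_available()
```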

I haven't touched serialization, worker dispatching, etc. This is all merely about control flow and responsibility. In follow-up PRs I intend to modify actual behavior (e.g. failing when workers leave).

cc @hendrikmakait @mrocklin

@github-actions bot (Contributor) commented Nov 8, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0 · 15 suites ±0 · 6h 30m 0s ⏱️ (−5m 44s)

|       | total         | passed ✔️     | skipped 💤 | failed ❌ |
|-------|---------------|---------------|------------|-----------|
| tests | 3 213 (+44)   | 3 128 (+44)   | 83 (−1)    | 2 (+1)    |
| runs  | 23 758 (+310) | 22 856 (+314) | 900 (−5)   | 2 (+1)    |

For more details on these failures, see this check.

Results for commit d5320b9. ± Comparison against base commit 69a5709.

♻️ This comment has been updated with latest results.

@mrocklin (Member) commented Nov 8, 2022

As a heads-up, I may not have time to review this during the next couple of days (prepping presentations, travelling, delivering presentations). The opening comment seems not-scary to me, though. I wouldn't block on my review if you're feeling confident.

@hendrikmakait hendrikmakait self-requested a review November 8, 2022 15:45
@fjetter fjetter self-assigned this Nov 8, 2022
@hendrikmakait (Member) left a review comment

Generally, I like this change; the new separation of concerns looks like a great improvement. I have a few questions about minor parts of this PR, as well as a number of small nits and suggestions.

sizes: defaultdict[str, int]
_exception: None | Exception

_queues: weakref.WeakKeyDictionary = weakref.WeakKeyDictionary()
@hendrikmakait (Member): From what I understand, we do not need this anymore.

@property
def _instances(self) -> set:
@hendrikmakait (Member): Should we implement this or mark as # TODO?

@fjetter (Member, Author): Dead code. This should also be removed.

return {
"memory": self.bytes_memory,
"total": self.bytes_total,
"buckets": len(self.shards),
@hendrikmakait (Member): Are self.shards shards or buckets?

@fjetter (Member, Author): shards is a mapping from buckets to shards; buckets are basically the partition IDs.
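
For illustration, the structure being discussed is roughly the following (hypothetical values; the real shard type is whatever the buffer stores):

```python
# Keys are buckets (output partition IDs), values are the shards collected
# for that partition so far.
shards: dict[str, list[bytes]] = {
    "0": [b"<shard>", b"<shard>"],
    "1": [b"<shard>"],
}

# The "buckets" metric above is therefore the number of keys,
# not the number of individual shards.
assert len(shards) == 2
```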

@hendrikmakait (Member): Let's maybe rename self.shards to self.bucketed_shards or self.partitioned_shards to highlight that.

@fjetter (Member, Author): IMO this is OK. We're not naming every single mapping as key_value; context is typically sufficient to infer this. In this case I think the shards mapping is fine, and a more verbose attribute name would make the code harder to read.

@hendrikmakait (Member): Together with the new docstrings I agree this is now fine.

self.bytes_memory -= size

async def _process(self, id: str, shards: list[ShardType]) -> None:
raise NotImplementedError()
@hendrikmakait (Member): For my understanding: Do we (not) want to utilize abc.ABC to mark ShardsBuffer as an abstract base class?

@fjetter (Member, Author): What would be the benefit of this?

@hendrikmakait (Member): Marking ShardsBuffer as an abc.ABC and decorating _process with @abc.abstractmethod feels mildly cleaner, since ShardsBuffer should not be instantiated and any concrete subclass needs to implement _process [1]. This would allow mypy to catch errors such as trying to instantiate ShardsBuffer or a subclass that did not implement _process. Similarly, Python would throw somewhat more informative errors at runtime.

I guess this is mainly a question about conventions: if a class is abstract, should it always be marked as such, or do we not care about that? If the latter, why not? ("We don't see any value in this case" is a perfectly valid answer.)

Feel free to shut this discussion down; the case at hand revolves around an abstract class that will have two concrete subclasses for the foreseeable future and lives in a private module, so I'm mostly interested in whether we have conventions/guidelines around marking ABCs, not in splitting hairs about this specific case.

[1] read not so much, which leads me off-trail to wondering whether read belongs in this class hierarchy at all, but that's a fringe problem that requires more changes and I don't want to deal with it right now.
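
For reference, a minimal sketch of what this suggestion would look like (illustrative only; ShardType is a stand-in here, not the module's real alias):

```python
import abc

ShardType = bytes  # stand-in for the module's actual shard type


class ShardsBuffer(abc.ABC):
    @abc.abstractmethod
    async def _process(self, id: str, shards: list[ShardType]) -> None:
        """Flush one bucket of shards; every concrete buffer must implement this."""


class DiskShardsBuffer(ShardsBuffer):
    async def _process(self, id: str, shards: list[ShardType]) -> None:
        ...  # write the shards for bucket `id` to disk


# Instantiating the base class (or a subclass missing _process) is now
# rejected by mypy and fails at runtime:
# ShardsBuffer()  # TypeError: Can't instantiate abstract class ShardsBuffer ...
```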

@fjetter (Member, Author): This is a first shot. These classes will still change significantly over time, and I'm not worried about anybody inheriting from this class since it's private.
If this is something we want to do, we can do it later; this PR is definitely too big to have this conversation in, and I don't want to block on it.

async def shuffle_inputs_done(self, comm: object, shuffle_id: ShuffleId) -> None:
await shuffle.receive(data)

async def shuffle_inputs_done(self, shuffle_id: ShuffleId) -> None:
"""
Hander: Inform the extension that all input partitions have been handed off to extensions.
@hendrikmakait (Member): Driveby:

Suggested change:
- Hander: Inform the extension that all input partitions have been handed off to extensions.
+ Handler: Inform the extension that all input partitions have been handed off to extensions.

(same in L327)


worker_for_mapping = {}

for part in range(npartitions):
@hendrikmakait (Member): Nit for readability:

Suggested change:
- for part in range(npartitions):
+ for output_partition in range(npartitions):

(requires similar adjustments below)

total_bytes_recvd += metrics["disk"]["total"]
total_bytes_recvd_shuffle += s.total_recvd

assert total_bytes_recvd_shuffle == total_bytes_sent
@hendrikmakait (Member):

Suggested change:
- assert total_bytes_recvd_shuffle == total_bytes_sent
+ assert total_bytes_recvd_shuffle == total_bytes_sent == total_bytes_recvd

From what I understand, we probably want to test that as well.

@fjetter (Member, Author): These are not identical: there appears to be a small drift between sizes when comparing pyarrow buffers and bytes directly, so what the comms measure is slightly different from what is actually received. This is very ugly, but I'm not willing to fix it right now; refactoring the serialization parts is out of scope.
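
A rough illustration of how the two counts can legitimately diverge (hypothetical example; whether framing metadata and alignment padding are the actual source of the drift here is an assumption):

```python
import pyarrow as pa

table = pa.table({"x": [1, 2, 3], "y": ["a", "bb", "ccc"]})

# Size as reported by the Arrow data structures themselves
logical_nbytes = table.nbytes

# Size as seen on the wire after IPC serialization (includes schema,
# framing metadata, and alignment padding)
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
wire_nbytes = sink.getvalue().size

print(logical_nbytes, wire_nbytes)  # the two numbers generally differ
```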

@hendrikmakait (Member):

Let's highlight this in the test then or drop total_bytes_recvd altogether. Right now it looks like we just forgot to do something useful with total_bytes_recvd.

Comment on lines 39 to 40
queue: asyncio.Queue
A queue holding tokens used to limit concurrency
@hendrikmakait (Member):
Suggested change:
- queue: asyncio.Queue
-     A queue holding tokens used to limit concurrency




@hendrikmakait (Member) left a review comment

Thanks for adding the additional documentation, @fjetter. This looks good to me; my remaining questions/nits can be safely ignored.

