-
Notifications
You must be signed in to change notification settings - Fork 7k
[data] Unify bundle queue abstraction #59093
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[data] Unify bundle queue abstraction #59093
Conversation
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a BaseBundleQueue abstraction to unify the various queue implementations within Ray Data operators. This is a significant and valuable refactoring that simplifies operator design and centralizes queue metrics. The overall approach is solid, and the new queue implementations in fifo.py, hash_link.py, and ordered.py are well-designed.
However, I've found several issues with the implementation that need to be addressed:
- There are a few critical correctness bugs, including a missing import in
bundle_queue/__init__.pyand incorrect buffer handling inOutputSplitter. - The rebundling queues (
RebundleQueueandExactRebundleQueue) have issues: one violates thepeek_nextcontract by having side effects, and the other has a metric leak inget_next_with_original. - The transition to queue-based metrics is incomplete in
MapOperator, leading to redundant and confusing metric tracking. - There are also some minor issues with incorrect type hints and misleading docstrings.
I've left detailed comments on each of these points. Addressing them will ensure the new abstraction is robust and correctly implemented.
python/ray/data/_internal/execution/bundle_queue/bundler_exact.py
Outdated
Show resolved
Hide resolved
python/ray/data/_internal/execution/operators/output_splitter.py
Outdated
Show resolved
Hide resolved
python/ray/data/_internal/execution/operators/actor_pool_map_operator.py
Outdated
Show resolved
Hide resolved
python/ray/data/_internal/execution/bundle_queue/bundler_exact.py
Outdated
Show resolved
Hide resolved
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
omatthew98
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly left comments surrounding the high level method names / APIs. I think solid start but some opportunities for tightening the naming up since we are already doing a big shake up. LMK if you need another review.
| @@ -1,75 +0,0 @@ | |||
| import abc | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to fully scrap BundleQueue in favor of BaseBundleQueue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, it's mostly a rename into the base.py (i'm gonna make clear what is renamed and what is not after this comment), the only difference is that the base also contains a method called done_adding_bundles. This method is used for BlockRefBundler, StreamingRepartitionRefBunlder, and OrderedOutputQueue
| """Protocol for storing bundles AND supporting remove(bundle) | ||
| and contains(bundle) operations quickly.""" | ||
|
|
||
| def __contains__(self, bundle: RefBundle) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is __contains__ not supported for the other queues? Should we maybe specify in the comment that this should both implement the method and be efficient or there might be degraded performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea it can be supported for most queues, but I opted to not move it into the base implementation because
- our existing code infrequently uses
__contains__ BlockRefBundlerandStreamingRefBundlerhave weird semantics for__contains__, because one moment the bundle is there, and the next it's gone due to "rebundling".
On that note of bullet # 2, rather than haveSupportsRebundlingbe a protocol, we can have it becomeSupportsRebundling(BaseBundleQueue), which only supports rebundling, and won't contain a__contains__method. Haven't thought too deeply into this approach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok yeah the current approach seems reasonable to me then. I think fine to keep the protocols as is rather than creating a more piecemeal approach where we have very tiny protocols.
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
2ddb01b to
1a58f57
Compare
| from ray.data._internal.execution.interfaces import RefBundle | ||
|
|
||
|
|
||
| class RebundleQueue(BaseBundleQueue, SupportsRebundling): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class is moved from map_operator.py. However, I augmented this class with peek.*, get_next_with_original, and clear methods below. The previous done_adding_bundles has been renamed to finalized
|
|
||
|
|
||
| class StreamingRepartitionRefBundler(BaseRefBundler): | ||
| class ExactRebundleQueue(BaseBundleQueue, SupportsRebundling): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class was also moved from streaming_repartition.py. However, I augmented this class with get_next_with_original and clear methods below. The previous done_adding_bundles has been renamed to finalized
| from ray.data._internal.execution.interfaces import RefBundle | ||
|
|
||
|
|
||
| class FIFOBundleQueue(BaseBundleQueue, SupportsDeque): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class is basically a wrapper around deque([]) of ref bundles. It is NOT the original FIFOBundleQueue. The original FIFOBundleQueue has been renamed to HashLinkedQueue
|
|
||
| class FIFOBundleQueue(BundleQueue): | ||
| """A bundle queue that follows a first-in-first-out policy.""" | ||
| class HashLinkedQueue(BaseBundleQueue, SupportsIndexing, SupportsDeque): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed from FIFOBundleQueue. I added peek_last, get_last, and add_to_front methods, as specified by the SupportsDeque protocol.
| from ray.data._internal.execution.interfaces import RefBundle | ||
|
|
||
|
|
||
| class OrderedBundleQueue(BaseBundleQueue): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class was moved from map_operator.py
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
6c2d56b to
6c2e17b
Compare
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
omatthew98
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for churning those comments, one minor question but looks super solid.
| that the # of blocks, rows, and bytes must be in remain unchanged | ||
| before and after this method call. | ||
| If queue.has_next() == False, return `None`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should raise a ValueError to differentiate between head of the queue is None and queue is empty like get_next? Not sure if this would be a problem but if we guard both get_next and peek_next with has_next that seems sane and consistent to me?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
between head of the queue is None and queue is empty like get_next?
Wait, am i correct in understanding that some bundles are flat out None? I think ideally I would only store non None bundles.
if we guard both get_next and peek_next with has_next that seems sane and consistent to me?
I'm not entirely following, are you referring to a specific sub class? These are the semantics in my head
- has_next()==True =>
get_next()andpeek_next()areRefBundle - has_next()==False =>
get_next()raise error, andpeek_next()isNone - it's not possible for
get_next()to returnNone
Description
Currently we support many types of bundle queues inside operators:
BlockRefBundlerFIFOBundleQueueUnorderedOutputQueueOrderedOutputQueueList[RefBundle]Deque[RefBundle]StreamingRepartitionRefBundlerAll these independent classes makes it very awkward to record queue metrics within operators. In fact, operators themselves have to manage it themselves. This PR introduces a base class called
BaseBundleQueuethat hopes to unify all the independent classes together.With this approach, I hope to
InternalQueueOperatorMixin, that will automatically get size of queues, as well as clearing queues for free in the mixinBaseBundleQueuethat storesRefBundles.Future Work
This PR is a stepping stone to completely refractor how metrics are collected, stored, and exported. I'd like to refractor
OpRuntimeMetricsbecause it assumes that all operators areMapOperators, which isn't true. After this PR, I have a few proposals to refractorOpRuntimeMetricsBaseOpMetricsclass, so that individual operators can overrideBaseOpMetrics. For operators with internal queues, we can automatically hook inQueueOpMetrics(BaseOpMetrics)too_StatsActordirectly through a something like aMetricsRegistryinterface. This will require a lot of work, and I will have to repurpose the_StatsActorto read from that Registry._StatsActorand_StatsManager, but I'm not sure if that is very clean.Update on 12/2
Here is a summary of extensions

Tests
TBD
Related issues
None
Additional information
None