[data] [streaming] Fixes to autoscaling actor pool streaming op #32023
Conversation
Signed-off-by: Eric Liang <ekhliang@gmail.com>
python/ray/data/_internal/execution/operators/actor_pool_map_operator.py
LGTM overall, just a few questions about the tests
@@ -68,7 +52,7 @@ def test_build_streaming_topology(ray_start_10_cpus_shared):
     assert list(topo) == [o1, o2, o3]


-def test_disallow_non_unique_operators(ray_start_10_cpus_shared):
+def test_disallow_non_unique_operators():
Why is removing the ray_start_10_cpus_shared fixture necessary? Now the first test will implicitly start a cluster whose number of CPUs/workers depends on the machine it's running on, and that cluster will then be implicitly reused by the rest of the tests in the module. If possible, we should use fixtures that set the exact number of CPUs and explicitly manage the lifecycle of the test clusters, to make these tests more deterministic across machines and refactorings.
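For illustration only (the fixture name and CPU count below are hypothetical, not something added in this PR), a deterministic fixture along the lines being suggested would pin the cluster size and manage its lifecycle explicitly:

import pytest
import ray


@pytest.fixture
def ray_start_4_cpus():
    # Pin the CPU count so the test does not depend on the host machine.
    ray.init(num_cpus=4)
    yield
    # Tear the cluster down so later tests cannot implicitly reuse it.
    ray.shutdown()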
I split this test file into pure unit vs integration tests. Hence, they shouldn't depend on Ray and we shouldn't need a Ray fixture.
In general, it seems strange to use a fixture we don't need. We should either fix the fixture or split the tests into separate files.
Ah I see the intention! But these tests will still need to start a Ray cluster because of the ray.put calls for the input ref bundles and for putting the MapOperator transform function into the object store, right? There just won't be any tasks launched.
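To make that concrete, here is a minimal sketch (the toy block contents are made up) of why building inputs alone is enough to spin up Ray:

import ray

# Putting a block into the object store auto-initializes a local Ray
# instance if none is running yet; no tasks need to be launched.
block = [{"id": 0}]
ref = ray.put(block)

assert ray.is_initialized()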
Ah, that's true. Maybe that's what was causing the mysterious pipeline hangs before? I also saw those sometimes, but they went away with the test split.
In any case, hopefully the puts will go away with the new logical backend.
with pytest.raises(ray.exceptions.RayTaskError):
    ray.data.range(6, parallelism=6).map(
        barrier3, compute=ray.data.ActorPoolStrategy(1, 2)
    ).take_all()
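For context (this is a reading of the snippet, not something stated in the diff): ray.data.ActorPoolStrategy(1, 2) requests an autoscaling pool of at least one and at most two actors, and parallelism=6 produces six input blocks, so the map has more blocks than actors; barrier3 is a helper defined elsewhere in the test file.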
Nice test!
Signed-off-by: Eric Liang <ekhliang@gmail.com>
Not going to block on the testing nits
Updated.
…project#32023)

Fixes:
- Properly wire max tasks per actor to pool
- Account for internal queue size in scheduling algorithm
- Small improvements to progress bar UX

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Eric Liang <ekhliang@gmail.com>
Why are these changes needed?
Fixes:
- Properly wire max tasks per actor to the pool
- Account for internal queue size in the scheduling algorithm (see the sketch below)
- Small improvements to progress bar UX
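As an illustration of the second item, a minimal sketch of what accounting for the internal queue in an autoscaling decision can look like; this is a simplified model with hypothetical names, not the actual ActorPoolMapOperator code:

def should_scale_up(
    num_active_tasks: int,
    internal_queue_size: int,
    pool_size: int,
    max_tasks_in_flight_per_actor: int,
) -> bool:
    # Demand counts both the tasks already submitted and the inputs still
    # queued inside the operator, not just the submitted tasks.
    total_demand = num_active_tasks + internal_queue_size
    # Capacity is bounded by the per-actor task cap wired through to the pool.
    capacity = pool_size * max_tasks_in_flight_per_actor
    return total_demand > capacity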
TODO: