
[data] [streaming] Fixes to autoscaling actor pool streaming op #32023

Merged: 9 commits merged into ray-project:master on Jan 30, 2023

Conversation

@ericl (Contributor) commented on Jan 28, 2023

Signed-off-by: Eric Liang <ekhliang@gmail.com>

Why are these changes needed?

Fixes:

  • Properly wire max tasks per actor to pool
  • Account for internal queue size in scheduling algorithm
  • Small improvements to progress bar UX

TODO:

  • Improve unit tests
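The second fix above (accounting for the internal queue size in the scheduling algorithm) can be sketched in isolation. This is a minimal, hypothetical illustration only: the names `PoolState` and `should_scale_up` and the 0.8 utilization threshold are assumptions for exposition, not Ray's actual implementation.

```python
# Hypothetical sketch: an autoscaling decision that counts work sitting in the
# operator's internal queue, not just tasks already in flight. All names and
# the 0.8 threshold are illustrative assumptions, not Ray's code.
from dataclasses import dataclass


@dataclass
class PoolState:
    num_actors: int
    max_tasks_in_flight_per_actor: int  # wired through from the user config
    tasks_in_flight: int
    internal_queue_size: int            # bundles queued but not yet submitted


def should_scale_up(pool: PoolState, max_actors: int) -> bool:
    """Scale up when pending work (in flight + queued) would saturate the pool."""
    if pool.num_actors >= max_actors:
        return False
    capacity = pool.num_actors * pool.max_tasks_in_flight_per_actor
    # The fix being sketched: include the internal queue in "pending" work,
    # so queued bundles also drive the scale-up decision.
    pending = pool.tasks_in_flight + pool.internal_queue_size
    return pending >= 0.8 * capacity


# With 2 actors x 4 slots = 8 slots: 3 tasks in flight alone would not
# trigger scale-up, but 3 in flight plus 5 queued does.
busy = PoolState(num_actors=2, max_tasks_in_flight_per_actor=4,
                 tasks_in_flight=3, internal_queue_size=5)
print(should_scale_up(busy, max_actors=4))  # True
```

Ignoring the queue (the pre-fix behavior) would leave `pending` at 3 in the example above and never scale up, even while work piles up behind the pool.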

@ericl ericl changed the title [data] [streaming] Max tasks in flight arg not passed to autoscaling policy [WIP] [data] [streaming] Fixes to autoscaling actor pool streaming op Jan 28, 2023
@ericl ericl changed the title [WIP] [data] [streaming] Fixes to autoscaling actor pool streaming op [data] [streaming] Fixes to autoscaling actor pool streaming op Jan 28, 2023
@ericl ericl added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jan 28, 2023
@clarkzinzow (Contributor) left a comment:
LGTM overall, just a few questions about the tests

@@ -68,7 +52,7 @@ def test_build_streaming_topology(ray_start_10_cpus_shared):
     assert list(topo) == [o1, o2, o3]


-def test_disallow_non_unique_operators(ray_start_10_cpus_shared):
+def test_disallow_non_unique_operators():
@clarkzinzow (Contributor) commented on this diff:

Why is removing the ray_start_10_cpus_shared fixture necessary? Now the first test will implicitly start a cluster whose number of CPUs/workers depends on the machine it's running on, and that cluster will be implicitly reused for the rest of the tests in the module. If possible, we should use fixtures that set the exact number of CPUs and explicitly manage the lifecycle of the test clusters, to keep these tests deterministic across machines and refactorings.

@ericl (Contributor Author) replied:

I split this test file into pure unit vs integration tests. Hence, they shouldn't depend on Ray and we shouldn't need a Ray fixture.

In general, it seems strange to use a fixture we don't need. We should either fix the fixture or split the tests into separate files.

@clarkzinzow (Contributor) replied:

Ah, I see the intention! But these tests will still need to start a Ray cluster because of the ray.put calls for the input ref bundles and for putting the MapOperator transform function into the object store, right? There just won't be any tasks launched.

@ericl (Contributor Author) replied:

Ah, that's true. Maybe that's what was causing the mysterious pipeline hangs before? I saw those occasionally too, but they went away after the test split.

In any case, hopefully the puts will go away with the new logical backend.

python/ray/data/tests/test_streaming_integration.py (outdated):
with pytest.raises(ray.exceptions.RayTaskError):
    ray.data.range(6, parallelism=6).map(
        barrier3, compute=ray.data.ActorPoolStrategy(1, 2)
    ).take_all()
@clarkzinzow (Contributor) commented:
Nice test!
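The ActorPoolStrategy(1, 2) in the snippet above bounds the autoscaling pool between 1 and 2 actors. A minimal sketch of what such min/max clamping implies, with the class name and logic purely illustrative assumptions rather than Ray's implementation:

```python
# Illustrative sketch of min/max pool bounds, as in ActorPoolStrategy(1, 2).
# The BoundedActorPool class and its clamping logic are assumptions for
# exposition only, not Ray's actual code.
class BoundedActorPool:
    def __init__(self, min_size: int, max_size: int):
        assert 1 <= min_size <= max_size
        self.min_size = min_size
        self.max_size = max_size
        self.size = min_size  # start at the lower bound

    def resize(self, desired: int) -> int:
        """Clamp the autoscaler's desired size into [min_size, max_size]."""
        self.size = max(self.min_size, min(desired, self.max_size))
        return self.size


pool = BoundedActorPool(1, 2)
print(pool.resize(5))  # clamped to the max: 2
print(pool.resize(0))  # clamped to the min: 1
```

Under these bounds the test above can never get a third actor, which is what forces the third barrier task to fail rather than wait for more capacity.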

@clarkzinzow (Contributor) left a comment:
Not going to block on the testing nits

@ericl (Contributor Author) left a comment:
Updated.

@ericl ericl removed the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jan 30, 2023
@ericl ericl merged commit 96440cf into ray-project:master Jan 30, 2023
@zhe-thoughts zhe-thoughts linked an issue Jan 30, 2023 that may be closed by this pull request
clarng pushed a commit to clarng/ray that referenced this pull request Jan 31, 2023
…project#32023)

Fixes:
- Properly wire max tasks per actor to pool
- Account for internal queue size in scheduling algorithm
- Small improvements to progress bar UX
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Successfully merging this pull request may close these issues.

[Ray 2.3 release] air_benchmark_xgboost_cpu_10 fails
5 participants