RFC: enable spot + fallback (+ multizone) for all tests #507

Closed
ntabris opened this issue Nov 9, 2022 · 11 comments · Fixed by #514

Comments

@ntabris
Member

ntabris commented Nov 9, 2022

For many tests, we use spot + fallback to on-demand. But not for all tests.

Here are the counts I see for the last seven days:

[screenshot: table of counts]

Any concern with enabling spot + fallback for all tests?

Also, I'd like to enable "multizone", which means we'd let AWS pick the best AZ (still in us-east-2). This does open the possibility of the scheduler and workers being in different AZs, which in my testing didn't appear to have a performance impact (and shouldn't have a cost impact, given that total scheduler <-> worker traffic is small).

Any concerns about that?

I wouldn't expect any of this to impact perf once a cluster is up (though I can't promise it won't), but it might have some (probably small) impact on the time to spin up clusters.
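
Roughly, the change amounts to passing something like this when the test fixtures create clusters. The option names under `backend_options` are from memory and may not match the fixtures exactly, so treat this as a sketch rather than the actual diff:

```python
# Sketch only: assumes Coiled's backend_options accepts these keys.
import coiled

cluster = coiled.Cluster(
    name="benchmark-spot-test",  # hypothetical cluster name
    n_workers=10,
    backend_options={
        "spot": True,                     # request spot instances
        "spot_on_demand_fallback": True,  # fall back to on-demand if spot capacity is unavailable
        "multizone": True,                # let AWS pick the AZ (still within us-east-2)
    },
)
```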

@phobson
Contributor

phobson commented Nov 9, 2022

This seems like a good move to me. IIUC, the longest benchmark tests are around 4 to 6 minutes (test_shuffle), and that's already on spot + fallback.

@fjetter
Member

fjetter commented Nov 11, 2022

I'm a bit on the fence about this proposed change since our CI is already pretty unreliable. I feel like spot + multizone introduces even more noise that would make failures more likely.

@ntabris
Member Author

ntabris commented Nov 11, 2022

I'm not expecting this to affect the reliability or perf of tests; if it did, we could revert. Is there a specific reason to be concerned it would affect things?

We're already using spot. My basic tests with the scheduler and workers in different zones didn't affect perf as far as I could see (the network between zones is very good), but I'm not completely confident that's true, and to some extent I want to use CI to make sure it holds at scale.
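
In case it's useful, here's roughly how one could spot-check AZ placement from a running cluster. This is just a sketch, not benchmark code: it assumes the cluster runs on EC2 with the instance metadata endpoint reachable (IMDSv1 shown for brevity), and `current_az` is a throwaway helper:

```python
# Sketch: report which AZ the scheduler and each worker landed in.
import requests
from distributed import Client


def current_az() -> str:
    # Query the EC2 instance metadata endpoint on whichever machine this runs on.
    return requests.get(
        "http://169.254.169.254/latest/meta-data/placement/availability-zone",
        timeout=1,
    ).text


client = Client(cluster)  # `cluster` as in the earlier sketch
scheduler_az = client.run_on_scheduler(current_az)
worker_azs = client.run(current_az)  # {worker address: AZ}
print(scheduler_az, set(worker_azs.values()))
```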

@dchudz
Collaborator

dchudz commented Nov 13, 2022

Echoing Nat:

You're already using spot. We expect multi-zone to make it easier to get instances (spot or otherwise).

The only anticipated possible negative change is putting the scheduler and workers in different AZs (but still in the same region!), which so far seems okay. Making the change will also help us understand better whether that's OK.

How about a quick PR that makes the change? Then if we don't see problems in the tests on that PR, go ahead?

@ntabris
Member Author

ntabris commented Nov 13, 2022

Quick PR is #514

@fjetter
Member

fjetter commented Nov 14, 2022

My bigger point is that we should have an experimental/staging setup for tests like this. Otherwise we can hardly tell if any change made a difference and it makes it harder for us to do other things. This is basically our "prod", and you wouldn't roll out experiments on prod either, would you?

@dchudz
Collaborator

dchudz commented Nov 14, 2022

If it's not, then I guess... You'd want to run both versions for a week or something? We can talk to James about getting that set up.

In the meantime, is there any way to make changes like this? Could looking at data from a PR run be good enough?

@dchudz
Collaborator

dchudz commented Nov 14, 2022

Also: the analogy seems bad. End users experience our prod.

These benchmarks are not experienced by end users. They're like our integration tests, not our prod.

@ntabris
Member Author

ntabris commented Nov 14, 2022

> Otherwise we can hardly tell if any change made a difference and it makes it harder for us to do other things

I think this is the main concern, and (just to be clear) I think it's a totally legitimate concern in general. I also don't think it's actually addressed by having a staging environment, since then you end up with multiple changes in your staging env (especially if verifying requires more than a single CI run) and you still have to do the work to determine the root cause of regressions.

I'm going to address that concern for this specific case, but if it's a major concern more generally, it could be worth having a larger conversation about it.

In this specific case:

  1. If there were an impact, we'd expect to see it across dask/distributed versions, whereas code changes should be isolated to upstream.
  2. I've done some one-off testing of multizone and didn't see any concerns.
  3. This is super easy to revert if we do become concerned and need to isolate the root cause of a regression.

@ncclementi
Contributor

I'm on the fence about multizone, since in multiple tests we read data from S3 buckets, and if the cluster is not in the same zone, this will damage performance and slow things down. I also think this will imply higher S3 costs, right?

@ntabris
Member Author

ntabris commented Nov 14, 2022

> if the cluster is not in the same zone, this will damage performance and slow things down. I also think this will imply higher S3 costs, right?

No, I think you're thinking about regions. Clusters would still all be in us-east-2, as are your S3 buckets.

(One can in theory use zone-specific S3 buckets, but I'm pretty sure we aren't.)
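
To illustrate the region-vs-zone distinction: a bucket's location always resolves to a region, never an AZ, so a cluster in any us-east-2 AZ reads the same bucket without cross-region transfer. Quick boto3 sketch (the bucket name is made up):

```python
# Sketch: S3 buckets are scoped to a region, not an AZ.
import boto3

s3 = boto3.client("s3")
resp = s3.get_bucket_location(Bucket="my-benchmark-bucket")  # hypothetical bucket name
print(resp["LocationConstraint"])  # e.g. "us-east-2" -- a region, not an AZ like "us-east-2a"
```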
