RFC: enable spot + fallback (+ multizone) for all tests #507

Closed
ntabris opened this issue Nov 9, 2022 · 11 comments · Fixed by #514

Comments

@ntabris
Member

ntabris commented Nov 9, 2022

For many tests, we use spot + fallback to on-demand. But not for all tests.

Here are the counts I see for the last seven days:

[screenshot: table of counts]

Any concern with enabling spot + fallback for all tests?

Also, I'd like to enable "multizone", which means we'd let AWS pick the best AZ (still in us-east-2). This does open the possibility of the scheduler and workers being in different AZs, which in my testing didn't appear to have a performance impact (and shouldn't have a cost impact, given that total scheduler <-> worker traffic is small).

Any concerns about that?

I wouldn't expect any of this to impact perf once a cluster is up (though I can't promise it won't), but it might have some (probably small) impact on the time to spin up clusters.
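
Roughly, the change amounts to passing something like this when the test fixtures create clusters. The option names under `backend_options` are from memory and may not match the fixtures exactly, so treat this as a sketch rather than the actual diff:

```python
# Sketch only: assumes Coiled's backend_options accepts these keys.
import coiled

cluster = coiled.Cluster(
    name="benchmark-spot-test",  # hypothetical cluster name
    n_workers=10,
    backend_options={
        "spot": True,                     # request spot instances
        "spot_on_demand_fallback": True,  # fall back to on-demand if spot capacity is unavailable
        "multizone": True,                # let AWS pick the AZ (still within us-east-2)
    },
)
```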

@phobson
Contributor

phobson commented Nov 9, 2022

This seems like a good move to me. IIUC, the longest benchmark tests are around 4 to 6 minutes (test_shuffle), and that's already on spot + fallback.

@fjetter
Member

fjetter commented Nov 11, 2022

I'm a bit on the fence about this proposed change since our CI is already pretty unreliable. I feel like spot + multizone introduces even more noise that would make failures more likely.

@ntabris
Member Author

ntabris commented Nov 11, 2022

I'm not expecting this to affect the reliability or perf of tests; if it did, we could revert. Is there a specific reason to be concerned it would affect things?

We're already using spot. My basic tests with the scheduler and workers in different zones didn't affect perf as far as I could see (the network between zones is very good), but I'm not completely confident that's true, and to some extent I want to use CI to make sure it holds at scale.
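
In case it's useful, here's roughly how one could spot-check AZ placement from a running cluster. This is just a sketch, not benchmark code: it assumes the cluster runs on EC2 with the instance metadata endpoint reachable (IMDSv1 shown for brevity), and `current_az` is a throwaway helper:

```python
# Sketch: report which AZ the scheduler and each worker landed in.
import requests
from distributed import Client


def current_az() -> str:
    # Query the EC2 instance metadata endpoint on whichever machine this runs on.
    return requests.get(
        "http://169.254.169.254/latest/meta-data/placement/availability-zone",
        timeout=1,
    ).text


client = Client(cluster)  # `cluster` as in the earlier sketch
scheduler_az = client.run_on_scheduler(current_az)
worker_azs = client.run(current_az)  # {worker address: AZ}
print(scheduler_az, set(worker_azs.values()))
```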

@dchudz
Collaborator

dchudz commented Nov 13, 2022

Echoing Nat:

You're already using spot. We expect multi-zone to make it easier to get instances (spot or otherwise).

The only anticipated possible negative change is putting the scheduler and workers in different AZs (but still in the same region!), which so far seems okay. Making the change will also help us understand better whether that's OK.

How about a quick PR that makes the change? Then if we don't see problems in the tests on that PR, go ahead?

@ntabris
Member Author

ntabris commented Nov 13, 2022

Quick PR is #514

@fjetter
Member

fjetter commented Nov 14, 2022

My bigger point is that we should have an experimental/staging setup for tests like this. Otherwise we can hardly tell if any change made a difference and it makes it harder for us to do other things. This is basically our "prod", and you wouldn't roll out experiments on prod either, would you?

@dchudz
Collaborator

dchudz commented Nov 14, 2022

If it's not, then I guess... You'd want to run both versions for a week or something? We can talk to James about getting that set up.

In the meantime, is there any way to make changes like this? Could looking at data from a PR run be good enough?

@dchudz
Collaborator

dchudz commented Nov 14, 2022

Also: the analogy seems bad. End users experience our prod.

These benchmarks are not experienced by end users. They're like our integration tests, not our prod.

@ntabris
Member Author

ntabris commented Nov 14, 2022

> Otherwise we can hardly tell if any change made a difference and it makes it harder for us to do other things

I think this is the main concern, and (just to be clear) I think it's a totally legitimate concern in general. I also don't think it's actually addressed by having a staging environment, since then you end up with multiple changes in your staging env (especially if verifying requires more than a single CI run) and you still have to do the work to determine the root cause of regressions.

I'm going to address that concern for this specific case, but if it's a major concern more generally, it could be worth having a larger conversation about it.

In this specific case:

  1. If there were an impact, we'd expect to see it across dask/distributed versions, whereas code changes should be isolated to upstream.
  2. I've done some one-off testing of multizone and didn't see any concerns.
  3. This is super easy to revert if we do become concerned and need to isolate the root cause of a regression.

@ncclementi
Contributor

I'm on the fence about multizone, since in multiple tests we read data from S3 buckets, and if the cluster is not in the same zone, this will damage performance and slow things down. I also think this will imply higher S3 costs, right?

@ntabris
Member Author

ntabris commented Nov 14, 2022

> if the cluster is not in the same zone, this will damage performance and slow things down. I also think this will imply higher S3 costs, right?

No, I think you're thinking about regions. Clusters would still all be in us-east-2, as are your S3 buckets.

(One can in theory use zone-specific S3 buckets, but I'm pretty sure we aren't.)
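
To illustrate the region-vs-zone distinction: a bucket's location always resolves to a region, never an AZ, so a cluster in any us-east-2 AZ reads the same bucket without cross-region transfer. Quick boto3 sketch (the bucket name is made up):

```python
# Sketch: S3 buckets are scoped to a region, not an AZ.
import boto3

s3 = boto3.client("s3")
resp = s3.get_bucket_location(Bucket="my-benchmark-bucket")  # hypothetical bucket name
print(resp["LocationConstraint"])  # e.g. "us-east-2" -- a region, not an AZ like "us-east-2a"
```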
