-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: enable spot + fallback (+ multizone) for all tests #507
Comments
This seems like a good move to me. IIUC, the longest benchmark tests are sound 4 to 6 minutes (test_shuffle), and that's already on spot + fallback. |
I'm a bit on the fence about this proposed change since our CI is already pretty unreliable. I feel like spot + multizone introduces even more noise that would make failure more likely. |
I'm not expecting this to affect reliability or perf of tests, if it did then we could revert. Is there a specific reason to be concerned it would affect things? We're already using spot. My basic tests of scheduler and worker in different zones didn't affect perf as far as I could see (network is very good between zones), but I'm not completely confident that's true and to some extent what to use CI to make sure that's true at scale. |
Echoing Nat: You're already using spot. We expect multi-zone to make it easier to get instances (spot or otherwise). The only possible negative change that's anticipated is putting the scheduler and workers in different AZs (but still the same region!), which so far seems okay. It'll help us to understand better if that's OK. How about a quick PR that makes the change? Then if we don't see problems in the tests on that PR, go ahead? |
Quick PR is #514 |
My bigger point is that we should have an experimental/staging setup for tests like this. Otherwise we can hardly tell if any change made a difference and it makes it harder for us to do other things. This is basically our "prod" and you wouldn't roll our experiments on prod either, would you? |
If it's not, then I guess... You'd want to run both versions for a week or something? We can talk to James about getting that set up. In the meantime is there any way to make changes like this? Could looking at data from a PR run be good enough? |
Also: the analogy seems bad. End users experience our prod. These benchmarks are not experienced by end users. They're like our integration tests, not our prod. |
I think this is the main concern, and (just to be clear) I think it's a totally legitimate concern in general. I also don't think it's actually addressed by having a staging environment, since then you end up with multiple changes in your staging env (especially if verifying requires more than a single CI run) and you still have to do the work to determine root cause of regressions. I'm going to address that concern for this specific case, but if it's a major concern then it could be worth having a large conversation about. In this specific case:
|
I'm on the fence about multizone, as in multiple tests we do read data from s3 buckets and if the cluster is not in the same zone, this will damage performance and slow things down. I also think this will imply higher s3 costs, right? |
No, I think you're thinking about regions. Clusters would still all be in (One can in theory use zone-specific S3 buckets, but I'm pretty sure we aren't.) |
For many tests, we use spot + fallback to on-demand. But not for all tests.
Here's counts I see for last seven days:
Any concern with enabling spot + fallback for all tests?
Also, I'd like to enable "multizone", which means we'd let AWS pick best AZ (still in
us-east-2
). This does open possibility of scheduler and workers in different AZ, which in my testing didn't appear to have performance impact (and shouldn't have cost impact given that total scheduler <-> worker traffic is small).Any concerns about that?
I wouldn't expect any of this to impact perf once cluster is up (though I can't promise it won't), though it might have some (probably small) impact on time to spin up clusters.
The text was updated successfully, but these errors were encountered: