Use spot + fallback + multizone for all clusters #514
It looks like we have a lot of errors. It is hard for me to tell whether they are related to this PR or not, but they do not look good. Having the foundation team's opinion would be helpful here.
ERROR tests/benchmarks/test_futures.py::test_large_map_first_work - AssertionError
ERROR tests/benchmarks/test_futures.py::test_memory_efficient - AssertionError
ERROR tests/benchmarks/test_array.py::test_basic_sum[slow-square] - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_vorticity - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_double_diff - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_dot_product - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_map_overlap_sample - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_filter_then_average[50] - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_filter_then_average[100] - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_filter_then_average[200] - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_filter_then_average[255] - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_access_slices[700] - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_access_slices[75] - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_access_slices[1] - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_sum_residuals - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
They aren't related. That happened because there were a lot of instances already running in the OSS account and we just hit AWS availability issues. I'm hoping this PR helps with that because (once it's merged / applies to all runs) the instances should be better distributed in the zones inside
Thanks @ntabris. Given #531, and that this PR should help with instance availability errors, I think we should move forward with this change.
@fjetter I know you raised some concerns regarding this PR in #507 and I don't want to lose track of them. If you still have concerns about the specific changes here, do let me know and we can easily revert. If your concerns are not necessarily about the specific changes here, but more about a higher-level point on testing (e.g. having a staging environment for coiled-runtime), then let's have that conversation in a dedicated issue.
Fixes #507
(I also removed the backend_option to send prometheus metrics since I now have that set at account level for the dask-engineering account.)
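
For readers who want a picture of what "spot + fallback + multizone for all clusters" could look like in a test suite, here is a minimal sketch. It only builds a keyword-argument dictionary and does not create a cluster. The option names `spot`, `spot_on_demand_fallback`, and `multizone`, along with the helper `cluster_kwargs`, are assumptions made for illustration and are not taken from this PR's diff.

```python
from copy import deepcopy

# Assumed option names, inferred from the PR title; check the actual
# diff and the Coiled documentation before relying on them.
DEFAULT_BACKEND_OPTIONS = {
    "spot": True,                     # prefer spot instances
    "spot_on_demand_fallback": True,  # fall back to on-demand if spot is unavailable
    "multizone": True,                # allow instances in more than one zone
}


def cluster_kwargs(**overrides):
    """Return cluster keyword arguments with shared defaults applied.

    Hypothetical helper: per-test overrides win over the defaults, and
    backend_options are merged key by key instead of being replaced.
    """
    kwargs = {"backend_options": deepcopy(DEFAULT_BACKEND_OPTIONS)}
    backend_overrides = overrides.pop("backend_options", {})
    kwargs["backend_options"].update(backend_overrides)
    kwargs.update(overrides)
    return kwargs


if __name__ == "__main__":
    # A test that needs a single-zone cluster overrides only that key.
    print(cluster_kwargs(n_workers=10, backend_options={"multizone": False}))
```

Keeping the defaults in one place means every cluster fixture picks up the same settings, while an individual benchmark can still opt out of a single option.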