Use spot + fallback + multizone for all clusters #514

Merged (2 commits) on Nov 16, 2022

@ntabris (Member) commented Nov 10, 2022

Fixes #507

(I also removed the backend_option for sending Prometheus metrics, since that is now set at the account level for the dask-engineering account.)

@ncclementi (Contributor)

@ntabris I merged #529 last night. Would you mind merging main into your PR to see whether everything goes green?

@fjetter You had the most reservations about this. Would you mind commenting on or reviewing it?

@ncclementi (Contributor)

It looks like we have a lot of errors. It is hard for me to tell whether they are related to this PR, but they do not look good. Having the foundation team's opinion would be helpful here.

ERROR tests/benchmarks/test_futures.py::test_large_map_first_work - AssertionError
ERROR tests/benchmarks/test_futures.py::test_memory_efficient - AssertionError
ERROR tests/benchmarks/test_array.py::test_basic_sum[slow-square] - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_vorticity - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_double_diff - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_dot_product - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_map_overlap_sample - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_filter_then_average[50] - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_filter_then_average[100] - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_filter_then_average[200] - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_filter_then_average[255] - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_access_slices[700] - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_access_slices[75] - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_access_slices[1] - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
ERROR tests/benchmarks/test_array.py::test_sum_residuals - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
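
One way to triage a log like the one above is to group the ERROR lines by exception type, which makes it easier to see whether failures share a cause (for example, many connection timeouts against the same scheduler address). The following is only a minimal sketch. The helper name `count_by_exception` and the regular expression are illustrative assumptions, not part of the benchmark suite, and `LOG` holds just two lines copied from the output above.

```python
import re
from collections import Counter

# Two lines copied from the log above; paste the full log to triage a real run.
LOG = """\
ERROR tests/benchmarks/test_futures.py::test_memory_efficient - AssertionError
ERROR tests/benchmarks/test_array.py::test_vorticity - OSError: Timed out trying to connect to tls://3.140.1.229:8786 after 30 s
"""

# Matches "ERROR <test id> - <ExceptionName>", tolerating leading whitespace.
ERROR_LINE = re.compile(r"^\s*ERROR\s+(?P<test>\S+)\s+-\s+(?P<exc>\w+)")


def count_by_exception(log: str) -> Counter:
    """Count pytest ERROR summary lines per exception type (hypothetical helper)."""
    counts = Counter()
    for line in log.splitlines():
        match = ERROR_LINE.match(line)
        if match:
            counts[match.group("exc")] += 1
    return counts


if __name__ == "__main__":
    for exc, n in count_by_exception(LOG).most_common():
        print(f"{exc}: {n}")
```

Applied to the full log above, this would separate the two AssertionError failures in test_futures.py from the thirteen OSError connection timeouts in test_array.py.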

@ntabris (Member, Author) commented Nov 16, 2022

> It looks like we have a lot of errors, is hard to tell for me, whether they are related to this PR or not, but they do not look good. Having the foundation team's opinion would be helpful here.

They aren't related. They happened because a lot of instances were already running in the OSS account and we hit AWS availability issues.

I'm hoping this PR helps with that because, once it's merged and applies to all runs, the instances should be distributed more evenly across the availability zones in us-east-2.

@jrbourbeau (Member) left a comment

Thanks @ntabris. Given #531, and that this PR should help with instance availability errors, I think we should move forward with this change.

@fjetter I know you raised some concerns regarding this PR in #507 and I don't want to lose track of them. If you still have concerns about the specific changes here, do let me know and we can easily revert. If your concerns are not about the specific changes here but about a higher-level point on testing (e.g. having a staging environment for coiled-runtime), then let's have that conversation in a dedicated issue.

Successfully merging this pull request may close these issues.

RFC: enable spot + fallback (+ multizone) for all tests