
test_work_stealing_on_scaling_up failing #336

Closed
ian-r-rose opened this issue Sep 15, 2022 · 7 comments · Fixed by #351

Comments

@ian-r-rose
Contributor

#308 was merged a few days ago. The test it added appears to be skipped for 2022.6.0 (cf. #306), but it runs against upstream.
I have seen it fail a number of times with killed workers:

https://github.com/coiled/coiled-runtime/actions/runs/3062687464/jobs/4943938999
https://github.com/coiled/coiled-runtime/actions/runs/3062426988/jobs/4943389537

We may want to skip it for the time being. cc @hendrikmakait
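For what it's worth, a minimal sketch of such a skip, assuming a plain pytest marker (the real test takes fixtures, omitted here):

  import pytest

  # Hypothetical skip marker; reason text is illustrative.
  @pytest.mark.skip(reason="Killed workers on upstream; see coiled/coiled-runtime#336")
  def test_work_stealing_on_scaling_up():
      ...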

@fjetter
Member

fjetter commented Sep 16, 2022

I'd like to point out that this is not a flakiness problem; it now fails very reliably. There appears to be a problem with the software environments:

  +-------------+-----------------------+-----------+----------+
  | Package     | client                | scheduler | workers  |
  +-------------+-----------------------+-----------+----------+
  | dask        | 2022.9.0+22.g982376ef | 2022.6.0  | 2022.6.0 |
  | distributed | 2022.9.0+13.g1fd07f03 | 2022.6.0  | 2022.6.0 |
  +-------------+-----------------------+-----------+----------+

Did this fail before package_sync as well? The CI jobs you are pointing to are using package_sync.
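As an aside, distributed's built-in version check can surface this kind of mismatch from the client side. A minimal sketch, with a placeholder scheduler address:

  from distributed import Client

  client = Client("tcp://<scheduler-address>:8786")  # placeholder address
  # check=True raises a ValueError if client, scheduler, and workers
  # disagree on package versions, rather than only logging a warning.
  client.get_versions(check=True)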

@fjetter
Member

fjetter commented Sep 16, 2022

This test is using cluster.scale to upscale an existing cluster. @shughes-uk is this supported with package sync?
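For context, roughly the pattern in question; the worker counts and the package_sync flag here are illustrative, not the test's exact code:

  import coiled
  from distributed import Client

  # Start small, then scale up. The question is whether workers added
  # by scale() receive the same package-synced environment as the
  # workers that were present at cluster creation.
  cluster = coiled.Cluster(n_workers=2, package_sync=True)
  client = Client(cluster)

  cluster.scale(10)
  client.wait_for_workers(10)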

@fjetter changed the title from test_work_stealing_on_scaling_up flaky to test_work_stealing_on_scaling_up failing on Sep 16, 2022
@hendrikmakait
Member

hendrikmakait commented Sep 16, 2022

Note the difference between the mismatch reported on CI:

  +-------------+-----------------------+-----------+----------+
  | Package     | client                | scheduler | workers  |
  +-------------+-----------------------+-----------+----------+
  | dask        | 2022.9.0+25.g1a8533fd | 2022.6.0  | 2022.6.0 |
  | distributed | 2022.9.0+13.g1fd07f03 | 2022.6.0  | 2022.6.0 |
  +-------------+-----------------------+-----------+----------+

and the mismatch reported for worker i-04790bd562d506836 on the corresponding cluster:

Sep 16 00:34:42 ip-10-0-11-96 cloud-init[1125]: +-------------+-------------+-----------+---------------------------------------+
Sep 16 00:34:42 ip-10-0-11-96 cloud-init[1125]: | Package     | This Worker | scheduler | workers                               |
Sep 16 00:34:42 ip-10-0-11-96 cloud-init[1125]: +-------------+-------------+-----------+---------------------------------------+
Sep 16 00:34:42 ip-10-0-11-96 cloud-init[1125]: | dask        | 2022.6.0    | 2022.6.0  | {'2022.6.0', '2022.9.0+25.g1a8533fd'} |
Sep 16 00:34:42 ip-10-0-11-96 cloud-init[1125]: | distributed | 2022.6.0    | 2022.6.0  | {'2022.6.0', '2022.9.0+13.g1fd07f03'} |
Sep 16 00:34:42 ip-10-0-11-96 cloud-init[1125]: +-------------+-------------+-----------+---------------------------------------+
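In other words, this worker came up with the base 2022.6.0 environment while other workers in the same cluster report the dev versions. One way to see per-worker versions from a connected client (a sketch; `client` is assumed, and this is not part of the test):

  # Report dask/distributed versions per worker (keyed by worker
  # address), to spot which workers have the stale environment.
  def report_versions():
      import dask
      import distributed
      return dask.__version__, distributed.__version__

  per_worker = client.run(report_versions)
  print(per_worker)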

@ntabris
Member

ntabris commented Sep 16, 2022

Possibly the same problem as https://github.com/coiled/platform/issues/84

@ian-r-rose
Contributor Author

Thanks @ntabris, is this deployed yet? If not, I'd propose skipping this test until then.

@ntabris
Member

ntabris commented Sep 16, 2022

> is this deployed yet?

No, we were waiting on pyzmq wheels for linux x86, which are now out. I'd expect a Coiled release today or Monday.

@shughes-uk
Contributor

Should be deployed now
