In #6774, I manually went through the CI dashboard and identified taxonomies of test failures that were similar/the same across many tests. I have a feeling it would be valuable (and reasonable effort) to create a view that does this automatically, helping identify high-impact issues affecting CI.
Generally, there are probably 2 reasons for flaky tests:
1. An individual test is written in a way that's unreliable (too reliant on timing, actually causes a deadlock sometimes, etc.).
2. A bug in dask is causing something unrelated to the test to fail (timeout connecting to the cluster, asyncio error during cluster teardown, etc.). These tend to pop up in many unrelated tests. Because they can happen anywhere, they tend to blow up CI and the flaky test dashboard, and are probably responsible for the majority of failing tests.
I think we could identify the second kind in a more automated way, just by creating another view on the test dashboard that groups failures by the failure message (like how `OSError: Timed out trying to connect to tcp://127.0.0.1:8786 after 5 s` shows up in 13 different tests). There would need to be some fuzziness to this: an exact string match wouldn't work, since details like ports and timings can differ between otherwise identical failures. But that visibility might help us identify, prioritize, and fix the problems faster. It's also possible that these systematic problems would be more likely to affect users?
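To make the "fuzzy grouping" idea concrete, here's a rough sketch of one way it could work: replace the variable parts of each message (addresses, hex ids, numbers) with placeholders, then group on the resulting template. Everything here is an assumption for illustration. `normalize`, `group_failures`, the regex patterns, and the `(test_name, message)` input shape are hypothetical and not taken from the existing dashboard code.

```python
import re
from collections import defaultdict

# Hypothetical normalization rules (assumptions, not from the dashboard).
# Order matters: addresses and hex ids are replaced before bare numbers.
_NORMALIZERS = [
    (re.compile(r"\b\w+://[\w.\-]+:\d+"), "<ADDR>"),  # e.g. tcp://127.0.0.1:8786
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),         # object ids / pointers
    (re.compile(r"\d+(\.\d+)?"), "<N>"),              # remaining numbers
]


def normalize(message):
    """Collapse the variable parts of a failure message into a template."""
    for pattern, token in _NORMALIZERS:
        message = pattern.sub(token, message)
    return message.strip()


def group_failures(failures):
    """Group (test_name, message) pairs by normalized message.

    Returns a list of (template, sorted test names), with the templates
    that affect the most distinct tests first.
    """
    groups = defaultdict(set)
    for test_name, message in failures:
        groups[normalize(message)].add(test_name)
    ranked = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(template, sorted(tests)) for template, tests in ranked]


# Made-up test names and a made-up second port, for illustration only.
failures = [
    ("test_a", "OSError: Timed out trying to connect to tcp://127.0.0.1:8786 after 5 s"),
    ("test_b", "OSError: Timed out trying to connect to tcp://127.0.0.1:41233 after 5 s"),
    ("test_c", "AssertionError: boom"),
]
for template, tests in group_failures(failures):
    print(len(tests), template, tests)
```

Regex templating like this is probably the cheapest form of fuzziness. It would miss messages that differ in wording rather than in numbers, so something like edit-distance clustering might be needed on top, but it may already be enough to surface the big offenders.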
cc @ian-r-rose