In the past months, we've seen occasional deadlock issues connected to the scheduler/worker state machine. To debug this, I wrote a small script which essentially fetches and serializes the entire state of a stuck cluster (see below). Think Scheduler.story + Scheduler.get_logs + much more for everything.
The script is very crude and tries to get the information without trying to properly serialize all objects. The emphasis was rather on having a fully JSON-serializable representation of the cluster to allow for easier portability, formatting, etc.
I wanted to preserve this for posterity in case this helps anyone.
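To illustrate what "crude but fully JSON serializable" can mean in practice, here is a minimal sketch, not the script itself. The helper name `to_jsonable` and the `max_depth` cutoff are assumptions made for this example; anything that is not a plain builtin simply falls back to its `repr()`.

```python
import json
from collections import deque
from collections.abc import Mapping


def to_jsonable(obj, _depth=0, max_depth=10):
    """Recursively convert ``obj`` into JSON-compatible builtins.

    Hypothetical helper: unknown objects are not serialized properly,
    they are replaced by their repr() string.
    """
    if obj is None or isinstance(obj, (bool, int, float, str)):
        return obj
    if _depth >= max_depth:
        # Guard against very deep or self-referencing structures.
        return repr(obj)
    if isinstance(obj, Mapping):
        # JSON object keys must be strings, so stringify tuple keys etc.
        return {str(k): to_jsonable(v, _depth + 1, max_depth) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, set, frozenset, deque)):
        return [to_jsonable(v, _depth + 1, max_depth) for v in obj]
    return repr(obj)


# Usage: the result can always be passed to json.dumps.
state = {"tasks": {("x", 1): object()}, "log": deque([("a", 1)]), "n": 3}
text = json.dumps(to_jsonable(state))
```

The trade-off is that information is lost wherever `repr()` kicks in, which is acceptable for a debugging dump but not for a round trip.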
Further, I am wondering whether we want to have such functionality as a first-class citizen. Think of Client.collect_cluster_state (note: this may create several GB of data). The client function would then try to do a better job. Sometimes it would be sufficient to extend or modify the existing identity methods, though much more thoroughly, of course. If I were to implement this properly, I would likely start by attaching a to_dict method to all our classes which yields a JSON/YAML-serializable representation (including task state, worker state, etc.).
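A rough sketch of what the `to_dict` idea could look like. The classes `TaskStateSketch` and `WorkerStateSketch` are hypothetical stand-ins invented for this example; they are not the real scheduler or worker classes, and the field names are assumptions.

```python
import json


class TaskStateSketch:
    """Hypothetical stand-in for a task-state object."""

    def __init__(self, key, state, dependencies=()):
        self.key = key
        self.state = state
        self.dependencies = set(dependencies)

    def to_dict(self):
        # Refer to dependencies by key only, so that a cyclic object
        # graph does not make the serialization recurse forever.
        return {
            "key": self.key,
            "state": self.state,
            "dependencies": sorted(dep.key for dep in self.dependencies),
        }


class WorkerStateSketch:
    """Hypothetical stand-in for a worker-state object."""

    def __init__(self, address):
        self.address = address
        self.tasks = {}

    def to_dict(self):
        return {
            "address": self.address,
            "tasks": {key: ts.to_dict() for key, ts in self.tasks.items()},
        }


# Usage: nested to_dict calls produce something json.dumps accepts as-is.
a = TaskStateSketch("a", "memory")
b = TaskStateSketch("b", "waiting", dependencies=[a])
worker = WorkerStateSketch("tcp://127.0.0.1:1234")
worker.tasks = {"a": a, "b": b}
dumped = json.dumps(worker.to_dict(), indent=2)
```

Flattening object references to keys is one possible design choice here; it keeps the output small and avoids cycles at the cost of needing a lookup to follow a reference.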
Thoughts? Would people consider this helpful?
Disclaimer
The script strips code, tracebacks, and exceptions from all output, so there should be no IP leak, but I don't guarantee anything if there are sensitive logs, task keys, IP addresses, etc.
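For illustration, stripping such fields from an already serialized dump could look roughly like this. It is a sketch under assumptions: the key names in `SENSITIVE_KEYS` and the helper `strip_sensitive` are made up for this example and may not match the keys the actual script removes.

```python
# Assumed key names; adjust to whatever the real dump uses.
SENSITIVE_KEYS = frozenset({"code", "traceback", "exception"})


def strip_sensitive(obj, keys=SENSITIVE_KEYS):
    """Return a copy of a JSON-like structure without the given dict keys."""
    if isinstance(obj, dict):
        return {k: strip_sensitive(v, keys) for k, v in obj.items() if k not in keys}
    if isinstance(obj, list):
        return [strip_sensitive(v, keys) for v in obj]
    # Scalars (str, int, float, bool, None) are returned unchanged.
    return obj


# Usage: the input is left untouched, a cleaned copy is returned.
dump = {"tasks": [{"key": "x", "traceback": "...", "nested": {"code": "f()", "state": "error"}}]}
clean = strip_sensitive(dump)
```

Note that this only removes whole fields by name. Sensitive values embedded in log messages or task keys would pass through unchanged, which is exactly the caveat above.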
Last script update: 2021-09-29