Fine performance metrics: Meter wasted partial compute time after losing a worker #7678

Open
crusaderky opened this issue Mar 17, 2023 · 0 comments

When we lose a worker, any time spent partially executing a task is lost. The task transitions back from processing to released on the scheduler, so that it may be executed somewhere else.

When the scheduler receives metrics from the heartbeat (#7666), it normally forgets immediately which worker they came from. It should make an exception for currently-running metrics (#7677) and keep track of which worker they came from.

When a worker dies, the scheduler should subtract all currently-running time for that worker and reclassify it as ("execute", <prefix>, "killed-worker", "seconds").

Additionally, the scheduler should add to this measure the time elapsed between the last heartbeat and the worker's death, counted once for each task that was executing as of the last received heartbeat and did not complete in the meantime.
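
A minimal sketch of the bookkeeping described above, to make the arithmetic concrete. It is not taken from the distributed codebase: the class `WastedComputeMeter`, its methods, and the shape of the heartbeat payload (task key mapped to a prefix and the seconds executed so far) are all hypothetical. Only the metric label ("execute", <prefix>, "killed-worker", "seconds") comes from this issue.

```python
from collections import defaultdict


class WastedComputeMeter:
    """Hypothetical scheduler-side bookkeeping; not the real distributed API."""

    def __init__(self):
        # Cumulative metrics, keyed like ("execute", prefix, "killed-worker", "seconds")
        self.cumulative = defaultdict(float)
        # worker -> {task key: (prefix, seconds executed as of the last heartbeat)}
        self.running = {}
        # worker -> timestamp of the last heartbeat received
        self.last_heartbeat = {}

    def heartbeat(self, worker, now, running):
        """Remember which worker reported which currently-running time."""
        self.running[worker] = dict(running)
        self.last_heartbeat[worker] = now

    def task_finished(self, worker, key):
        """A task that completes after the heartbeat is no longer at risk."""
        self.running.get(worker, {}).pop(key, None)

    def worker_died(self, worker, now):
        """Reclassify the dead worker's currently-running time as wasted."""
        tasks = self.running.pop(worker, {})
        last = self.last_heartbeat.pop(worker, now)
        # Time the tasks kept running unobserved after the last heartbeat
        gap = max(0.0, now - last)
        for prefix, seconds in tasks.values():
            label = ("execute", prefix, "killed-worker", "seconds")
            self.cumulative[label] += seconds + gap


meter = WastedComputeMeter()
meter.heartbeat("w1", now=10.0, running={"x-1": ("x", 2.0), "x-2": ("x", 1.0)})
meter.task_finished("w1", "x-2")  # completed between heartbeat and death
meter.worker_died("w1", now=10.5)
print(dict(meter.cumulative))
```

In this sketch the per-worker `running` map stands in for the currently-running metric, so removing the worker's entry plays the role of the subtraction, and only `x-1` contributes: 2.0 seconds reported plus the 0.5 second gap.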
