Skip to content
This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

[bg] Catch exceptions in workers #3575

Merged
merged 1 commit into from
Apr 12, 2021
Merged

[bg] Catch exceptions in workers #3575

merged 1 commit into from
Apr 12, 2021

Conversation

stephenroller
Copy link
Contributor

@stephenroller stephenroller commented Apr 8, 2021

Patch description
If exceptions happen in (all) background workers, currently training just hangs indefinitely while we wait for a batch from the queue. This patch adds a timeout, and checks background workers for exceptions when that timeout is hit. I had previously tested checking on these exceptions before pulling from the queue, but it was devastating to speed. This patch seems to strike a good balance.

In manual testing, this patch displayed these properties:

  • If some, but not all workers threw exceptions, then training continued with just fewer workers. Sometimes there was a print of an exception, but not always. Speed was similar to launching with an equivalent smaller --num-workers.
  • If all workers threw exceptions, then there was a LOT of log spew and the program ended. But the log spew did contain the root exception in it and thus a hint for the developer.
  • If no workers threw exceptions, speed was the approximately same as before this patch (within random variation on an 8 gpu 8 worker 10 minute run)

Testing steps
Considerable manual testing.

Copy link
Contributor

@EricMichaelSmith EricMichaelSmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, great!

@stephenroller stephenroller marked this pull request as ready for review April 12, 2021 20:41
@stephenroller stephenroller merged commit fe20239 into master Apr 12, 2021
@stephenroller stephenroller deleted the exc branch April 12, 2021 20:41
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants