feat: exit_on_unhandled_faults config option; attempt to recover (or exit) after too long a period of inactivity #150

Merged: 13 commits into main from exit-on-unhandled-failures, Mar 6, 2024

@tazlin (Member) commented Mar 5, 2024

  • Sets the minimum required version of reGen to 4.2.4 because of the critical bug fixes it includes.
  • Fixes a critical bug that occurs when a LoRa takes too long to download, which can make the worker unstable.
  • Fixes LoRa retries that took too long or were attempted in doomed situations (such as a missing civitai token).
  • Adds a fail-safe zombie detection:
    • Triggered when all processes have done nothing (or no jobs were submitted) for the amount of time set by process_timeout (900 seconds by default) AND the last job pop message was a valid job pop (and not a "no jobs" message).
      • If exit_on_unhandled_faults is true:
        • The worker attempts to exit as quickly as possible, without reporting any faults.
      • If exit_on_unhandled_faults is false (the default):
        • All of the inference processes are restarted and the worker tries to recover.
  • Adds a good deal of logging to help diagnose future problems of this nature.
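
For reviewers, the fail-safe decision described above can be sketched roughly as follows. This is only an illustration of the rules in this description, not the code in this PR: `WorkerState`, `Action`, and `decide_failsafe_action` are hypothetical names, and treating an idle time exactly equal to process_timeout as a trigger is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes of the fail-safe check."""
    NONE = "none"
    RESTART_PROCESSES = "restart_processes"
    EXIT = "exit"


@dataclass
class WorkerState:
    """Hypothetical snapshot of the worker; field names are illustrative."""
    seconds_since_last_activity: float
    last_pop_was_valid_job: bool


def decide_failsafe_action(
    state: WorkerState,
    process_timeout: float = 900.0,
    exit_on_unhandled_faults: bool = False,
) -> Action:
    # Not idle for long enough yet: nothing to do.
    if state.seconds_since_last_activity < process_timeout:
        return Action.NONE
    # Idle because the last pop reported "no jobs": this is expected, not a zombie.
    if not state.last_pop_was_valid_job:
        return Action.NONE
    # Zombie state: exit quickly if configured to, otherwise try to recover.
    if exit_on_unhandled_faults:
        return Action.EXIT
    return Action.RESTART_PROCESSES
```

With the defaults, `decide_failsafe_action(WorkerState(1000, True))` selects the restart path, and passing `exit_on_unhandled_faults=True` selects the exit path instead.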

@tazlin changed the title from "feat: exit_on_unhandled_faults config option; exit after 30 mins inactivity" to "feat: exit_on_unhandled_faults config option; exit after too long a period of inactivity" on Mar 5, 2024
@tazlin changed the title from "feat: exit_on_unhandled_faults config option; exit after too long a period of inactivity" to "feat: exit_on_unhandled_faults config option; attempt to recover (or exit) after too long a period of inactivity" on Mar 5, 2024
@tazlin merged commit db0c626 into main on Mar 6, 2024
3 checks passed
@tazlin deleted the exit-on-unhandled-failures branch on March 6, 2024, 13:02