Restart crashed Unity environments #5553

Merged 7 commits on Oct 7, 2021

@hvpeteet hvpeteet commented Sep 28, 2021

Proposed change(s)

Update the SubprocessEnvManager to restart workers when their underlying Unity environments crash.
When the manager receives an ENV_EXITED signal from a worker, it will now:

  1. Record all failures coming through the step queue and drop all other messages.
  2. Purge any pending trajectories as they may belong to a crashed worker or be corrupted.
  3. Restart all failed workers (up to a configurable limit).
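
The three steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: `Message`, `Worker`, `drain_failures`, `recover`, and the string constants are hypothetical names, the step queue is modelled with a plain `queue.Queue`, and a "restart" only increments a counter instead of respawning a process.

```python
import queue
from dataclasses import dataclass, field
from typing import Dict, List, Optional

ENV_EXITED = "ENV_EXITED"  # hypothetical stand-in for the crash signal
STEP = "STEP"              # hypothetical stand-in for a normal step message


@dataclass
class Message:
    worker_id: int
    cmd: str
    payload: Optional[object] = None


@dataclass
class Worker:
    worker_id: int
    restarts: int = 0
    pending_trajectories: List[object] = field(default_factory=list)


def drain_failures(step_queue: "queue.Queue[Message]") -> Dict[int, object]:
    """Step 1: record every failure in the queue and drop all other messages."""
    failures: Dict[int, object] = {}
    while True:
        try:
            msg = step_queue.get_nowait()
        except queue.Empty:
            return failures
        if msg.cmd == ENV_EXITED:
            failures[msg.worker_id] = msg.payload
        # Any non-failure message is intentionally discarded.


def recover(
    workers: Dict[int, Worker],
    step_queue: "queue.Queue[Message]",
    max_restarts: int = 10,
) -> List[int]:
    """Run the three recovery steps; return the ids of restarted workers."""
    failures = drain_failures(step_queue)
    if not failures:
        return []
    # Step 2: purge pending trajectories, which may be stale or corrupted.
    for worker in workers.values():
        worker.pending_trajectories.clear()
    # Step 3: restart each failed worker, up to the limit (-1 means no limit).
    restarted: List[int] = []
    for worker_id in sorted(failures):
        worker = workers[worker_id]
        if max_restarts != -1 and worker.restarts >= max_restarts:
            raise RuntimeError(f"worker {worker_id} exceeded its restart limit")
        worker.restarts += 1  # a real manager would respawn the process here
        restarted.append(worker_id)
    return restarted
```

Purging trajectories for every worker, not only the failed ones, is an assumption made here to keep the sketch simple and conservative.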

This behavior can be limited via a rate limit, max lifetime limit, or both. The configuration options for both are shown below with their default values.

⚠️ Each of these options applies to a single environment. If num_envs > 1, the limit applies separately to each replica (for example, num_envs = 2 spawns 2 Unity environments, each of which can be restarted 10 times).

env_settings:
  # Can restart 10 times over the lifetime of the experiment.
  max_lifetime_restarts: 10
  # Rate limit of 1 failure per 60s
  restarts_rate_limit_n: 1
  restarts_rate_limit_period_s: 60

They can also be passed as CLI arguments:

--max-lifetime-restarts
--restarts-rate-limit-n
--restarts-rate-limit-period-s

Disabling this feature

  • Rate limiting can be turned off by setting --restarts-rate-limit-n=-1
  • Lifetime limiting can be turned off by setting --max-lifetime-restarts=-1
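
As an illustration of how the two limits and the -1 "disabled" value could interact, here is a small sketch. It is not the implementation from this PR: the `RestartLimiter` class and its method are hypothetical, and it assumes "n restarts per period" means at most n restarts inside any sliding window of that length. Only the three option names and their defaults come from the description above.

```python
from collections import deque
from typing import Deque


class RestartLimiter:
    """Hypothetical per-environment limiter; one instance per replica."""

    def __init__(
        self,
        max_lifetime_restarts: int = 10,
        restarts_rate_limit_n: int = 1,
        restarts_rate_limit_period_s: float = 60,
    ) -> None:
        self.max_lifetime_restarts = max_lifetime_restarts
        self.restarts_rate_limit_n = restarts_rate_limit_n
        self.restarts_rate_limit_period_s = restarts_rate_limit_period_s
        self.lifetime_count = 0
        self.recent: Deque[float] = deque()  # timestamps of recent restarts

    def try_restart(self, now: float) -> bool:
        """Return True and record the restart if both limits allow it."""
        # Forget restarts that have fallen out of the sliding window.
        while self.recent and now - self.recent[0] >= self.restarts_rate_limit_period_s:
            self.recent.popleft()
        # Lifetime limit; -1 disables it.
        if (
            self.max_lifetime_restarts != -1
            and self.lifetime_count >= self.max_lifetime_restarts
        ):
            return False
        # Rate limit; -1 disables it.
        if (
            self.restarts_rate_limit_n != -1
            and len(self.recent) >= self.restarts_rate_limit_n
        ):
            return False
        self.lifetime_count += 1
        self.recent.append(now)
        return True
```

With num_envs = 2, this sketch would use two independent `RestartLimiter` instances, so each environment gets its own budget of restarts.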

Useful links (GitHub issues, JIRA tickets, ML-Agents forum threads etc.)

Internal JIRA: https://jira.unity3d.com/browse/MLA-1344

Types of change(s)

  • New feature

Checklist

  • Added tests that prove my fix is effective or that my feature works
  • Updated the changelog (if applicable)
  • Updated the documentation (if applicable)

@hvpeteet hvpeteet marked this pull request as ready for review September 28, 2021 21:43
@hvpeteet hvpeteet changed the title [WIP] Restart crashed Unity environments Restart crashed Unity environments Sep 28, 2021
@hvpeteet hvpeteet assigned hvpeteet and unassigned hvpeteet Sep 28, 2021
@hvpeteet hvpeteet requested review from miguelalonsojr and maryamhonari and removed request for vincentpierre September 29, 2021 20:35
@miguelalonsojr miguelalonsojr left a comment

Big fat green light!!!!

@maryamhonari maryamhonari left a comment

Looks great!
