Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MLA-1712] Make UnityEnvironment fail fast if the env crashes #4880

Merged
merged 6 commits into from
Jan 23, 2021

Conversation

chriselion
Copy link
Contributor

@chriselion chriselion commented Jan 22, 2021

Proposed change(s)

Currently, if the Unity process crashes hard (e.g. SIGABRT) without cleaning up, then UnityEnvironment will

  • Wait the full timeout (default 60 seconds) before stopping
  • Raise a "The Unity environment took too long to respond" exception

This was uncovered while looking into what happens if you use a CameraSensor with --no-graphics, but could happen in any other crash too.

The main change here is to break the main Connection.poll() call in RpcCommunicator.poll_for_timeout() into smaller poll(), check the subprocess status in between, and raise if the process exited. To avoid RpcCommunicator needing to know about SubprocessEnvironmentManager, this is done by passing an optional Callable to Communicator/RpcCommunicator methods.

Other smaller changes (I can break these into other PRs if desired):

  • Log an error in CameraSensor if the renderer is null.
  • Don't overwrite -logfile args if they're passed via --env-args

Old Behavior

2021-01-21 13:04:38 INFO [learn.py:269] run_seed set to 2742
CrashReporter: initialized
2021-01-21 13:04:39 INFO [environment.py:110] Connected to Unity environment with package version 1.7.2-preview and communication version 1.3.0
...
2021-01-21 13:05:39 INFO [subprocess_env_manager.py:186] UnityEnvironment worker 0: environment stopping.
2021-01-21 13:05:39 INFO [trainer_controller.py:81] Saved Model
2021-01-21 13:05:39 INFO [environment.py:407] Environment shut down with return code -6 (SIGABRT).
Traceback (most recent call last):
  File "/Users/chris.elion/code/ml-agents/venv/bin/mlagents-learn", line 33, in <module>
    sys.exit(load_entry_point('mlagents', 'console_scripts', 'mlagents-learn')())
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/learn.py", line 274, in main
    run_cli(parse_command_line())
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/learn.py", line 270, in run_cli
    run_training(run_seed, options)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/learn.py", line 149, in run_training
    tc.start_learning(env_manager)
  File "/Users/chris.elion/code/ml-agents/ml-agents-envs/mlagents_envs/timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 170, in start_learning
    self._reset_env(env_manager)
  File "/Users/chris.elion/code/ml-agents/ml-agents-envs/mlagents_envs/timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 105, in _reset_env
    env_manager.reset(config=new_config)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/env_manager.py", line 67, in reset
    self.first_step_infos = self._reset_env(config)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/subprocess_env_manager.py", line 299, in _reset_env
    ew.previous_step = EnvironmentStep(ew.recv().payload, ew.worker_id, {}, {})
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/subprocess_env_manager.py", line 95, in recv
    raise env_exception
mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
	 The environment does not need user interaction to launch
	 The Agents' Behavior Parameters > Behavior Type is set to "Default"
	 The environment and the Python interface have compatible versions.

(elapsed time 60 seconds, inaccurate exception message)

New Behavior

2021-01-22 11:14:42 INFO [environment.py:110] Connected to Unity environment with package version 1.7.2-preview and communication version 1.3.0
2021-01-22 11:14:48 INFO [subprocess_env_manager.py:186] UnityEnvironment worker 0: environment stopping.
2021-01-22 11:14:48 INFO [trainer_controller.py:184] Learning was interrupted. Please wait while the graph is generated.
2021-01-22 11:14:48 INFO [trainer_controller.py:81] Saved Model
2021-01-22 11:14:48 INFO [environment.py:422] Environment shut down with return code -6 (SIGABRT).
Traceback (most recent call last):
  File "/Users/chris.elion/code/ml-agents/venv/bin/mlagents-learn", line 33, in <module>
    sys.exit(load_entry_point('mlagents', 'console_scripts', 'mlagents-learn')())
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/learn.py", line 274, in main
    run_cli(parse_command_line())
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/learn.py", line 270, in run_cli
    run_training(run_seed, options)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/learn.py", line 149, in run_training
    tc.start_learning(env_manager)
  File "/Users/chris.elion/code/ml-agents/ml-agents-envs/mlagents_envs/timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 194, in start_learning
    raise ex
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 170, in start_learning
    self._reset_env(env_manager)
  File "/Users/chris.elion/code/ml-agents/ml-agents-envs/mlagents_envs/timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 105, in _reset_env
    env_manager.reset(config=new_config)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/env_manager.py", line 67, in reset
    self.first_step_infos = self._reset_env(config)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/subprocess_env_manager.py", line 299, in _reset_env
    ew.previous_step = EnvironmentStep(ew.recv().payload, ew.worker_id, {}, {})
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/subprocess_env_manager.py", line 95, in recv
    raise env_exception
mlagents_envs.exception.UnityEnvironmentException: Environment shut down with return code -6 (SIGABRT).

(elapsed time 6 seconds, better message)

Useful links (Github issues, JIRA tickets, ML-Agents forum threads etc.)

https://jira.unity3d.com/browse/MLA-1712
https://jira.unity3d.com/browse/MLA-1713
(no jira for the -logfile overwriting)

Types of change(s)

  • Bug fix

Checklist

  • Added tests that prove my fix is effective or that my feature works
  • Updated the changelog (if applicable)

@@ -4,7 +4,7 @@
UnityConnectSettings:
m_ObjectHideFlags: 0
serializedVersion: 1
m_Enabled: 1
m_Enabled: 0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automatically set during player build. Might as well commit it now so we don't have to keep undoing it.


# If the logfile arg isn't already set in the env args,
# try to set it to an output directory
logfile_set = "-logfile" in (arg.lower() for arg in self._additional_args)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was annoying - if you wanted to pass e.g. --env-args -logfile - to mlagents-learn it would get overwritten here.

poll_res = self._proc1.poll()
if poll_res is not None:
exc_msg = self._returncode_to_env_message(self._proc1.returncode)
raise UnityEnvironmentException(exc_msg)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we want to have separate exception messages for returncode 0 and non-0.

signal_name = f" ({signal_name})" if signal_name else ""
return_info = f"Environment shut down with return code {self._proc1.returncode}{signal_name}."
logger.info(return_info)
logger.info(self._returncode_to_env_message(self._proc1.returncode))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DRY'ed this up, can revert if it's not worth it.

so that we can stop sooner and raise a more appropriate error.
"""
wait_time_remaining = self.timeout_wait
callback_timeout_wait = self.timeout_wait // 10
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could make the "10" configurable, but seemed like a good rule of thumb.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't really care if the timeout is configurable, but if you do make it configurable please make sure to check it is less than self.timeout_wait

@@ -189,7 +189,7 @@ def _generate_all_results() -> AllStepResult:
)
_send_response(EnvironmentCommand.ENV_EXITED, ex)
except Exception as ex:
logger.error(
logger.exception(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will print the exception callstack along with the provided message.

@chriselion
Copy link
Contributor Author

Will add unit test and changelog this afternoon.

Copy link
Contributor

@hvpeteet hvpeteet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Chris, looks great to me.

so that we can stop sooner and raise a more appropriate error.
"""
wait_time_remaining = self.timeout_wait
callback_timeout_wait = self.timeout_wait // 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't really care if the timeout is configurable, but if you do make it configurable please make sure to check it is less than self.timeout_wait

ml-agents-envs/mlagents_envs/rpc_communicator.py Outdated Show resolved Hide resolved
ml-agents-envs/mlagents_envs/environment.py Outdated Show resolved Hide resolved
@mock.patch.object(
mlagents_envs.rpc_communicator, "UnityToExternalServicerImplementation"
)
def test_rpc_communicator_initialize_OK(mock_impl, mock_grpc_server):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_rpc_communicator_initialize_OK and test_rpc_communicator_initialize_timeout passed with the old functionality too; test_rpc_communicator_initialize_callback is the new stuff.

@chriselion chriselion merged commit 30bcb01 into master Jan 23, 2021
@delete-merged-branch delete-merged-branch bot deleted the MLA-1712-no-graphics-crash branch January 23, 2021 01:22
@hyuDev
Copy link

hyuDev commented May 13, 2021

HI I got a error msg( mlagents_envs.exception.UnityEnvironmentException: Environment shut down with return code -6 (SIGABRT).)

the visualpushblock example runs macos but not running windows and ubuntu.
all setting is same, only gpu diffierent.
visual example run only cpu env?

@Unity-Technologies Unity-Technologies locked as resolved and limited conversation to collaborators May 14, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants