[MLA-1712] Make UnityEnvironment fail fast if the env crashes #4880

chriselion · 2021-01-22T19:24:15Z

Proposed change(s)

Currently, if the Unity process crashes hard (e.g. SIGABRT) without cleaning up, then UnityEnvironment will

Wait the full timeout (default 60 seconds) before stopping
Raise a "The Unity environment took too long to respond" exception

This was uncovered while looking into what happens if you use a CameraSensor with --no-graphics, but could happen in any other crash too.

The main change here is to break the main Connection.poll() call in RpcCommunicator.poll_for_timeout() into smaller poll(), check the subprocess status in between, and raise if the process exited. To avoid RpcCommunicator needing to know about SubprocessEnvironmentManager, this is done by passing an optional Callable to Communicator/RpcCommunicator methods.

Other smaller changes (I can break these into other PRs if desired):

Log an error in CameraSensor if the renderer is null.
Don't overwrite -logfile args if they're passed via --env-args

Old Behavior

2021-01-21 13:04:38 INFO [learn.py:269] run_seed set to 2742
CrashReporter: initialized
2021-01-21 13:04:39 INFO [environment.py:110] Connected to Unity environment with package version 1.7.2-preview and communication version 1.3.0
...
2021-01-21 13:05:39 INFO [subprocess_env_manager.py:186] UnityEnvironment worker 0: environment stopping.
2021-01-21 13:05:39 INFO [trainer_controller.py:81] Saved Model
2021-01-21 13:05:39 INFO [environment.py:407] Environment shut down with return code -6 (SIGABRT).
Traceback (most recent call last):
  File "/Users/chris.elion/code/ml-agents/venv/bin/mlagents-learn", line 33, in <module>
    sys.exit(load_entry_point('mlagents', 'console_scripts', 'mlagents-learn')())
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/learn.py", line 274, in main
    run_cli(parse_command_line())
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/learn.py", line 270, in run_cli
    run_training(run_seed, options)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/learn.py", line 149, in run_training
    tc.start_learning(env_manager)
  File "/Users/chris.elion/code/ml-agents/ml-agents-envs/mlagents_envs/timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 170, in start_learning
    self._reset_env(env_manager)
  File "/Users/chris.elion/code/ml-agents/ml-agents-envs/mlagents_envs/timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 105, in _reset_env
    env_manager.reset(config=new_config)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/env_manager.py", line 67, in reset
    self.first_step_infos = self._reset_env(config)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/subprocess_env_manager.py", line 299, in _reset_env
    ew.previous_step = EnvironmentStep(ew.recv().payload, ew.worker_id, {}, {})
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/subprocess_env_manager.py", line 95, in recv
    raise env_exception
mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
	 The environment does not need user interaction to launch
	 The Agents' Behavior Parameters > Behavior Type is set to "Default"
	 The environment and the Python interface have compatible versions.

(elapsed time 60 seconds, inaccurate exception message)

New Behavior

2021-01-22 11:14:42 INFO [environment.py:110] Connected to Unity environment with package version 1.7.2-preview and communication version 1.3.0
2021-01-22 11:14:48 INFO [subprocess_env_manager.py:186] UnityEnvironment worker 0: environment stopping.
2021-01-22 11:14:48 INFO [trainer_controller.py:184] Learning was interrupted. Please wait while the graph is generated.
2021-01-22 11:14:48 INFO [trainer_controller.py:81] Saved Model
2021-01-22 11:14:48 INFO [environment.py:422] Environment shut down with return code -6 (SIGABRT).
Traceback (most recent call last):
  File "/Users/chris.elion/code/ml-agents/venv/bin/mlagents-learn", line 33, in <module>
    sys.exit(load_entry_point('mlagents', 'console_scripts', 'mlagents-learn')())
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/learn.py", line 274, in main
    run_cli(parse_command_line())
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/learn.py", line 270, in run_cli
    run_training(run_seed, options)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/learn.py", line 149, in run_training
    tc.start_learning(env_manager)
  File "/Users/chris.elion/code/ml-agents/ml-agents-envs/mlagents_envs/timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 194, in start_learning
    raise ex
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 170, in start_learning
    self._reset_env(env_manager)
  File "/Users/chris.elion/code/ml-agents/ml-agents-envs/mlagents_envs/timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 105, in _reset_env
    env_manager.reset(config=new_config)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/env_manager.py", line 67, in reset
    self.first_step_infos = self._reset_env(config)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/subprocess_env_manager.py", line 299, in _reset_env
    ew.previous_step = EnvironmentStep(ew.recv().payload, ew.worker_id, {}, {})
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/subprocess_env_manager.py", line 95, in recv
    raise env_exception
mlagents_envs.exception.UnityEnvironmentException: Environment shut down with return code -6 (SIGABRT).

(elapsed time 6 seconds, better message)

Useful links (Github issues, JIRA tickets, ML-Agents forum threads etc.)

https://jira.unity3d.com/browse/MLA-1712
https://jira.unity3d.com/browse/MLA-1713
(no jira for the -logfile overwriting)

Types of change(s)

Bug fix

Checklist

Added tests that prove my fix is effective or that my feature works
Updated the changelog (if applicable)

chriselion · 2021-01-22T19:26:44Z

Project/ProjectSettings/UnityConnectSettings.asset

@@ -4,7 +4,7 @@
 UnityConnectSettings:
  m_ObjectHideFlags: 0
  serializedVersion: 1
-  m_Enabled: 1
+  m_Enabled: 0


Automatically set during player build. Might as well commit it now so we don't have to keep undoing it.

chriselion · 2021-01-22T19:28:29Z

ml-agents-envs/mlagents_envs/environment.py

+
+        # If the logfile arg isn't already set in the env args,
+        # try to set it to an output directory
+        logfile_set = "-logfile" in (arg.lower() for arg in self._additional_args)


This was annoying - if you wanted to pass e.g. --env-args -logfile - to mlagents-learn it would get overwritten here.

ml-agents-envs/mlagents_envs/environment.py

chriselion · 2021-01-22T19:30:17Z

ml-agents-envs/mlagents_envs/environment.py

+        poll_res = self._proc1.poll()
+        if poll_res is not None:
+            exc_msg = self._returncode_to_env_message(self._proc1.returncode)
+            raise UnityEnvironmentException(exc_msg)


Not sure if we want to have separate exception messages for returncode 0 and non-0.

chriselion · 2021-01-22T19:30:56Z

ml-agents-envs/mlagents_envs/environment.py

-                signal_name = f" ({signal_name})" if signal_name else ""
-                return_info = f"Environment shut down with return code {self._proc1.returncode}{signal_name}."
-                logger.info(return_info)
+                logger.info(self._returncode_to_env_message(self._proc1.returncode))


DRY'ed this up, can revert if it's not worth it.

chriselion · 2021-01-22T19:31:49Z

ml-agents-envs/mlagents_envs/rpc_communicator.py

+        so that we can stop sooner and raise a more appropriate error.
+        """
+        wait_time_remaining = self.timeout_wait
+        callback_timeout_wait = self.timeout_wait // 10


Could make the "10" configurable, but seemed like a good rule of thumb.

Don't really care if the timeout is configurable, but if you do make it configurable please make sure to check it is less than self.timeout_wait

chriselion · 2021-01-22T19:32:59Z

ml-agents/mlagents/trainers/subprocess_env_manager.py

@@ -189,7 +189,7 @@ def _generate_all_results() -> AllStepResult:
        )
        _send_response(EnvironmentCommand.ENV_EXITED, ex)
    except Exception as ex:
-        logger.error(
+        logger.exception(


This will print the exception callstack along with the provided message.

chriselion · 2021-01-22T19:33:41Z

Will add unit test and changelog this afternoon.

hvpeteet

Thanks Chris, looks great to me.

hvpeteet · 2021-01-22T20:49:01Z

ml-agents-envs/mlagents_envs/rpc_communicator.py

+        so that we can stop sooner and raise a more appropriate error.
+        """
+        wait_time_remaining = self.timeout_wait
+        callback_timeout_wait = self.timeout_wait // 10


Don't really care if the timeout is configurable, but if you do make it configurable please make sure to check it is less than self.timeout_wait

ml-agents-envs/mlagents_envs/rpc_communicator.py

ml-agents-envs/mlagents_envs/environment.py

chriselion · 2021-01-23T00:25:10Z

ml-agents-envs/mlagents_envs/tests/test_rpc_communicator.py

+@mock.patch.object(
+    mlagents_envs.rpc_communicator, "UnityToExternalServicerImplementation"
+)
+def test_rpc_communicator_initialize_OK(mock_impl, mock_grpc_server):


test_rpc_communicator_initialize_OK and test_rpc_communicator_initialize_timeout passed with the old functionality too; test_rpc_communicator_initialize_callback is the new stuff.

hyuDev · 2021-05-13T02:47:20Z

HI I got a error msg( mlagents_envs.exception.UnityEnvironmentException: Environment shut down with return code -6 (SIGABRT).)

the visualpushblock example runs macos but not running windows and ubuntu.
all setting is same, only gpu diffierent.
visual example run only cpu env?

Chris Elion added 3 commits January 21, 2021 17:52

Warn if null graphics, dont render in GridWorld if null graphics

4906a82

log cmdline args, don't overwrite -logfile arg

152c09e

make sure process is still alive while waiting for connection timeout

d0c548c

chriselion commented Jan 22, 2021

View reviewed changes

ml-agents-envs/mlagents_envs/environment.py Outdated Show resolved Hide resolved

chriselion commented Jan 22, 2021

View reviewed changes

chriselion requested review from hvpeteet and vincentpierre January 22, 2021 19:33

hvpeteet approved these changes Jan 22, 2021

View reviewed changes

Chris Elion added 3 commits January 22, 2021 15:49

deadline and rename proc1

61cc3f6

changelog

2cdd68b

add unit tests

cb001e6

chriselion commented Jan 23, 2021

View reviewed changes

chriselion merged commit 30bcb01 into master Jan 23, 2021

delete-merged-branch bot deleted the MLA-1712-no-graphics-crash branch January 23, 2021 01:22

chriselion mentioned this pull request May 14, 2021

Environment shut down with return code #5357

Closed

Unity-Technologies locked as resolved and limited conversation to collaborators May 14, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[MLA-1712] Make UnityEnvironment fail fast if the env crashes #4880

[MLA-1712] Make UnityEnvironment fail fast if the env crashes #4880

Uh oh!

chriselion commented Jan 22, 2021 •

edited

Loading

Uh oh!

chriselion Jan 22, 2021

Uh oh!

chriselion Jan 22, 2021

Uh oh!

Uh oh!

chriselion Jan 22, 2021

Uh oh!

chriselion Jan 22, 2021

Uh oh!

chriselion Jan 22, 2021

Uh oh!

hvpeteet Jan 22, 2021

Uh oh!

chriselion Jan 22, 2021

Uh oh!

chriselion commented Jan 22, 2021

Uh oh!

hvpeteet left a comment

Uh oh!

hvpeteet Jan 22, 2021

Uh oh!

Uh oh!

Uh oh!

chriselion Jan 23, 2021

Uh oh!

hyuDev commented May 13, 2021

Uh oh!

Uh oh!

[MLA-1712] Make UnityEnvironment fail fast if the env crashes #4880

[MLA-1712] Make UnityEnvironment fail fast if the env crashes #4880

Uh oh!

Conversation

chriselion commented Jan 22, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Proposed change(s)

Old Behavior

New Behavior

Useful links (Github issues, JIRA tickets, ML-Agents forum threads etc.)

Types of change(s)

Checklist

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chriselion commented Jan 22, 2021

Uh oh!

hvpeteet left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hyuDev commented May 13, 2021

Uh oh!

Uh oh!

chriselion commented Jan 22, 2021 •

edited

Loading