-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MLA-1712] Make UnityEnvironment fail fast if the env crashes #4880
Conversation
@@ -4,7 +4,7 @@ | |||
UnityConnectSettings: | |||
m_ObjectHideFlags: 0 | |||
serializedVersion: 1 | |||
m_Enabled: 1 | |||
m_Enabled: 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Automatically set during player build. Might as well commit it now so we don't have to keep undoing it.
|
||
# If the logfile arg isn't already set in the env args, | ||
# try to set it to an output directory | ||
logfile_set = "-logfile" in (arg.lower() for arg in self._additional_args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was annoying - if you wanted to pass e.g. --env-args -logfile -
to mlagents-learn
it would get overwritten here.
poll_res = self._proc1.poll() | ||
if poll_res is not None: | ||
exc_msg = self._returncode_to_env_message(self._proc1.returncode) | ||
raise UnityEnvironmentException(exc_msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we want to have separate exception messages for returncode 0 and non-0.
signal_name = f" ({signal_name})" if signal_name else "" | ||
return_info = f"Environment shut down with return code {self._proc1.returncode}{signal_name}." | ||
logger.info(return_info) | ||
logger.info(self._returncode_to_env_message(self._proc1.returncode)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DRY'ed this up, can revert if it's not worth it.
so that we can stop sooner and raise a more appropriate error. | ||
""" | ||
wait_time_remaining = self.timeout_wait | ||
callback_timeout_wait = self.timeout_wait // 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could make the "10" configurable, but seemed like a good rule of thumb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't really care if the timeout is configurable, but if you do make it configurable please make sure to check it is less than self.timeout_wait
@@ -189,7 +189,7 @@ def _generate_all_results() -> AllStepResult: | |||
) | |||
_send_response(EnvironmentCommand.ENV_EXITED, ex) | |||
except Exception as ex: | |||
logger.error( | |||
logger.exception( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will print the exception callstack along with the provided message.
Will add unit test and changelog this afternoon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Chris, looks great to me.
so that we can stop sooner and raise a more appropriate error. | ||
""" | ||
wait_time_remaining = self.timeout_wait | ||
callback_timeout_wait = self.timeout_wait // 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't really care if the timeout is configurable, but if you do make it configurable please make sure to check it is less than self.timeout_wait
@mock.patch.object( | ||
mlagents_envs.rpc_communicator, "UnityToExternalServicerImplementation" | ||
) | ||
def test_rpc_communicator_initialize_OK(mock_impl, mock_grpc_server): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_rpc_communicator_initialize_OK and test_rpc_communicator_initialize_timeout passed with the old functionality too; test_rpc_communicator_initialize_callback is the new stuff.
HI I got a error msg( mlagents_envs.exception.UnityEnvironmentException: Environment shut down with return code -6 (SIGABRT).) the visualpushblock example runs macos but not running windows and ubuntu. |
Proposed change(s)
Currently, if the Unity process crashes hard (e.g. SIGABRT) without cleaning up, then UnityEnvironment will
This was uncovered while looking into what happens if you use a CameraSensor with
--no-graphics
, but could happen in any other crash too.The main change here is to break the main
Connection.poll()
call inRpcCommunicator.poll_for_timeout()
into smallerpoll()
, check the subprocess status in between, and raise if the process exited. To avoidRpcCommunicator
needing to know aboutSubprocessEnvironmentManager
, this is done by passing an optionalCallable
toCommunicator
/RpcCommunicator
methods.Other smaller changes (I can break these into other PRs if desired):
-logfile
args if they're passed via--env-args
Old Behavior
(elapsed time 60 seconds, inaccurate exception message)
New Behavior
(elapsed time 6 seconds, better message)
Useful links (Github issues, JIRA tickets, ML-Agents forum threads etc.)
https://jira.unity3d.com/browse/MLA-1712
https://jira.unity3d.com/browse/MLA-1713
(no jira for the -logfile overwriting)
Types of change(s)
Checklist