Partial #6914 job_agent separated terminating vs destroying #7337
robnagler commented Oct 30, 2024
- does not cancel the sbatch job when terminating
- job_agent _SBATCH_ID_FILE write
- job_supervisor concept of verify_status but doesn't change semantics yet
- pkcli.elegant-schema better approach to updating schema
- const.DEV_SRC_RADIASOFT_DIR
- _ComputeJob.is_destroyed unused
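A minimal sketch of what the `_SBATCH_ID_FILE` write item could look like, assuming the point is to persist the sbatch job id so it can be found again after the agent restarts. The file name value and both helper functions are hypothetical and are not taken from job_agent:

```python
import os
import tempfile
from pathlib import Path

# Hypothetical value; the real constant in job_agent may differ.
_SBATCH_ID_FILE = "sbatch.id"


def write_sbatch_id(run_dir, sbatch_id):
    """Persist the job id, replacing the file atomically."""
    p = Path(run_dir) / _SBATCH_ID_FILE
    tmp = p.with_name(p.name + ".tmp")
    tmp.write_text(f"{sbatch_id}\n")
    # os.replace is atomic on the same filesystem, so a reader never
    # sees a partially written id.
    os.replace(tmp, p)
    return p


def read_sbatch_id(run_dir):
    """Return the persisted job id as a string, or None if absent."""
    p = Path(run_dir) / _SBATCH_ID_FILE
    if not p.exists():
        return None
    return p.read_text().strip() or None


with tempfile.TemporaryDirectory() as d:
    before = read_sbatch_id(d)
    write_sbatch_id(d, 12345)
    after = read_sbatch_id(d)
```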
Some style comments. Not sure what behavior should be implemented with this PR but I saw this:
- start an sbatch job
- kill -9 supervisor # No chance to terminate so agent is left running
- restart supervisor
- login to sbatch agent
- simulation is canceled
Is this sim supposed to be canceled?
  if self._status_cb:
      self._status_cb.stop()
      self._status_cb = None
  self._start_ready.set()
- if self._sbatch_id:
+ if self._sbatch_id and not self._terminating:
Is there a reason to use such an abstract name like _terminating? I feel like we have had differences on words like this in the past. I much prefer a specific and unique name that indicates what the heck is going on, e.g. _want_kill_cmds or even _want_scancel.
Kind of depends on where you are reading. Inside terminate(), _terminating is set to True, and that cascades down several levels. The intent in terminate() is to signal that the agent is terminating. This use parallels self._destroying, which indicates the _Cmd is in destroy(). There are many conditionals that use _destroying for different purposes, but they all rely on the fact that the object is in destroy().
In this particular case, there's only one reason to know that the process is in terminate(), but there could be other reasons, and the way the value cascades in the call to terminate() makes sense (to me at least). It's also a private attribute of the class, so its use is localized, e.g. the Dispatcher doesn't refer to it.
That's the reasoning of "what the heck is going on". It's just a different perspective: from the whole process vs the specific conditional.
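The cascade described above can be sketched roughly as follows. This is an illustration under assumptions, not the job_agent code: the `Cmd` and `Agent` classes, the `terminating` keyword on `destroy()`, and the injected `scancel` callable are all hypothetical stand-ins. Only the conditional `self._sbatch_id and not self._terminating` comes from the diff.

```python
class Cmd:
    """Hypothetical stand-in for the agent's _Cmd."""

    def __init__(self, sbatch_id=None, scancel=None):
        self._sbatch_id = sbatch_id
        self._destroying = False
        self._terminating = False
        # Injected so the sketch runs without Slurm installed.
        self._scancel = scancel or (lambda job_id: None)

    def destroy(self, terminating=False):
        if self._destroying:
            return
        # _destroying says "this object is in destroy()".
        self._destroying = True
        # _terminating says "the whole agent is in terminate()".
        self._terminating = terminating
        # Same conditional as in the diff: skip scancel on terminate.
        if self._sbatch_id and not self._terminating:
            self._scancel(self._sbatch_id)


class Agent:
    """Hypothetical agent that owns several commands."""

    def __init__(self, cmds):
        self._cmds = list(cmds)

    def terminate(self):
        # The flag cascades from terminate() down into each destroy().
        for c in self._cmds:
            c.destroy(terminating=True)


canceled = []
explicit = Cmd(sbatch_id="101", scancel=canceled.append)
explicit.destroy()  # explicit destroy: the sbatch job is canceled
survivor = Cmd(sbatch_id="202", scancel=canceled.append)
Agent([survivor]).terminate()  # agent terminating: the job is left alone
print(canceled)
```

In this sketch only job "101" is passed to `scancel`; job "202" is left running even though its command has been destroyed, which is the terminating vs destroying split the PR title refers to.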
This is caused by the scancel in sbatch.py when the agent starts. This PR didn't fix that.
LGTM then. Do what you will with my style comments.