Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

run_and_validate() in HWP Supervisor doesn't check command ran successfully #631

Open
BrianJKoopman opened this issue Feb 15, 2024 · 6 comments
Assignees
Labels
bug Something isn't working

Comments

@BrianJKoopman
Copy link
Member

I noticed that run_and_validate() doesn't actually "validate" that the command ran successfully, just that the initial OCS call goes through.

As written, it looks like Tasks could fail and the action would just move on to the next task call. session['success'] would need to be checked to "validate" each task is complete.

Originally posted by @BrianJKoopman in #628 (comment)

@BrianJKoopman BrianJKoopman added the bug Something isn't working label Feb 15, 2024
@ykyohei
Copy link
Contributor

ykyohei commented Apr 16, 2024

I think this is related to this issue.
I pasted the two examples error log in satp3 below.

Sometimes supervisor functions fail, and failure of supervisor task do not stop the schedule.
It is difficult to notice the error right now.
I think it's better to stop the schedule when the supervisor task fails.

2024-04-13T11:03:31+0000 pmx_off:3 Status is now "done".
2024-04-13T11:03:31+0000 pmx_off:3 Completed with state: ControlState.Error(traceback='Traceback (most recent call last):\n  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 755, in update\n    self.run_and_validate(clients.pid.declare_freq,\n  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 633, in run_and_validate\n    raise ControlClientError("op-wait timed out")\nocs.client_http.ControlClientError: op-wait timed out\n', start_time=1713006210.634159)

2024-04-13T11:03:30+0000 Setting state: ControlState.Error(traceback='Traceback (most recent call last):\n  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 755, in update\n    self.run_and_validate(clients.pid.declare_freq,\n  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 633, in run_and_validate\n    raise ControlClientError("op-wait timed out")\nocs.client_http.ControlClientError: op-wait timed out\n', start_time=1713006210.634159)
ocs.client_http.ControlClientError: op-wait timed out
    raise ControlClientError("op-wait timed out")
  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 633, in run_and_validate
    self.run_and_validate(clients.pid.declare_freq,
  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 755, in update
Traceback (most recent call last):
2024-04-13T11:03:30+0000 Error updating state:
2024-04-13T11:03:00+0000 Starting op: name=declare_freq, kwargs={'freq': 0}
2024-04-13T11:03:00+0000 Completed op: name=set_off, success=True, kwargs={}
2024-04-13T11:02:59+0000 Starting op: name=set_off, kwargs={}
2024-04-13T11:02:59+0000 Setting state: ControlState.PmxOff(success=True)
2024-04-16T15:28:38+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp3.hwp-pid.ops>'], {}]
2024-04-16T15:28:36+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp3.hwp-pid.ops>'], {}]
2024-04-16T15:28:34+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp3.hwp-pid.ops>'], {}]
2024-04-16T15:28:32+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp3.hwp-pid.ops>'], {}]
2024-04-16T15:28:30+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp3.hwp-pid.ops>'], {}]
2024-04-16T15:21:25+0000 pid_to_freq:4 Status is now "done".

2024-04-16T15:21:25+0000 pid_to_freq:4 Completed with state: ControlState.Error(traceback='Traceback (most recent call last):\n  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 653, in update\n    self.run_and_validate(clients.pid.set_direction,\n  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 633, in run_and_validate\n    raise ControlClientError("op-wait timed out")\nocs.client_http.ControlClientError: op-wait timed out\n', start_time=1713280884.8706326)
2024-04-16T15:21:24+0000 Setting state: ControlState.Error(traceback='Traceback (most recent call last):\n  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 653, in update\n    self.run_and_validate(clients.pid.set_direction,\n  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 633, in run_and_validate\n    raise ControlClientError("op-wait timed out")\nocs.client_http.ControlClientError: op-wait timed out\n', start_time=1713280884.8706326)
ocs.client_http.ControlClientError: op-wait timed out
    raise ControlClientError("op-wait timed out")
  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 633, in run_and_validate
    self.run_and_validate(clients.pid.set_direction,
  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 653, in update
Traceback (most recent call last):
2024-04-16T15:21:24+0000 Error updating state:
2024-04-16T15:20:54+0000 Starting op: name=set_direction, kwargs={'direction': '0'}
2024-04-16T15:20:54+0000 Setting state: ControlState.PIDToFreq(target_freq=2.0, direction='0', freq_tol=0.05, freq_tol_duration=10.0)
2024-04-16T15:20:54+0000 pid_to_freq:4 Status is now "starting".
2024-04-16T15:20:54+0000 start called for pid_to_freq

@ykyohei
Copy link
Contributor

ykyohei commented Apr 16, 2024

Once we fix this issue, I think I can test this in satp3 by intentionally turn off one of the hwp-agent and run the supervisor task.

@jlashner
Copy link
Collaborator

Hi Kyohei, do you mind posting the code with how you are running supervisor tasks?

If you are running it through a hwp-supervisor client as opposed to through sorunlib, you need to manually check if the session finished successfully, since a failed agent operation will not otherwise raise an error in the client script. For instance, everything in sorunlib uses this check_response function to make sure the task finished successfully.

If you are already using sorunlib directly, than we need to figure out why this is not failing, since I'm pretty sure when an operation enters an Error state like in your examples above, session['success'] should already be false.

@ykyohei
Copy link
Contributor

ykyohei commented Apr 17, 2024

Hi Jack, I was running through hwp-supervisor client as follows.

sup = OCSClient('hwp-supervisor')

sup.pid_to_freq(target_freq = 2.0)
sup.pmx_off()

I actually didn't know that it is possible to run these from sorunlib. I will try again through sorunlib.
https://github.com/simonsobs/sorunlib/blob/7170fb9b62e8c29adb34912dbbd86f9521e672eb/src/sorunlib/hwp.py
If I run these commands through sorunlib, and if one of the agent operations fails, should it raise an error and stop the schedule?
And if I run through hwp-supervisor client, I should manually do something like as follows?

_, _, session = sup.pid_to_freq(target_freq = 2.0)
assert session['success']

I'm confused about the point of this issue now. We should run the supervisor task through sorunlib, and as long as we do so, this issue that Brian points out is not a real issue?

Also, do you think we should implement gripper.grip, gripper.ungrip, supervisor.enable_driver_board, and supervisor.disable_driver_board in sorunlib for the same reason?

@jlashner
Copy link
Collaborator

yep, asserting session['success'] should work!

Brian is correctly pointing out that this needs to be asserted wherever client operations are called, but the hwp-supervisor is not checking when it calls out to other agents.

This needs to be done both in the supervisor agent, and also any client script running in nextline.

@jlashner
Copy link
Collaborator

I think implementing these functions in sorunlib is a good idea to minimize the possibility of people making mistakes when calling them, but it isn't strictly necessary to get it to run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants