Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resolve EpicsMotor stalls sometimes at end of move #783

Closed
wants to merge 4 commits into from

Conversation

prjemian
Copy link
Contributor

fixes #782

Copy link
Member

@mrakitin mrakitin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks fine, but I'd like to clarify a few questions (see below).

status._finished(success=True)
status.done = True

self.motor_done_move.subscribe(dmov_callback)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we subscribe with run=False?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also have to remember to un-subscribe this or we will have as many callbacks as we have had motor moves!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this the matching unsubscribe? self.motor_done_move.clear_sub(dmov_callback)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good move: run=False

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about this?

                self.motor_done_move.clear_sub(dmov_callback)
                self.motor_done_move.unsubscribe(cbid)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with:

        cbid = self.motor_done_move.subscribe(dmov_callback, run=False)


def dmov_callback(old_value, value, **kwargs):
self.log.debug(f'DEBUG: dmov_callback(): new={value} old={old_value}')
if self._started_moving and value == 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is setting self._started_moving to True?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. This is tangly code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's in the _move_changed() method:

        if not self._started_moving:
            started = self._started_moving = (not was_moving and self._moving)

ophyd/epics_motor.py Outdated Show resolved Hide resolved
@tacaswell
Copy link
Contributor

Much of this logic is already in

ophyd/ophyd/epics_motor.py

Lines 224 to 271 in 3cbf73b

@required_for_connection
@motor_done_move.sub_value
def _move_changed(self, timestamp=None, value=None, sub_type=None,
**kwargs):
'''Callback from EPICS, indicating that movement status has changed'''
was_moving = self._moving
self._moving = (value != 1)
started = False
if not self._started_moving:
started = self._started_moving = (not was_moving and self._moving)
self.log.debug('[ts=%s] %s moving: %s (value=%s)', fmt_time(timestamp),
self, self._moving, value)
if started:
self._run_subs(sub_type=self.SUB_START, timestamp=timestamp,
value=value, **kwargs)
if was_moving and not self._moving:
success = True
# Check if we are moving towards the low limit switch
if self.direction_of_travel.get() == 0:
if self.low_limit_switch.get() == 1:
success = False
# No, we are going to the high limit switch
else:
if self.high_limit_switch.get() == 1:
success = False
# Check the severity of the alarm field after motion is complete.
# If there is any alarm at all warn the user, and if the alarm is
# greater than what is tolerated, mark the move as unsuccessful
severity = self.user_readback.alarm_severity
if severity != AlarmSeverity.NO_ALARM:
status = self.user_readback.alarm_status
if severity > self.tolerated_alarm:
self.log.error('Motion failed: %s is in an alarm state '
'status=%s severity=%s',
self.name, status, severity)
success = False
else:
self.log.warning('Motor %s raised an alarm during motion '
'status=%s severity %s',
self.name, status, severity)
self._done_moving(success=success, timestamp=timestamp,
value=value)

I suspect that this is going to interact badly with that logic.


It is subscribed a bit differently on the 1.3.x branch

self.motor_done_move.subscribe(self._move_changed)
and

ophyd/ophyd/epics_motor.py

Lines 222 to 267 in 9f21ca2

def _move_changed(self, timestamp=None, value=None, sub_type=None,
**kwargs):
'''Callback from EPICS, indicating that movement status has changed'''
was_moving = self._moving
self._moving = (value != 1)
started = False
if not self._started_moving:
started = self._started_moving = (not was_moving and self._moving)
self.log.debug('[ts=%s] %s moving: %s (value=%s)', fmt_time(timestamp),
self, self._moving, value)
if started:
self._run_subs(sub_type=self.SUB_START, timestamp=timestamp,
value=value, **kwargs)
if was_moving and not self._moving:
success = True
# Check if we are moving towards the low limit switch
if self.direction_of_travel.get() == 0:
if self.low_limit_switch.get() == 1:
success = False
# No, we are going to the high limit switch
else:
if self.high_limit_switch.get() == 1:
success = False
# Check the severity of the alarm field after motion is complete.
# If there is any alarm at all warn the user, and if the alarm is
# greater than what is tolerated, mark the move as unsuccessful
severity = self.user_readback.alarm_severity
if severity != AlarmSeverity.NO_ALARM:
status = self.user_readback.alarm_status
if severity > self.tolerated_alarm:
self.log.error('Motion failed: %s is in an alarm state '
'status=%s severity=%s',
self.name, status, severity)
success = False
else:
self.log.warning('Motor %s raised an alarm during motion '
'status=%s severity %s',
self.name, status, severity)
self._done_moving(success=success, timestamp=timestamp,
value=value)

Copy link
Contributor

@tacaswell tacaswell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is leaking subscriptions and will conflict with the existing logic.

Copy link
Member

@danielballan danielballan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some Python implementation notes. How were causing the status to finish before now? Is this a "whichever happens first" scenario?

ophyd/epics_motor.py Outdated Show resolved Hide resolved
ophyd/epics_motor.py Outdated Show resolved Hide resolved
ophyd/epics_motor.py Outdated Show resolved Hide resolved
@danielballan
Copy link
Member

Haha, I see three of us pounced in parallel.Thanks for the contribution, @prjemian, and good luck sifting through our feedback. 🙃

@prjemian
Copy link
Contributor Author

prjemian commented Dec 14, 2019

Much of this logic is already in

ophyd/ophyd/epics_motor.py

Lines 224 to 271 in 3cbf73b

@required_for_connection
@motor_done_move.sub_value
def _move_changed(self, timestamp=None, value=None, sub_type=None,
**kwargs):
'''Callback from EPICS, indicating that movement status has changed'''
was_moving = self._moving
self._moving = (value != 1)
started = False
if not self._started_moving:
started = self._started_moving = (not was_moving and self._moving)
self.log.debug('[ts=%s] %s moving: %s (value=%s)', fmt_time(timestamp),
self, self._moving, value)
if started:
self._run_subs(sub_type=self.SUB_START, timestamp=timestamp,
value=value, **kwargs)
if was_moving and not self._moving:
success = True
# Check if we are moving towards the low limit switch
if self.direction_of_travel.get() == 0:
if self.low_limit_switch.get() == 1:
success = False
# No, we are going to the high limit switch
else:
if self.high_limit_switch.get() == 1:
success = False
# Check the severity of the alarm field after motion is complete.
# If there is any alarm at all warn the user, and if the alarm is
# greater than what is tolerated, mark the move as unsuccessful
severity = self.user_readback.alarm_severity
if severity != AlarmSeverity.NO_ALARM:
status = self.user_readback.alarm_status
if severity > self.tolerated_alarm:
self.log.error('Motion failed: %s is in an alarm state '
'status=%s severity=%s',
self.name, status, severity)
success = False
else:
self.log.warning('Motor %s raised an alarm during motion '
'status=%s severity %s',
self.name, status, severity)
self._done_moving(success=success, timestamp=timestamp,
value=value)

I suspect that this is going to interact badly with that logic.

It is subscribed a bit differently on the 1.3.x branch

self.motor_done_move.subscribe(self._move_changed)

and

ophyd/ophyd/epics_motor.py

Lines 222 to 267 in 9f21ca2

def _move_changed(self, timestamp=None, value=None, sub_type=None,
**kwargs):
'''Callback from EPICS, indicating that movement status has changed'''
was_moving = self._moving
self._moving = (value != 1)
started = False
if not self._started_moving:
started = self._started_moving = (not was_moving and self._moving)
self.log.debug('[ts=%s] %s moving: %s (value=%s)', fmt_time(timestamp),
self, self._moving, value)
if started:
self._run_subs(sub_type=self.SUB_START, timestamp=timestamp,
value=value, **kwargs)
if was_moving and not self._moving:
success = True
# Check if we are moving towards the low limit switch
if self.direction_of_travel.get() == 0:
if self.low_limit_switch.get() == 1:
success = False
# No, we are going to the high limit switch
else:
if self.high_limit_switch.get() == 1:
success = False
# Check the severity of the alarm field after motion is complete.
# If there is any alarm at all warn the user, and if the alarm is
# greater than what is tolerated, mark the move as unsuccessful
severity = self.user_readback.alarm_severity
if severity != AlarmSeverity.NO_ALARM:
status = self.user_readback.alarm_status
if severity > self.tolerated_alarm:
self.log.error('Motion failed: %s is in an alarm state '
'status=%s severity=%s',
self.name, status, severity)
success = False
else:
self.log.warning('Motor %s raised an alarm during motion '
'status=%s severity %s',
self.name, status, severity)
self._done_moving(success=success, timestamp=timestamp,
value=value)

This might be the better place for the logic in the dmov_callback(). This logic:

        was_moving = self._moving
        self._moving = (value != 1)

should change to:

        was_moving = self._moving
        self._moving = not self.motor_done_move.value

This logic is problematic:

        if was_moving and not self._moving:

since it requires the transition to declare the move complete. Suggest relocating the done_moving()` call thus:

        if self._started_moving and not self._moving:
            self._done_moving(success=success, timestamp=timestamp,
                              value=value)

@prjemian
Copy link
Contributor Author

So, removing the dmov_callback() from move() and adding to the _move_changed() method as described results in the motor stall re-appearing. Where is the status message declared done? It's not directly in positioner._done_moving(). Very tangly code. Is it one of the st._finished() lines in signal.py?

@prjemian
Copy link
Contributor Author

In the end, I restored the dmov_callback() to ensure the status message is finished. There is no problem with calling this more than once, the code has an early bailout for

        if self.done:
            return

Is there a better way to be certain that st._finished() gets called?

self.log.debug('dmov_callback(): new=%d old=%d', value, old_value)
if self._started_moving and value == 1:

self.motor_done_move.clear_sub(dmov_callback)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one of these should be required.

@tacaswell
Copy link
Contributor

since it requires the transition to declare the move complete

My understanding is that the motor record will always flash DMOV low and then high to indicate that it is moving (even if it already at it's target destination).

I am still 👎 on this as-is because it is short-circuiting a whole bunch logic in _move_changed which I assume is there for a reason. If there are cases where that logic is currently correctly not reporting the motion done, this extra callback will flip the status and erroneously report that motion is done when it is not.

Do we have a clear understanding of the source of the problem? @prjemian do you have any timing diagrams of what the various PVs are doing when it does and does not work correctly?

My current guess is that we are hitting race-conditions on some of the internal state of EpicsMotor between the code running on the monitor thread and the code running on the main thread that is causing the next m.move() execution to overlap with the previous _move_changed execution. Thus, we miss the falling edge of DMOV to indicate that we started moving for the move the user requested and then ignore the rising edge because we think that does not belong to us.


In

ophyd/ophyd/positioner.py

Lines 195 to 198 in 3cbf73b

self.subscribe(status._finished, event_type=self._SUB_REQ_DONE,
run=False)
return status

we subscribe the status object to self._SUB_REQ_DONE on the ophyd object and then in

ophyd/ophyd/positioner.py

Lines 200 to 208 in 3cbf73b

def _done_moving(self, success=True, timestamp=None, value=None, **kwargs):
'''Call when motion has completed. Runs ``SUB_DONE`` subscription.'''
if success:
self._run_subs(sub_type=self.SUB_DONE, timestamp=timestamp,
value=value)
self._run_subs(sub_type=self._SUB_REQ_DONE, success=success,
timestamp=timestamp)
self._reset_sub(self._SUB_REQ_DONE)
process those callbacks which is in turn called from

ophyd/ophyd/epics_motor.py

Lines 270 to 271 in 3cbf73b

self._done_moving(success=success, timestamp=timestamp,
value=value)

@prjemian
Copy link
Contributor Author

prjemian commented Dec 16, 2019

@tacaswell : The .DMOV transitions from 1 to 0 at the start of a move and transitions from 0 to 1 at the end, after the completion of the primary move plus any additional move complexities (such as backlash corrections or retries) controlled by the motor record. See this documentation for the details: https://epics.anl.gov/bcda/synApps/motor/motorRecord.html#Fields_status

About the .DMOV field, the documentation states:

This field is set to 0 when the motor record begins a motion, and remains 0 through any retries and backlash corrections that may be required until the motor record has completely finished that motion, whereupon the field is set to 1. DMOV is guaranteed to execute and post a 1/0/1 pulse when the motor is commanded to move--even if no motion actually occurs because the motor was commanded to move to its current position.

@prjemian
Copy link
Contributor Author

It is clear from the structure of the plan used to test this problem (when executed with logging level set to DEBUG):

def move(motor, label, dest, delay_s):
    yield from bps.checkpoint()
    yield from bps.mv(m1, dest)
    print(f"{label}:   {dest} {motor.position}")
    yield from bps.sleep(delay_s)

that the next move has not yet begun. If the next move command had been issued, the RunEngine logging would indicate a checkpoint message had been processed since the most recent move activity. What makes this obvious is that no checkpoint message is logged at the time the motor stalls. The logging indicates the motor has finished its most recent move yet the RunEngine is waiting.

@prjemian
Copy link
Contributor Author

Is there a better way to be certain that st._finished() gets called?

@prjemian
Copy link
Contributor Author

I agree that this change has implications of change for additional methods such as _move_changed() and/or _done_moving(). What tools are there to diagnose further the handling of the status objects?

@tacaswell
Copy link
Contributor

What is the behavior of DMOV is the motor is already moving and then we instruct it to move again?

The monitor subscriptions are run on a background thread (one sourced via libca and then trampolined to python).

The conflict is between the current move and the previous move's clean up code.

@prjemian
Copy link
Contributor Author

If the .VAL field is changed while the motor is moving (.DMOV = 0), the .DMOV field will remain at zero until the new move is finished.

@prjemian
Copy link
Contributor Author

The monitor subscriptions are run on a background thread (one sourced via libca and then trampolined to python).

What is trampolined?

The conflict is between the current move and the previous move's clean up code.

How can I diagnose the code in the current master at this level?

@prjemian
Copy link
Contributor Author

Here's an example of the timing of changes in motor VAL, MOVN, and DMOV fields using this command: camonitor sky:m1.{VAL,MOVN,DMOV}

before the move

sky:m1.VAL                     2019-12-14 06:23:54.200602 -0.1  
sky:m1.MOVN                    2019-12-14 06:23:54.200602 0  
sky:m1.DMOV                    2019-12-14 06:23:54.200602 1  

step by 1.0

sky:m1.DMOV                    2019-12-14 06:23:54.200602 0  
sky:m1.VAL                     2019-12-16 17:07:04.776139 0.9  
sky:m1.MOVN                    2019-12-16 17:07:04.776439 1  
sky:m1.MOVN                    2019-12-16 17:07:06.048334 0  
sky:m1.DMOV                    2019-12-16 17:07:06.048334 1

step by 1.0, then step by 1.0 during the first move

sky:m1.DMOV                    2019-12-16 17:07:06.048334 0  
sky:m1.VAL                     2019-12-16 17:07:23.618395 1.9  
sky:m1.MOVN                    2019-12-16 17:07:23.620621 1  
sky:m1.VAL                     2019-12-16 17:07:24.163468 2.9  
sky:m1.VAL                     2019-12-16 17:07:24.840445 2.9  
sky:m1.MOVN                    2019-12-16 17:07:24.840445 0  
sky:m1.MOVN                    2019-12-16 17:07:24.840492 1  
sky:m1.MOVN                    2019-12-16 17:07:26.143724 0  
sky:m1.DMOV                    2019-12-16 17:07:26.143724 1

step by -1.0 (with backlash correction)

sky:m1.DMOV                    2019-12-16 17:11:47.487712 0  
sky:m1.VAL                     2019-12-16 17:12:13.615191 1.9  
sky:m1.MOVN                    2019-12-16 17:12:13.615351 1  
sky:m1.MOVN                    2019-12-16 17:12:14.883530 0  
sky:m1.MOVN                    2019-12-16 17:12:14.883580 1  
sky:m1.MOVN                    2019-12-16 17:12:15.184805 0  
sky:m1.DMOV                    2019-12-16 17:12:15.184805 1  

step by -1.0 (with backlash correction), then step by -1.0 during the first move

sky:m1.DMOV                    2019-12-16 17:12:15.184805 0  
sky:m1.VAL                     2019-12-16 17:12:26.397074 0.9  
sky:m1.MOVN                    2019-12-16 17:12:26.397250 1  
sky:m1.VAL                     2019-12-16 17:12:26.997775 -0.1  
sky:m1.VAL                     2019-12-16 17:12:27.622810 -0.1  
sky:m1.MOVN                    2019-12-16 17:12:27.622810 0  
sky:m1.DMOV                    2019-12-16 17:12:27.622810 0  
sky:m1.MOVN                    2019-12-16 17:12:27.622865 1  
sky:m1.MOVN                    2019-12-16 17:12:28.925954 0  
sky:m1.MOVN                    2019-12-16 17:12:28.926168 1  
sky:m1.MOVN                    2019-12-16 17:12:29.226441 0  
sky:m1.DMOV                    2019-12-16 17:12:29.226441 1  

@tacaswell
Copy link
Contributor

What is trampolined?

The code for this now lives in https://github.com/bluesky/ophyd/blob/master/ophyd/_dispatch.py . Due to how contexts work in libca/pyepics (this is the c-level structure that holds the various sockets / threads etc required to handle the asynchronous part of the epics protocol and manages the socket <-> virtual circuit <-> channel mapping) running python code in response to a monitor that does further caput/caget will fail. I do not remember exactly why this is off the top of my head (maybe @klauer can remind me ;) ). Instead the function we actually register with pyepics just "bounces" the values to a second queue that we process in a python thread which gets around this problem.


I think we just have to start putting in logging statements and sort out the timing between the things happening on different threads and at the IOC level. I don't know any shortcut to tracking down race conditions, but at least we have a way to reliably reproduce this one!

@prjemian
Copy link
Contributor Author

Thanks for the explanation, @tacaswell ! The next thing to do is to return to the master branch and begin diagnosis with that code, to determine what needs to be changed so the motor stall condition does not recur. I'll pause further work on this branch until that diagnosis is fruitful.

@prjemian
Copy link
Contributor Author

Will make a VM image available to the DAMA team.

@prjemian
Copy link
Contributor Author

Do we have a clear understanding of the source of the problem? @prjemian do you have any timing diagrams of what the various PVs are doing when it does and does not work correctly?

Virtual Circuit Unresponsive from EPICS libca: See the three adjacent comments starting with: #787 (comment)

@prjemian
Copy link
Contributor Author

prjemian commented Apr 4, 2020

Possibly related to #788? Will test new ophyd release 1.4.0 to see if this remains a problem.

@prjemian
Copy link
Contributor Author

Need to stress test this with current release to see if still a problem.

@prjemian
Copy link
Contributor Author

dropping this branch & PR

@prjemian prjemian closed this Mar 29, 2023
@prjemian prjemian deleted the 782-stalled-EpicsMotor branch March 29, 2023 21:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

motor motion fails to complete #1
4 participants