resolve EpicsMotor stalls sometimes at end of move #783

prjemian · 2019-12-13T21:31:39Z

fixes #782

mrakitin

It looks fine, but I'd like to clarify a few questions (see below).

mrakitin · 2019-12-13T21:43:07Z

ophyd/epics_motor.py

+                status._finished(success=True)
+                status.done = True
+
+        self.motor_done_move.subscribe(dmov_callback)


What if we subscribe with run=False?

You also have to remember to un-subscribe this or we will have as many callbacks as we have had motor moves!

Isn't this the matching unsubscribe? self.motor_done_move.clear_sub(dmov_callback)

Good move: run=False

What about this?

self.motor_done_move.clear_sub(dmov_callback) self.motor_done_move.unsubscribe(cbid)

with:

cbid = self.motor_done_move.subscribe(dmov_callback, run=False)

mrakitin · 2019-12-13T21:59:42Z

ophyd/epics_motor.py


+        def dmov_callback(old_value, value, **kwargs):
+            self.log.debug(f'DEBUG: dmov_callback(): new={value} old={old_value}')
+            if self._started_moving and value == 1:


What is setting self._started_moving to True?

Good question. This is tangly code.

It's in the _move_changed() method:

if not self._started_moving: started = self._started_moving = (not was_moving and self._moving)

ophyd/epics_motor.py

tacaswell · 2019-12-13T22:31:09Z

Much of this logic is already in

ophyd/ophyd/epics_motor.py

Lines 224 to 271 in 3cbf73b

    
           @required_for_connection 
        
           @motor_done_move.sub_value 
        
           def _move_changed(self, timestamp=None, value=None, sub_type=None, 
        
                             **kwargs): 
        
               '''Callback from EPICS, indicating that movement status has changed''' 
        
               was_moving = self._moving 
        
               self._moving = (value != 1) 
        
               started = False 
        
               if not self._started_moving: 
        
                   started = self._started_moving = (not was_moving and self._moving) 
        
               self.log.debug('[ts=%s] %s moving: %s (value=%s)', fmt_time(timestamp), 
        
                              self, self._moving, value) 
        
               if started: 
        
                   self._run_subs(sub_type=self.SUB_START, timestamp=timestamp, 
        
                                  value=value, **kwargs) 
        
               if was_moving and not self._moving: 
        
                   success = True 
        
                   # Check if we are moving towards the low limit switch 
        
                   if self.direction_of_travel.get() == 0: 
        
                       if self.low_limit_switch.get() == 1: 
        
                           success = False 
        
                   # No, we are going to the high limit switch 
        
                   else: 
        
                       if self.high_limit_switch.get() == 1: 
        
                           success = False 
        
                   # Check the severity of the alarm field after motion is complete. 
        
                   # If there is any alarm at all warn the user, and if the alarm is 
        
                   # greater than what is tolerated, mark the move as unsuccessful 
        
                   severity = self.user_readback.alarm_severity 
        
                   if severity != AlarmSeverity.NO_ALARM: 
        
                       status = self.user_readback.alarm_status 
        
                       if severity > self.tolerated_alarm: 
        
                           self.log.error('Motion failed: %s is in an alarm state ' 
        
                                          'status=%s severity=%s', 
        
                                          self.name, status, severity) 
        
                           success = False 
        
                       else: 
        
                           self.log.warning('Motor %s raised an alarm during motion ' 
        
                                            'status=%s severity %s', 
        
                                            self.name, status, severity) 
        
                   self._done_moving(success=success, timestamp=timestamp, 
        
                                     value=value)

I suspect that this is going to interact badly with that logic.

It is subscribed a bit differently on the 1.3.x branch

ophyd/ophyd/epics_motor.py

Line 81 in 9f21ca2

self.motor_done_move.subscribe(self._move_changed)

and

ophyd/ophyd/epics_motor.py

Lines 222 to 267 in 9f21ca2

    
           def _move_changed(self, timestamp=None, value=None, sub_type=None, 
        
                             **kwargs): 
        
               '''Callback from EPICS, indicating that movement status has changed''' 
        
               was_moving = self._moving 
        
               self._moving = (value != 1) 
        
               started = False 
        
               if not self._started_moving: 
        
                   started = self._started_moving = (not was_moving and self._moving) 
        
               self.log.debug('[ts=%s] %s moving: %s (value=%s)', fmt_time(timestamp), 
        
                              self, self._moving, value) 
        
               if started: 
        
                   self._run_subs(sub_type=self.SUB_START, timestamp=timestamp, 
        
                                  value=value, **kwargs) 
        
               if was_moving and not self._moving: 
        
                   success = True 
        
                   # Check if we are moving towards the low limit switch 
        
                   if self.direction_of_travel.get() == 0: 
        
                       if self.low_limit_switch.get() == 1: 
        
                           success = False 
        
                   # No, we are going to the high limit switch 
        
                   else: 
        
                       if self.high_limit_switch.get() == 1: 
        
                           success = False 
        
                   # Check the severity of the alarm field after motion is complete. 
        
                   # If there is any alarm at all warn the user, and if the alarm is 
        
                   # greater than what is tolerated, mark the move as unsuccessful 
        
                   severity = self.user_readback.alarm_severity 
        
                   if severity != AlarmSeverity.NO_ALARM: 
        
                       status = self.user_readback.alarm_status 
        
                       if severity > self.tolerated_alarm: 
        
                           self.log.error('Motion failed: %s is in an alarm state ' 
        
                                          'status=%s severity=%s', 
        
                                          self.name, status, severity) 
        
                           success = False 
        
                       else: 
        
                           self.log.warning('Motor %s raised an alarm during motion ' 
        
                                            'status=%s severity %s', 
        
                                            self.name, status, severity) 
        
                   self._done_moving(success=success, timestamp=timestamp, 
        
                                     value=value)

tacaswell

I believe this is leaking subscriptions and will conflict with the existing logic.

danielballan

Some Python implementation notes. How were causing the status to finish before now? Is this a "whichever happens first" scenario?

ophyd/epics_motor.py

danielballan · 2019-12-13T22:41:27Z

Haha, I see three of us pounced in parallel.Thanks for the contribution, @prjemian, and good luck sifting through our feedback. 🙃

prjemian · 2019-12-14T04:21:48Z

Much of this logic is already in

ophyd/ophyd/epics_motor.py

Lines 224 to 271 in 3cbf73b

@required_for_connection

@motor_done_move.sub_value

def _move_changed(self, timestamp=None, value=None, sub_type=None,

**kwargs):

'''Callback from EPICS, indicating that movement status has changed'''

was_moving = self._moving

self._moving = (value != 1)

started = False

if not self._started_moving:

started = self._started_moving = (not was_moving and self._moving)

self.log.debug('[ts=%s] %s moving: %s (value=%s)', fmt_time(timestamp),

self, self._moving, value)

if started:

self._run_subs(sub_type=self.SUB_START, timestamp=timestamp,

value=value, **kwargs)

if was_moving and not self._moving:

success = True

# Check if we are moving towards the low limit switch

if self.direction_of_travel.get() == 0:

if self.low_limit_switch.get() == 1:

success = False

# No, we are going to the high limit switch

else:

if self.high_limit_switch.get() == 1:

success = False

# Check the severity of the alarm field after motion is complete.

# If there is any alarm at all warn the user, and if the alarm is

# greater than what is tolerated, mark the move as unsuccessful

severity = self.user_readback.alarm_severity

if severity != AlarmSeverity.NO_ALARM:

status = self.user_readback.alarm_status

if severity > self.tolerated_alarm:

self.log.error('Motion failed: %s is in an alarm state '

'status=%s severity=%s',

self.name, status, severity)

success = False

else:

self.log.warning('Motor %s raised an alarm during motion '

'status=%s severity %s',

self.name, status, severity)

self._done_moving(success=success, timestamp=timestamp,

value=value)

I suspect that this is going to interact badly with that logic.

It is subscribed a bit differently on the 1.3.x branch

ophyd/ophyd/epics_motor.py

Line 81 in 9f21ca2

self.motor_done_move.subscribe(self._move_changed)

and

ophyd/ophyd/epics_motor.py

Lines 222 to 267 in 9f21ca2

def _move_changed(self, timestamp=None, value=None, sub_type=None,

**kwargs):

'''Callback from EPICS, indicating that movement status has changed'''

was_moving = self._moving

self._moving = (value != 1)

started = False

if not self._started_moving:

started = self._started_moving = (not was_moving and self._moving)

self.log.debug('[ts=%s] %s moving: %s (value=%s)', fmt_time(timestamp),

self, self._moving, value)

if started:

self._run_subs(sub_type=self.SUB_START, timestamp=timestamp,

value=value, **kwargs)

if was_moving and not self._moving:

success = True

# Check if we are moving towards the low limit switch

if self.direction_of_travel.get() == 0:

if self.low_limit_switch.get() == 1:

success = False

# No, we are going to the high limit switch

else:

if self.high_limit_switch.get() == 1:

success = False

# Check the severity of the alarm field after motion is complete.

# If there is any alarm at all warn the user, and if the alarm is

# greater than what is tolerated, mark the move as unsuccessful

severity = self.user_readback.alarm_severity

if severity != AlarmSeverity.NO_ALARM:

status = self.user_readback.alarm_status

if severity > self.tolerated_alarm:

self.log.error('Motion failed: %s is in an alarm state '

'status=%s severity=%s',

self.name, status, severity)

success = False

else:

self.log.warning('Motor %s raised an alarm during motion '

'status=%s severity %s',

self.name, status, severity)

self._done_moving(success=success, timestamp=timestamp,

value=value)

This might be the better place for the logic in the dmov_callback(). This logic:

        was_moving = self._moving
        self._moving = (value != 1)

should change to:

        was_moving = self._moving
        self._moving = not self.motor_done_move.value

This logic is problematic:

        if was_moving and not self._moving:

since it requires the transition to declare the move complete. Suggest relocating the done_moving()` call thus:

        if self._started_moving and not self._moving:
            self._done_moving(success=success, timestamp=timestamp,
                              value=value)

prjemian · 2019-12-14T05:09:47Z

So, removing the dmov_callback() from move() and adding to the _move_changed() method as described results in the motor stall re-appearing. Where is the status message declared done? It's not directly in positioner._done_moving(). Very tangly code. Is it one of the st._finished() lines in signal.py?

prjemian · 2019-12-14T05:22:45Z

In the end, I restored the dmov_callback() to ensure the status message is finished. There is no problem with calling this more than once, the code has an early bailout for

        if self.done:
            return

Is there a better way to be certain that st._finished() gets called?

tacaswell · 2019-12-16T13:23:43Z

ophyd/epics_motor.py

+            self.log.debug('dmov_callback(): new=%d old=%d', value, old_value)
+            if self._started_moving and value == 1:
+
+                self.motor_done_move.clear_sub(dmov_callback)


Only one of these should be required.

tacaswell · 2019-12-16T13:45:13Z

since it requires the transition to declare the move complete

My understanding is that the motor record will always flash DMOV low and then high to indicate that it is moving (even if it already at it's target destination).

I am still 👎 on this as-is because it is short-circuiting a whole bunch logic in _move_changed which I assume is there for a reason. If there are cases where that logic is currently correctly not reporting the motion done, this extra callback will flip the status and erroneously report that motion is done when it is not.

Do we have a clear understanding of the source of the problem? @prjemian do you have any timing diagrams of what the various PVs are doing when it does and does not work correctly?

My current guess is that we are hitting race-conditions on some of the internal state of EpicsMotor between the code running on the monitor thread and the code running on the main thread that is causing the next m.move() execution to overlap with the previous _move_changed execution. Thus, we miss the falling edge of DMOV to indicate that we started moving for the move the user requested and then ignore the rising edge because we think that does not belong to us.

In

ophyd/ophyd/positioner.py

Lines 195 to 198 in 3cbf73b

    
           self.subscribe(status._finished, event_type=self._SUB_REQ_DONE, 
        
                          run=False) 
        
           return status

we subscribe the status object to self._SUB_REQ_DONE on the ophyd object and then in

ophyd/ophyd/positioner.py

Lines 200 to 208 in 3cbf73b

    
           def _done_moving(self, success=True, timestamp=None, value=None, **kwargs): 
        
               '''Call when motion has completed.  Runs ``SUB_DONE`` subscription.''' 
        
               if success: 
        
                   self._run_subs(sub_type=self.SUB_DONE, timestamp=timestamp, 
        
                                  value=value) 
        
               self._run_subs(sub_type=self._SUB_REQ_DONE, success=success, 
        
                              timestamp=timestamp) 
        
               self._reset_sub(self._SUB_REQ_DONE)

process those callbacks which is in turn called from

ophyd/ophyd/epics_motor.py

Lines 270 to 271 in 3cbf73b

    
           self._done_moving(success=success, timestamp=timestamp, 
        
                             value=value)

prjemian · 2019-12-16T14:21:35Z

@tacaswell : The .DMOV transitions from 1 to 0 at the start of a move and transitions from 0 to 1 at the end, after the completion of the primary move plus any additional move complexities (such as backlash corrections or retries) controlled by the motor record. See this documentation for the details: https://epics.anl.gov/bcda/synApps/motor/motorRecord.html#Fields_status

About the .DMOV field, the documentation states:

This field is set to 0 when the motor record begins a motion, and remains 0 through any retries and backlash corrections that may be required until the motor record has completely finished that motion, whereupon the field is set to 1. DMOV is guaranteed to execute and post a 1/0/1 pulse when the motor is commanded to move--even if no motion actually occurs because the motor was commanded to move to its current position.

prjemian · 2019-12-16T14:26:53Z

It is clear from the structure of the plan used to test this problem (when executed with logging level set to DEBUG):

def move(motor, label, dest, delay_s):
    yield from bps.checkpoint()
    yield from bps.mv(m1, dest)
    print(f"{label}:   {dest} {motor.position}")
    yield from bps.sleep(delay_s)

that the next move has not yet begun. If the next move command had been issued, the RunEngine logging would indicate a checkpoint message had been processed since the most recent move activity. What makes this obvious is that no checkpoint message is logged at the time the motor stalls. The logging indicates the motor has finished its most recent move yet the RunEngine is waiting.

prjemian · 2019-12-16T14:26:59Z

Is there a better way to be certain that st._finished() gets called?

prjemian · 2019-12-16T14:36:38Z

I agree that this change has implications of change for additional methods such as _move_changed() and/or _done_moving(). What tools are there to diagnose further the handling of the status objects?

tacaswell · 2019-12-16T17:12:01Z

What is the behavior of DMOV is the motor is already moving and then we instruct it to move again?

The monitor subscriptions are run on a background thread (one sourced via libca and then trampolined to python).

The conflict is between the current move and the previous move's clean up code.

prjemian · 2019-12-16T20:28:53Z

If the .VAL field is changed while the motor is moving (.DMOV = 0), the .DMOV field will remain at zero until the new move is finished.

prjemian · 2019-12-16T22:42:56Z

The monitor subscriptions are run on a background thread (one sourced via libca and then trampolined to python).

What is trampolined?

The conflict is between the current move and the previous move's clean up code.

How can I diagnose the code in the current master at this level?

prjemian · 2019-12-16T23:15:08Z

Here's an example of the timing of changes in motor VAL, MOVN, and DMOV fields using this command: camonitor sky:m1.{VAL,MOVN,DMOV}

before the move

sky:m1.VAL                     2019-12-14 06:23:54.200602 -0.1  
sky:m1.MOVN                    2019-12-14 06:23:54.200602 0  
sky:m1.DMOV                    2019-12-14 06:23:54.200602 1

step by 1.0

sky:m1.DMOV                    2019-12-14 06:23:54.200602 0  
sky:m1.VAL                     2019-12-16 17:07:04.776139 0.9  
sky:m1.MOVN                    2019-12-16 17:07:04.776439 1  
sky:m1.MOVN                    2019-12-16 17:07:06.048334 0  
sky:m1.DMOV                    2019-12-16 17:07:06.048334 1

step by 1.0, then step by 1.0 during the first move

sky:m1.DMOV                    2019-12-16 17:07:06.048334 0  
sky:m1.VAL                     2019-12-16 17:07:23.618395 1.9  
sky:m1.MOVN                    2019-12-16 17:07:23.620621 1  
sky:m1.VAL                     2019-12-16 17:07:24.163468 2.9  
sky:m1.VAL                     2019-12-16 17:07:24.840445 2.9  
sky:m1.MOVN                    2019-12-16 17:07:24.840445 0  
sky:m1.MOVN                    2019-12-16 17:07:24.840492 1  
sky:m1.MOVN                    2019-12-16 17:07:26.143724 0  
sky:m1.DMOV                    2019-12-16 17:07:26.143724 1

step by -1.0 (with backlash correction)

sky:m1.DMOV                    2019-12-16 17:11:47.487712 0  
sky:m1.VAL                     2019-12-16 17:12:13.615191 1.9  
sky:m1.MOVN                    2019-12-16 17:12:13.615351 1  
sky:m1.MOVN                    2019-12-16 17:12:14.883530 0  
sky:m1.MOVN                    2019-12-16 17:12:14.883580 1  
sky:m1.MOVN                    2019-12-16 17:12:15.184805 0  
sky:m1.DMOV                    2019-12-16 17:12:15.184805 1

step by -1.0 (with backlash correction), then step by -1.0 during the first move

sky:m1.DMOV                    2019-12-16 17:12:15.184805 0  
sky:m1.VAL                     2019-12-16 17:12:26.397074 0.9  
sky:m1.MOVN                    2019-12-16 17:12:26.397250 1  
sky:m1.VAL                     2019-12-16 17:12:26.997775 -0.1  
sky:m1.VAL                     2019-12-16 17:12:27.622810 -0.1  
sky:m1.MOVN                    2019-12-16 17:12:27.622810 0  
sky:m1.DMOV                    2019-12-16 17:12:27.622810 0  
sky:m1.MOVN                    2019-12-16 17:12:27.622865 1  
sky:m1.MOVN                    2019-12-16 17:12:28.925954 0  
sky:m1.MOVN                    2019-12-16 17:12:28.926168 1  
sky:m1.MOVN                    2019-12-16 17:12:29.226441 0  
sky:m1.DMOV                    2019-12-16 17:12:29.226441 1

tacaswell · 2019-12-17T14:13:38Z

What is trampolined?

The code for this now lives in https://github.com/bluesky/ophyd/blob/master/ophyd/_dispatch.py . Due to how contexts work in libca/pyepics (this is the c-level structure that holds the various sockets / threads etc required to handle the asynchronous part of the epics protocol and manages the socket <-> virtual circuit <-> channel mapping) running python code in response to a monitor that does further caput/caget will fail. I do not remember exactly why this is off the top of my head (maybe @klauer can remind me ;) ). Instead the function we actually register with pyepics just "bounces" the values to a second queue that we process in a python thread which gets around this problem.

I think we just have to start putting in logging statements and sort out the timing between the things happening on different threads and at the IOC level. I don't know any shortcut to tracking down race conditions, but at least we have a way to reliably reproduce this one!

prjemian · 2019-12-17T15:23:54Z

Thanks for the explanation, @tacaswell ! The next thing to do is to return to the master branch and begin diagnosis with that code, to determine what needs to be changed so the motor stall condition does not recur. I'll pause further work on this branch until that diagnosis is fruitful.

prjemian · 2019-12-17T23:08:24Z

Will make a VM image available to the DAMA team.

prjemian · 2019-12-22T20:00:52Z

Do we have a clear understanding of the source of the problem? @prjemian do you have any timing diagrams of what the various PVs are doing when it does and does not work correctly?

Virtual Circuit Unresponsive from EPICS libca: See the three adjacent comments starting with: #787 (comment)

prjemian · 2020-04-04T14:24:49Z

Possibly related to #788? Will test new ophyd release 1.4.0 to see if this remains a problem.

prjemian · 2020-07-22T18:47:38Z

Need to stress test this with current release to see if still a problem.

prjemian · 2023-03-29T21:08:46Z

dropping this branch & PR

fixes #782

33e3d38

prjemian added the bug label Dec 13, 2019

prjemian requested review from tacaswell, danielballan, RussBerg and mrakitin December 13, 2019 21:31

mrakitin reviewed Dec 13, 2019

View reviewed changes

tacaswell requested changes Dec 13, 2019

View reviewed changes

danielballan reviewed Dec 13, 2019

View reviewed changes

ophyd/epics_motor.py Outdated Show resolved Hide resolved

ophyd/epics_motor.py Outdated Show resolved Hide resolved

ophyd/epics_motor.py Outdated Show resolved Hide resolved

MNT #782 per review comments

39a0a43

prjemian added 2 commits December 13, 2019 23:17

MNT #782 per review, finish the status

a3aeb27

MNT #782 remove commented code

e3890c8

tacaswell reviewed Dec 16, 2019

View reviewed changes

prjemian mentioned this pull request Dec 17, 2019

ophyd.EpicsMotor stalls sometimes. Why? prjemian/bstesting#1

Open

prjemian mentioned this pull request Jan 6, 2020

re-install latest production release of ophyd aps-8id-dys/ipython-8idiuser#123

Closed

prjemian added the wontfix label Mar 29, 2023

prjemian closed this Mar 29, 2023

prjemian deleted the 782-stalled-EpicsMotor branch March 29, 2023 21:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

resolve EpicsMotor stalls sometimes at end of move #783

resolve EpicsMotor stalls sometimes at end of move #783

prjemian commented Dec 13, 2019

mrakitin left a comment

mrakitin Dec 13, 2019

tacaswell Dec 13, 2019

prjemian Dec 14, 2019

prjemian Dec 14, 2019

prjemian Dec 14, 2019

prjemian Dec 14, 2019

mrakitin Dec 13, 2019

prjemian Dec 14, 2019

prjemian Dec 14, 2019

tacaswell commented Dec 13, 2019

tacaswell left a comment

danielballan left a comment

danielballan commented Dec 13, 2019

prjemian commented Dec 14, 2019 •

edited

Loading

prjemian commented Dec 14, 2019

prjemian commented Dec 14, 2019

tacaswell Dec 16, 2019

tacaswell commented Dec 16, 2019

prjemian commented Dec 16, 2019 •

edited

Loading

prjemian commented Dec 16, 2019

prjemian commented Dec 16, 2019

prjemian commented Dec 16, 2019

tacaswell commented Dec 16, 2019

prjemian commented Dec 16, 2019

prjemian commented Dec 16, 2019

prjemian commented Dec 16, 2019

tacaswell commented Dec 17, 2019

prjemian commented Dec 17, 2019

prjemian commented Dec 17, 2019

prjemian commented Dec 22, 2019

prjemian commented Apr 4, 2020

prjemian commented Jul 22, 2020

prjemian commented Mar 29, 2023

resolve EpicsMotor stalls sometimes at end of move #783

resolve EpicsMotor stalls sometimes at end of move #783

Conversation

prjemian commented Dec 13, 2019

mrakitin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tacaswell commented Dec 13, 2019

tacaswell left a comment

Choose a reason for hiding this comment

danielballan left a comment

Choose a reason for hiding this comment

danielballan commented Dec 13, 2019

prjemian commented Dec 14, 2019 • edited Loading

prjemian commented Dec 14, 2019

prjemian commented Dec 14, 2019

Choose a reason for hiding this comment

tacaswell commented Dec 16, 2019

prjemian commented Dec 16, 2019 • edited Loading

prjemian commented Dec 16, 2019

prjemian commented Dec 16, 2019

prjemian commented Dec 16, 2019

tacaswell commented Dec 16, 2019

prjemian commented Dec 16, 2019

prjemian commented Dec 16, 2019

prjemian commented Dec 16, 2019

tacaswell commented Dec 17, 2019

prjemian commented Dec 17, 2019

prjemian commented Dec 17, 2019

prjemian commented Dec 22, 2019

prjemian commented Apr 4, 2020

prjemian commented Jul 22, 2020

prjemian commented Mar 29, 2023

prjemian commented Dec 14, 2019 •

edited

Loading

prjemian commented Dec 16, 2019 •

edited

Loading