Skip to content
This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

FailedStatus exception during AD_Acquire #114

Closed
prjemian opened this issue Nov 25, 2019 · 34 comments
Closed

FailedStatus exception during AD_Acquire #114

prjemian opened this issue Nov 25, 2019 · 34 comments
Labels
bug Something isn't working

Comments

@prjemian
Copy link
Collaborator

This failed (but succeeded when we just ran it again). Example of an error that comes seemingly at random.

In [1]: %run -m ad_acquire                                                                                                                                                                                              
I Mon-12:02:26 - /home/beams/8IDIUSER/.ipython-bluesky/user/ad_acquire.py

In [2]: RE(lambdadet.cam.setOperatingMode(lambdadet.cam.LAMBDA_OPERATING_MODE))                                                                                                                                         
I Mon-12:02:39 - Lambda Operating Mode switched to: 0
Out[2]: ()

In [3]: lambdadet.cam.EXT_TRIGGER                                                                                                                                                                                       
Out[3]: 0

In [4]: lambdadet.cam.EXT_TRIGGER=2                                                                                                                                                                                     

In [5]: lambdadet.cam.LAMBDA_OPERATING_MODE                                                                                                                                                                             
Out[5]: 0

In [6]: lambdadet.cam.LAMBDA_OPERATING_MODE=1                                                                                                                                                                           

In [7]: RE(AD_Acquire(lambdadet, .2, .2, 100, submit_xpcs_job=True, path="/data/2019-3/dufresne201911", file_name="X003_test"))                                                                                         
I Mon-12:03:23 - AD_Acquire starting
I Mon-12:03:23 - file_path = /data/2019-3/dufresne201911/X003_test/
/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/epics/ca.py:1459: UserWarning: ca.get('8idi:Reg2') timed out after 1.00 seconds.
  warnings.warn(msg % (name(chid), timeout))
[E 12:03:24.433 run_engine:1261] Run aborted
    Traceback (most recent call last):
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/run_engine.py", line 1127, in _run
        self._exception or resp)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 1294, in __call__
        return (yield from plan)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 1149, in baseline_wrapper
        return (yield from plan_mutator(plan, insert_baseline))
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
        msg = plan_stack[-1].throw(exception)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 790, in monitor_during_wrapper
        return (yield from plan2)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
        msg = plan_stack[-1].throw(exception)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
        msg = plan_stack[-1].throw(exception)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 848, in fly_during_wrapper
        return (yield from plan2)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
        msg = plan_stack[-1].throw(exception)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
        msg = plan_stack[-1].throw(exception)
      File "/home/beams/8IDIUSER/.ipython-bluesky/user/ad_acquire.py", line 41, in AD_Acquire
        yield from bps.mv(dm_pars.detNum, areadet.detector_number)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/plan_stubs.py", line 247, in mv
        yield Msg('wait', None, group=group)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
        inner_ret = yield msg
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
        inner_ret = yield msg
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
        inner_ret = yield msg
      [Previous line repeated 2 more times]
    bluesky.utils.FailedStatus: Status(obj=EpicsSignal(read_pv='8idi:Reg2', name='dm_pars_detNum', parent='dm_pars', value=25, timestamp=1574705003.4290993, auto_monitor=False, string=False, write_pv='8idi:Reg2', limits=False, put_complete=False), done=True, success=False)
---------------------------------------------------------------------------
FailedStatus                              Traceback (most recent call last)
<ipython-input-7-fed21def4976> in <module>
----> 1 RE(AD_Acquire(lambdadet, .2, .2, 100, submit_xpcs_job=True, path="/data/2019-3/dufresne201911", file_name="X003_test"))

~/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/run_engine.py in __call__(self, *args, **metadata_kw)
    751                     # it (unless it is a canceled error)
    752                     if exc is not None:
--> 753                         raise exc
    754 
    755             if self._interrupted:

~/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/run_engine.py in _run(self)
   1260             self._reason = str(err)
   1261             self.log.exception("Run aborted")
-> 1262             raise err
   1263         finally:
   1264             # Some done_callbacks may still be alive in other threads.

~/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/run_engine.py in _run(self)
   1125                         try:
   1126                             msg = self._plan_stack[-1].throw(
-> 1127                                 self._exception or resp)
   1128                         except Exception as e:
   1129                             # The current plan did not handle it,

~/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py in __call__(self, plan)
   1292         plan = monitor_during_wrapper(plan, self.monitors)
   1293         plan = baseline_wrapper(plan, self.baseline)
-> 1294         return (yield from plan)

~/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py in baseline_wrapper(plan, devices, name)
   1147         return (yield from plan)
   1148     else:
-> 1149         return (yield from plan_mutator(plan, insert_baseline))
   1150 
   1151 

~/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc)
     75             # if we have a stashed exception, pass it along
     76             try:
---> 77                 msg = plan_stack[-1].throw(exception)
     78             except StopIteration as e:
     79                 # discard the exhausted generator

~/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py in monitor_during_wrapper(plan, signals)
    788     plan1 = plan_mutator(plan, insert_after_open)
    789     plan2 = plan_mutator(plan1, insert_before_close)
--> 790     return (yield from plan2)
    791 
    792 

~/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc)
     75             # if we have a stashed exception, pass it along
     76             try:
---> 77                 msg = plan_stack[-1].throw(exception)
     78             except StopIteration as e:
     79                 # discard the exhausted generator

~/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc)
     75             # if we have a stashed exception, pass it along
     76             try:
---> 77                 msg = plan_stack[-1].throw(exception)
     78             except StopIteration as e:
     79                 # discard the exhausted generator

~/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py in fly_during_wrapper(plan, flyers)
    846     plan1 = plan_mutator(plan, insert_after_open)
    847     plan2 = plan_mutator(plan1, insert_before_close)
--> 848     return (yield from plan2)
    849 
    850 

~/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc)
     75             # if we have a stashed exception, pass it along
     76             try:
---> 77                 msg = plan_stack[-1].throw(exception)
     78             except StopIteration as e:
     79                 # discard the exhausted generator

~/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc)
     75             # if we have a stashed exception, pass it along
     76             try:
---> 77                 msg = plan_stack[-1].throw(exception)
     78             except StopIteration as e:
     79                 # discard the exhausted generator

~/.ipython-bluesky/user/ad_acquire.py in AD_Acquire(areadet, acquire_time, acquire_period, num_images, file_name, submit_xpcs_job, atten, path)
     39 
     40     # select the detector's number
---> 41     yield from bps.mv(dm_pars.detNum, areadet.detector_number)
     42 
     43     yield from areadet.cam.setup_modes(num_images)

~/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/plan_stubs.py in mv(*args)
    245         ret = yield Msg('set', obj, val, group=group)
    246         status_objects.append(ret)
--> 247     yield Msg('wait', None, group=group)
    248     return tuple(status_objects)
    249 

~/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc)
    192         try:
    193             # yield out the 'current message' and collect the return
--> 194             inner_ret = yield msg
    195         except GeneratorExit:
    196             # special case GeneratorExit.  We must clean up all of our plans

~/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc)
    192         try:
    193             # yield out the 'current message' and collect the return
--> 194             inner_ret = yield msg
    195         except GeneratorExit:
    196             # special case GeneratorExit.  We must clean up all of our plans

~/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc)
    192         try:
    193             # yield out the 'current message' and collect the return
--> 194             inner_ret = yield msg
    195         except GeneratorExit:
    196             # special case GeneratorExit.  We must clean up all of our plans

~/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc)
    192         try:
    193             # yield out the 'current message' and collect the return
--> 194             inner_ret = yield msg
    195         except GeneratorExit:
    196             # special case GeneratorExit.  We must clean up all of our plans

~/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc)
    192         try:
    193             # yield out the 'current message' and collect the return
--> 194             inner_ret = yield msg
    195         except GeneratorExit:
    196             # special case GeneratorExit.  We must clean up all of our plans

FailedStatus: Status(obj=EpicsSignal(read_pv='8idi:Reg2', name='dm_pars_detNum', parent='dm_pars', value=25, timestamp=1574705003.4290993, auto_monitor=False, string=False, write_pv='8idi:Reg2', limits=False, put_complete=False), done=True, success=False)
@prjemian prjemian added the bug Something isn't working label Nov 25, 2019
@prjemian
Copy link
Collaborator Author

Then succeeds:

In [8]: RE(AD_Acquire(lambdadet, .2, .2, 100, submit_xpcs_job=True, path="/data/2019-3/dufresne201911", file_name="X003_test"))                                                                                         
I Mon-12:03:48 - AD_Acquire starting
I Mon-12:03:48 - file_path = /data/2019-3/dufresne201911/X003_test/
I Mon-12:03:51 - Lambda Operating Mode switched to: 1
I Mon-12:03:51 - Lambda DataType switched to: {data_type}
I Mon-12:03:51 - Shutter will *OPEN AND CLOSE DURING* the Acquisition...
I Mon-12:03:51 - calling inner()
I Mon-12:03:51 - before update_metadata_prescan()
I Mon-12:03:51 - detNum=25, det_pars={'ccdHardwareRowSize': 516, 'ccdHardwareColSize': 1556, 'ccdxsense': 1, 'ccdzsense': 1, 'harmonic': 1, 'dpix': 0.055, 'saturation': 4095, 'adupphot': 1.0, 'efficiency': 1.0, 'gain': 1, 'blemish': 1, 'flatfield': 1, 'distortion': 0, 'parasitic': 0, 'lld': 0, 'manufacturer': 'LAMBDA'}
I Mon-12:03:51 - dm_pars.root_folder
I Mon-12:03:51 - dm_pars.user_data_folder
I Mon-12:03:51 - dm_pars.data_folder
I Mon-12:03:51 - dm_pars.datafilename
I Mon-12:03:51 - dm_pars.source_begin_datetime
I Mon-12:03:51 - Reg 121, 101-110 done
I Mon-12:03:51 - Reg 111-120 done
I Mon-12:03:51 - Reg 123-127 done
I Mon-12:03:51 - after update_metadata_prescan()
I Mon-12:03:51 - scaler should be autocounting now
I Mon-12:03:51 - before count()
Transient Scan ID: 217     Time: 2019-11-25 12:03:51
Persistent Unique Scan ID: 'e678dabb-15e3-4d3a-92ed-c999aecd7de8'
New stream: 'baseline'
Start-of-run baseline readings:
+--------------------------------+--------------------------------+
+--------------------------------+--------------------------------+
New stream: 'aps_current_monitor'
New stream: 'Atten1_monitor'
New stream: 'Atten2_monitor'
New stream: 'lakeshore_loop1_temperature_monitor'
New stream: 'lakeshore_loop1_target_monitor'
lambdadet [In progress. No progress bar available.]                                                                                                                                                                     

@prjemian
Copy link
Collaborator Author

prjemian commented Dec 4, 2019

See similar issue (prjemian/ipython-vm7#1 (comment)). This has been discussed on Nikea/Slack. @tacaswell suggested:

That looks like we dropped a monitor on the floor.
If you put a small sleep before each monition does it still fail?
My other guess is that because you are restarting right after you finished and the de bounce logic is getting fooled

If the sleep works it will be because we l it is forcing is to win the race

@prjemian
Copy link
Collaborator Author

prjemian commented Dec 4, 2019

The problem happens with motor moves, perhaps in short succession to each other. Distilling this to a short macro with an added sleep before each move (run in an offline development system):

def back_and_forth():
    i = 0
    delay_s = 1e-2
    while True:
        i += 1

        yield from bps.sleep(delay_s)
        print(f"forward: {i}")
        yield from bps.mv(m1, 1)

        yield from bps.sleep(delay_s)
        print("backward")
        yield from bps.mv(m1, -1)

observed this failure (number of back & forth iterations until motor motion stalled) rate:

delay, s failure number
0 1
1e-6 2
1e-5 3
1e-4 75
1e-3 404
1e-2 -- (200+)
1e-1 --

@prjemian
Copy link
Collaborator Author

prjemian commented Dec 4, 2019

Note:

Ouch! The PV that failed here, 8idi:Reg2, is not a motor, just an ao record. Nonetheless, this issue and prjemian/ipython-vm7#1 share a common failure point at

      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
        inner_ret = yield msg

Thus, the problem may not be in the ophyd support for epics_motor and positioner but perhaps in the EpicsSignal support, involving a status message.

@prjemian
Copy link
Collaborator Author

prjemian commented Dec 4, 2019

If that is the case, then we can modify the [back_and_forth()] to write to an EpicsSignal object instead. Should see the same problem, perhaps faster.

@prjemian
Copy link
Collaborator Author

prjemian commented Dec 4, 2019

test system 2

  1. EPICS database in file registers.db

     IOC:float1
    

    run database with softIoc -d registers.db

  2. connect with that PV in bluesky session:

pv = EpicsSignal("IOC:float1", name="pv")

def this_that(delay_s = 1e-2):
    i = 0
    while True:
        i += 1

        yield from bps.sleep(delay_s)
        print(f"this: {i}")
        yield from bps.mv(pv, 1)

        yield from bps.sleep(delay_s)
        print("that")
        yield from bps.mv(pv, -1)
  1. Run that plan

     RE(this_that(0), purpose="test new idea")
    
  2. Observation

    More than 75k iterations without a stall.

@prjemian
Copy link
Collaborator Author

prjemian commented Dec 4, 2019

Note: When the scan was stopped with ^C^C and then RE.abort(), the exception trace reported ends with the same lines:

      File "<ipython-input-12-f1de86ba6da4>", line 10, in this_that
        yield from bps.mv(pv, 1)
      File "/home/mintadmin/Apps/anaconda/envs/bluesky/lib/python3.7/site-packages/bluesky/plan_stubs.py", line 247, in mv
        yield Msg('wait', None, group=group)
      File "/home/mintadmin/Apps/anaconda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
        inner_ret = yield msg
      File "/home/mintadmin/Apps/anaconda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
        inner_ret = yield msg
      File "/home/mintadmin/Apps/anaconda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
        inner_ret = yield msg
      [Previous line repeated 1 more time]
    bluesky.utils.RequestAbort

Conclusion search for the stalled wait state may be a red herring and this issue may be different from prjemian/ipython-vm7#1 after all.

@prjemian
Copy link
Collaborator Author

prjemian commented Dec 4, 2019

Puzzled how to proceed on this. Resigned to wait until we discover this problem during upcoming beam time.

@prjemian
Copy link
Collaborator Author

I believe this FailedStatus exception is directly related to the TimeoutError reported to ophyd and may also be related to another set() call in progress as reported by the APS USAXS instrument. My first assumption is these may be common to the interface between ophyd and pyepics. We will check with the APS BCDA group for experience with EPICS clients (such as PyEpics and SPEC) and similar problems. A combination of retries and direct ophyd handling of the various timeouts (connection, get, put) is the first hypothesis to resolve this issue.

@prjemian
Copy link
Collaborator Author

For now, focus my comments into the ophyd TimeoutError issue.

@prjemian
Copy link
Collaborator Author

Also, the motor stalled at end of move error is now posted to ophyd: bluesky/ophyd#782

@prjemian
Copy link
Collaborator Author

This can be diagnosed outside of beam time. Moving to next milestone.

@prjemian
Copy link
Collaborator Author

If this is truly a problem with handling EpicsSignal, then we can test with one of the general purpose numerical register PVs used by the beam line. They follow the pattern of 8idi:RegNNN where NNN is a number from 1-200. Many of these are pre-allocated for a specific use so we should use one of them that is not pre-allocated. The beam line allocates them in order, so we'll pick 8idi:Reg200 for our testing and use a variation of the ping_pong plan.

i = 0
def ping_pong(motor, v1, v2, delay_s=1e-2):
    global i
    yield from move(motor, f"ping {i+1}", v1, delay_s)
    yield from move(motor, f"pong {i+1}", v2, delay_s)
i += 1

prjemian pushed a commit that referenced this issue Dec 19, 2019
@prjemian
Copy link
Collaborator Author

Found the FailedStatus exception on the second try in cycle 420 during the pong step:

(bluesky) 8idiuser@quartz ~/.../2019/Dec2019 $ ./issue114.py 
ping 1:  0.1 0.1
pong 1:  -0.1 -0.1
ping 2:  0.1 0.1
pong 2:  -0.1 -0.1
ping 3:  0.1 0.1
pong 3:  -0.1 -0.1
ping 4:  0.1 0.1
pong 4:  -0.1 -0.1
ping 5:  0.1 0.1
pong 5:  -0.1 -0.1
ping 6:  0.1 0.1
pong 6:  -0.1 -0.1
ping 7:  0.1 0.1
pong 7:  -0.1 -0.1
ping 8:  0.1 0.1
pong 8:  -0.1 -0.1
ping 9:  0.1 0.1
pong 9:  -0.1 -0.1
ping 10:  0.1 0.1
pong 10:  -0.1 -0.1
(bluesky) 8idiuser@quartz ~/.../2019/Dec2019 $ ./issue114.py 1000
ping 1:  0.1 0.1
pong 1:  -0.1 -0.1
ping 2:  0.1 0.1
pong 2:  -0.1 -0.1
ping 3:  0.1 0.1
pong 3:  -0.1 -0.1
ping 4:  0.1 0.1
pong 4:  -0.1 -0.1
ping 5:  0.1 0.1
pong 5:  -0.1 -0.1
ping 6:  0.1 0.1
pong 6:  -0.1 -0.1
ping 7:  0.1 0.1
pong 7:  -0.1 -0.1
ping 8:  0.1 0.1
pong 8:  -0.1 -0.1
ping 9:  0.1 0.1
pong 9:  -0.1 -0.1
...
ping 418:  0.1 0.1
pong 418:  -0.1 -0.1
ping 419:  0.1 0.1
pong 419:  -0.1 -0.1
ping 420:  0.1 0.1
/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/epics/ca.py:1459: UserWarning: ca.get('8idi:Reg200') timed out after 1.00 seconds.
  warnings.warn(msg % (name(chid), timeout))
[E 12:47:04.515 run_engine:1261] Run aborted
    Traceback (most recent call last):
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/run_engine.py", line 1127, in _run
        self._exception or resp)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 1294, in __call__
        return (yield from plan)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 1147, in baseline_wrapper
        return (yield from plan)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 790, in monitor_during_wrapper
        return (yield from plan2)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
        msg = plan_stack[-1].throw(exception)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
        msg = plan_stack[-1].throw(exception)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 848, in fly_during_wrapper
        return (yield from plan2)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
        msg = plan_stack[-1].throw(exception)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
        msg = plan_stack[-1].throw(exception)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/plan_stubs.py", line 859, in repeater
        yield from gen_func(*args, **kwargs)
      File "./issue114.py", line 65, in ping_pong
        yield from move(signal, f"pong {i+1}", v2, delay_s)
      File "./issue114.py", line 55, in move
        yield from bps.mv(signal, dest)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/plan_stubs.py", line 247, in mv
        yield Msg('wait', None, group=group)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
        inner_ret = yield msg
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
        inner_ret = yield msg
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
        inner_ret = yield msg
      [Previous line repeated 1 more time]
    bluesky.utils.FailedStatus: Status(obj=EpicsSignal(read_pv='8idi:Reg200', name='pv', value=-0.1, timestamp=1576781223.5102763, tolerance=0.0001, auto_monitor=False, string=False, write_pv='8idi:Reg200', limits=False, put_complete=False), done=True, success=False)
Traceback (most recent call last):
  File "./issue114.py", line 77, in <module>
    delay_s=DELAY_S,
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/run_engine.py", line 753, in __call__
    raise exc
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/run_engine.py", line 1262, in _run
    raise err
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/run_engine.py", line 1127, in _run
    self._exception or resp)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 1294, in __call__
    return (yield from plan)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 1147, in baseline_wrapper
    return (yield from plan)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 790, in monitor_during_wrapper
    return (yield from plan2)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
    msg = plan_stack[-1].throw(exception)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
    msg = plan_stack[-1].throw(exception)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 848, in fly_during_wrapper
    return (yield from plan2)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
    msg = plan_stack[-1].throw(exception)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
    msg = plan_stack[-1].throw(exception)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/plan_stubs.py", line 859, in repeater
    yield from gen_func(*args, **kwargs)
  File "./issue114.py", line 65, in ping_pong
    yield from move(signal, f"pong {i+1}", v2, delay_s)
  File "./issue114.py", line 55, in move
    yield from bps.mv(signal, dest)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/plan_stubs.py", line 247, in mv
    yield Msg('wait', None, group=group)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
    inner_ret = yield msg
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
    inner_ret = yield msg
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
    inner_ret = yield msg
  [Previous line repeated 1 more time]
bluesky.utils.FailedStatus: Status(obj=EpicsSignal(read_pv='8idi:Reg200', name='pv', value=-0.1, timestamp=1576781223.5102763, tolerance=0.0001, auto_monitor=False, string=False, write_pv='8idi:Reg200', limits=False, put_complete=False), done=True, success=False)

@prjemian
Copy link
Collaborator Author

Repeated, same thing during cycle 598, pong step. Convinced problem is not limited to EpicsMotor.

@prjemian
Copy link
Collaborator Author

In ophyd.signal.EpicsSignalBase.get(), changing this line:

 info = self._read_pv.get_with_metadata(as_string=as_string, form=form, 
**kwargs) 

to:

        try:
            warnings.filterwarnings("error")
            t0 = time.time()
            info = self._read_pv.get_with_metadata(
                        as_string=as_string, form=form, **kwargs) 
        except UserWarning as exc:
            print(f"while getting from {self._read_pvname}, caught '{exc}'")
            info = None
        finally:
            dt = time.time() - t0
            warnings.filterwarnings("default")

results in this response (which identifies the timeout from PyEpics has been trapped):

ping 466:  0.1 0.1
pong 466:  -0.1 -0.1
ping 467:  0.1 0.1
while getting from 8idi:Reg200, caught 'ca.get('8idi:Reg200') timed out after 1.00 seconds.'
[E 13:46:05.808 run_engine:1261] Run aborted
    Traceback (most recent call last):
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/run_engine.py", line 1127, in _run
        self._exception or resp)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 1294, in __call__
        return (yield from plan)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 1147, in baseline_wrapper
        return (yield from plan)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 790, in monitor_during_wrapper
        return (yield from plan2)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
        msg = plan_stack[-1].throw(exception)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
        msg = plan_stack[-1].throw(exception)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 848, in fly_during_wrapper
        return (yield from plan2)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
        msg = plan_stack[-1].throw(exception)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
        msg = plan_stack[-1].throw(exception)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/plan_stubs.py", line 859, in repeater
        yield from gen_func(*args, **kwargs)
      File "./issue114.py", line 67, in ping_pong
        yield from move(signal, f"pong {i+1}", v2, delay_s)
      File "./issue114.py", line 57, in move
        yield from bps.mv(signal, dest)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/plan_stubs.py", line 247, in mv
        yield Msg('wait', None, group=group)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
        inner_ret = yield msg
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
        inner_ret = yield msg
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
        inner_ret = yield msg
      [Previous line repeated 1 more time]
    bluesky.utils.FailedStatus: Status(obj=EpicsSignal(read_pv='8idi:Reg200', name='pv', value=-0.1, timestamp=1576784764.8081648, tolerance=0.0001, auto_monitor=False, string=False, write_pv='8idi:Reg200', limits=False, put_complete=False), done=True, success=False)
Traceback (most recent call last):
  File "./issue114.py", line 79, in <module>
    delay_s=DELAY_S,
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/run_engine.py", line 753, in __call__
    raise exc
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/run_engine.py", line 1262, in _run
    raise err
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/run_engine.py", line 1127, in _run
    self._exception or resp)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 1294, in __call__
    return (yield from plan)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 1147, in baseline_wrapper
    return (yield from plan)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 790, in monitor_during_wrapper
    return (yield from plan2)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
    msg = plan_stack[-1].throw(exception)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
    msg = plan_stack[-1].throw(exception)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 848, in fly_during_wrapper
    return (yield from plan2)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
    msg = plan_stack[-1].throw(exception)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
    msg = plan_stack[-1].throw(exception)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/plan_stubs.py", line 859, in repeater
    yield from gen_func(*args, **kwargs)
  File "./issue114.py", line 67, in ping_pong
    yield from move(signal, f"pong {i+1}", v2, delay_s)
  File "./issue114.py", line 57, in move
    yield from bps.mv(signal, dest)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/plan_stubs.py", line 247, in mv
    yield Msg('wait', None, group=group)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
    inner_ret = yield msg
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
    inner_ret = yield msg
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
    inner_ret = yield msg
  [Previous line repeated 1 more time]
bluesky.utils.FailedStatus: Status(obj=EpicsSignal(read_pv='8idi:Reg200', name='pv', value=-0.1, timestamp=1576784764.8081648, tolerance=0.0001, auto_monitor=False, string=False, write_pv='8idi:Reg200', limits=False, put_complete=False), done=True, success=False)

@prjemian
Copy link
Collaborator Author

final version (traps the Timeout from PyEpics and raises TimeoutError):

        try:
            warnings.filterwarnings("error")
            t0 = time.time()
            info = self._read_pv.get_with_metadata(
                        as_string=as_string, form=form, **kwargs) 
        except UserWarning as exc:
            timeout = kwargs.get('timeout', None)
            raise TimeoutError('Failed to read {} within: {} sec: {}'
                               .format(self._read_pvname, timeout, exc))
        finally:
            dt = time.time() - t0
            warnings.filterwarnings("default")

Still ends with a FailedStatus exception

exception traceback

pong 1782:  -0.1 -0.1
ping 1783:  0.1 0.1
pong 1783:  -0.1 -0.1
ping 1784:  0.1 0.1
[E 14:01:03.091 run_engine:1261] Run aborted
    Traceback (most recent call last):
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/run_engine.py", line 1127, in _run
        self._exception or resp)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 1294, in __call__
        return (yield from plan)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 1147, in baseline_wrapper
        return (yield from plan)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 790, in monitor_during_wrapper
        return (yield from plan2)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
        msg = plan_stack[-1].throw(exception)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
        msg = plan_stack[-1].throw(exception)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 848, in fly_during_wrapper
        return (yield from plan2)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
        msg = plan_stack[-1].throw(exception)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
        msg = plan_stack[-1].throw(exception)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/plan_stubs.py", line 859, in repeater
        yield from gen_func(*args, **kwargs)
      File "./issue114.py", line 67, in ping_pong
        yield from move(signal, f"pong {i+1}", v2, delay_s)
      File "./issue114.py", line 57, in move
        yield from bps.mv(signal, dest)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/plan_stubs.py", line 247, in mv
        yield Msg('wait', None, group=group)
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
        inner_ret = yield msg
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
        inner_ret = yield msg
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
        inner_ret = yield msg
      [Previous line repeated 1 more time]
    bluesky.utils.FailedStatus: Status(obj=EpicsSignal(read_pv='8idi:Reg200', name='pv', value=-0.1, timestamp=1576785662.091077, tolerance=0.0001, auto_monitor=False, string=False, write_pv='8idi:Reg200', limits=False, put_complete=False), done=True, success=False)
Traceback (most recent call last):
  File "./issue114.py", line 79, in <module>
    delay_s=DELAY_S,
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/run_engine.py", line 753, in __call__
    raise exc
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/run_engine.py", line 1262, in _run
    raise err
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/run_engine.py", line 1127, in _run
    self._exception or resp)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 1294, in __call__
    return (yield from plan)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 1147, in baseline_wrapper
    return (yield from plan)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 790, in monitor_during_wrapper
    return (yield from plan2)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
    msg = plan_stack[-1].throw(exception)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
    msg = plan_stack[-1].throw(exception)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 848, in fly_during_wrapper
    return (yield from plan2)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
    msg = plan_stack[-1].throw(exception)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 77, in plan_mutator
    msg = plan_stack[-1].throw(exception)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/plan_stubs.py", line 859, in repeater
    yield from gen_func(*args, **kwargs)
  File "./issue114.py", line 67, in ping_pong
    yield from move(signal, f"pong {i+1}", v2, delay_s)
  File "./issue114.py", line 57, in move
    yield from bps.mv(signal, dest)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/plan_stubs.py", line 247, in mv
    yield Msg('wait', None, group=group)
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
    inner_ret = yield msg
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
    inner_ret = yield msg
  File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/preprocessors.py", line 194, in plan_mutator
    inner_ret = yield msg
  [Previous line repeated 1 more time]
bluesky.utils.FailedStatus: Status(obj=EpicsSignal(read_pv='8idi:Reg200', name='pv', value=-0.1, timestamp=1576785662.091077, tolerance=0.0001, auto_monitor=False, string=False, write_pv='8idi:Reg200', limits=False, put_complete=False), done=True, success=False)

@prjemian
Copy link
Collaborator Author

There is still code that is dropping the message content of this TimeoutError as it fails the status object. Need to find that next.

@prjemian
Copy link
Collaborator Author

still blogging here ...

turning on debugging in the issue114.py

import ophyd
# logger = logging.getLogger('ophyd.event_dispatcher')
logger = logging.getLogger('ophyd.signal')
logger.setLevel("DEBUG")

shows that Timeout is received:

pong 109:  -0.1 -0.1
[D 14:17:45.222 signal:243] set_and_wait('pv', 0.1) succeeded => 0.1
ping 110:  0.1 0.1
[D 14:17:45.228 signal:243] set_and_wait('pv', -0.1) succeeded => -0.1
pong 110:  -0.1 -0.1
[D 14:17:45.232 signal:243] set_and_wait('pv', 0.1) succeeded => 0.1
ping 111:  0.1 0.1
[D 14:17:45.238 signal:243] set_and_wait('pv', -0.1) succeeded => -0.1
pong 111:  -0.1 -0.1
[D 14:17:46.239 signal:235] set_and_wait('pv', 0.1) timed out
[E 14:17:46.240 run_engine:1261] Run aborted
    Traceback (most recent call last):

from ophyd.signal.Signal.set_thread:

        def set_thread():
            try:
                set_and_wait(self, value, timeout=timeout, atol=self.tolerance,
                             rtol=self.rtolerance)
            except TimeoutError:
                self.log.debug('set_and_wait(%r, %s) timed out', self.name,
                               value)
                success = False
            except Exception as ex:
                self.log.exception('set_and_wait(%r, %s) failed',
                                   self.name, value)
                success = False
            else:
                self.log.debug('set_and_wait(%r, %s) succeeded => %s',
                               self.name, value, self.value)
                success = True
                if settle_time is not None:
                    time.sleep(settle_time)
            finally:
                # keep a local reference to avoid any GC shenanigans
                th = self._set_thread
                # these two must be in this order to avoid a race condition
                self._set_thread = None
                st._finished(success=success)
                del th

and we see here that the exception message is being ignored. It would be useful to pass it along (via self.log) in this case.

@prjemian
Copy link
Collaborator Author

Changing the exception handling slightly to report the message:

            except TimeoutError as exc:
                self.log.debug('set_and_wait(%r, %s) timed out: %s', self.name,
                               value, exc)

now captures the message content:

[D 14:23:39.620 signal:243] set_and_wait('pv', -0.1) succeeded => -0.1
pong 1572:  -0.1 -0.1
[D 14:23:39.624 signal:243] set_and_wait('pv', 0.1) succeeded => 0.1
ping 1573:  0.1 0.1
[D 14:23:39.628 signal:243] set_and_wait('pv', -0.1) succeeded => -0.1
pong 1573:  -0.1 -0.1
[D 14:23:40.629 signal:235] set_and_wait('pv', 0.1) timed out: Failed to read 8idi:Reg200 within: None sec: ca.get('8idi:Reg200') timed out after 1.00 seconds.
[E 14:23:40.629 run_engine:1261] Run aborted
    Traceback (most recent call last):
      File "/home/beams/8IDIUSER/.conda/envs/bluesky/lib/python3.7/site-packages/bluesky/run_engine.py", line 1127, in _run
        self._exception or resp)

@tacaswell: Is this an example of that None error you described?

@prjemian
Copy link
Collaborator Author

This is the code that produces the misleading report of None seconds:

        timeout = kwargs.get('timeout', None)

My contention is that ophyd should take control of the timeout value and also implement retries. Start with a short timeout, keep retrying (to read the PV) with progressively longer timeouts until maximum waiting time has elapsed. Only then, report a TimeoutError.

@prjemian
Copy link
Collaborator Author

@tacaswell, @klauer, @danielballan, @mrakitin: victory in sight

Looks messy, needs cleanup, but here is a revision to ophyd.signal.EpicsSignalBase.get() that employs retries and an escalating retry timeout period. Bottom line, it works (no failures in 1000 ping_ping cycles)!

code revision to get()

    def prj_get(self, *, as_string=None, form='time', 
                max_wait_time_s=3.0,
                max_retries=10,
                **kwargs):
        counter = 0
        retry_timeout = kwargs.get('timeout', 0.05)
        scale_factor = max_wait_time_s / exp(max_retries-1)  # t seconds for n retries

        t0 = time.time()
        warnings.filterwarnings("error")
        while time.time() - t0 < max_wait_time_s:
            # What if previously-connected PV has disconnected?
            # Do we need to check here or does PyEpics raise a warning or exception for this?
            try:
                info = self._read_pv.get_with_metadata(
                            as_string=as_string, form=form, 
                            timeout=retry_timeout, **kwargs) 
                return info
            except UserWarning as exc:
                # timeout = kwargs.get('timeout', None)
                # raise TimeoutError('Failed to read {} within: {} sec: {}'
                #                 .format(self._read_pvname, timeout, exc))
                info = None
                print(exc)
            finally:
                counter += 1
                retry_timeout = min(
                    max_wait_time_s - (time.time() - t0),        # limit to max_wait_time_s
                    retry_timeout + scale_factor * exp(counter)  # progressively longer wait time
                )

        dt = time.time() - t0
        warnings.filterwarnings("default")
        if counter > 0:
            self.log.debug('get({}) pv={} retries={} in {:.3f} sec'
                           .format(self.name, self._read_pvname, counter, dt))
        if info is None:
            raise TimeoutError('Failed to read {} within: {} sec'
                               .format(self._read_pvname, dt))

    def get(self, *, as_string=None, connection_timeout=1.0, form='time',
            **kwargs):
        '''Get the readback value through an explicit call to EPICS

        Parameters
        ----------
        count : int, optional
            Explicitly limit count for array data
        as_string : bool, optional
            Get a string representation of the value, defaults to as_string
            from this signal, optional
        as_numpy : bool
            Use numpy array as the return type for array data.
        timeout : float, optional
            maximum time to wait for value to be received.
            (default = 0.5 + log10(count) seconds)
        use_monitor : bool, optional
            to use value from latest monitor callback or to make an
            explicit CA call for the value. (default: True)
        connection_timeout : float, optional
            If not already connected, allow up to `connection_timeout` seconds
            for the connection to complete.
        form : {'time', 'ctrl'}
            PV form to request
        '''
        if as_string is None:
            as_string = self._string

        self.wait_for_connection(timeout=connection_timeout)
        info = self.prj_get(**kwargs)
        # try:
        #     warnings.filterwarnings("error")
        #     t0 = time.time()
        #     info = self._read_pv.get_with_metadata(
        #                 as_string=as_string, form=form, **kwargs) 
        # except UserWarning as exc:
        #     timeout = kwargs.get('timeout', None)
        #     raise TimeoutError('Failed to read {} within: {} sec: {}'
        #                        .format(self._read_pvname, timeout, exc))
        # finally:
        #     dt = time.time() - t0
        #     warnings.filterwarnings("default")

        if info is None:
            timeout = kwargs.get('timeout', None)
            raise TimeoutError('Failed to read {} within: {} sec}'
                               .format(self._read_pvname, timeout))

        value = info.pop('value')
        if as_string:
            value = waveform_to_string(value)

        has_monitor = self._monitors[self.pvname] is not None
        if form != 'time' or not has_monitor:
            # Different form such as 'ctrl' holds additional data not available
            # through the DBR_TIME channel access monitor
            self._metadata_changed(self.pvname, info, update=True,
                                   require_timestamp=True, from_monitor=False)

        if not has_monitor:
            # No monitor - readback can only be updated here
            self._readback = value
        return self._fix_type(value)

The retry handler caught the occasional timeout from PyEpics but eventually succeeded within the default wait period of 5 s. Here is an example:

ping 9659:  0.1 0.1
[D 15:58:33.922 signal:244] set_and_wait('pv', -0.1) succeeded => -0.1
pong 9659:  -0.1 -0.1
[D 15:58:33.926 signal:244] set_and_wait('pv', 0.1) succeeded => 0.1
ping 9660:  0.1 0.1
ca.get('8idi:Reg200') timed out after 0.05 seconds.
[D 15:58:34.009 signal:244] set_and_wait('pv', -0.1) succeeded => -0.1
pong 9660:  -0.1 -0.1
[D 15:58:34.013 signal:244] set_and_wait('pv', 0.1) succeeded => 0.1
ping 9661:  0.1 0.1

and one with more retries:

pong 8208:  -0.1 -0.1
[D 15:58:12.186 signal:244] set_and_wait('pv', 0.1) succeeded => 0.1
ping 8209:  0.1 0.1
[D 15:58:12.190 signal:244] set_and_wait('pv', -0.1) succeeded => -0.1
pong 8209:  -0.1 -0.1
[D 15:58:12.194 signal:244] set_and_wait('pv', 0.1) succeeded => 0.1
ping 8210:  0.1 0.1
ca.get('8idi:Reg200') timed out after 0.05 seconds.
ca.get('8idi:Reg200') timed out after 0.05 seconds.
ca.get('8idi:Reg200') timed out after 0.05 seconds.
ca.get('8idi:Reg200') timed out after 0.06 seconds.
ca.get('8idi:Reg200') timed out after 0.08 seconds.
ca.get('8idi:Reg200') timed out after 0.14 seconds.
ca.get('8idi:Reg200') timed out after 0.29 seconds.
[D 15:58:13.256 signal:244] set_and_wait('pv', -0.1) succeeded => -0.1
pong 8210:  -0.1 -0.1
[D 15:58:13.291 signal:244] set_and_wait('pv', 0.1) succeeded => 0.1
ping 8211:  0.1 0.1
[D 15:58:13.299 signal:244] set_and_wait('pv', -0.1) succeeded => -0.1
pong 8211:  -0.1 -0.1
[D 15:58:13.306 signal:244] set_and_wait('pv', 0.1) succeeded => 0.1
ping 8212:  0.1 0.1

total retry time here was 1.02 s

@prjemian
Copy link
Collaborator Author

No failures in 10,000 cycles.

@prjemian
Copy link
Collaborator Author

Next need to test this on my troublesome VM.

@prjemian
Copy link
Collaborator Author

Another test at 8-ID-I, 10,000 cycles through ping_pong plan yields these 7 successful retry sequences:

[D 16:35:04.800 signal:817] get(pv) pv=8idi:Reg200 retries=8 in 2.268 sec
[D 16:35:37.810 signal:817] get(pv) pv=8idi:Reg200 retries=7 in 1.092 sec
[D 16:36:03.993 signal:817] get(pv) pv=8idi:Reg200 retries=7 in 1.155 sec
[D 16:36:11.045 signal:817] get(pv) pv=8idi:Reg200 retries=7 in 1.059 sec
[D 16:36:17.781 signal:817] get(pv) pv=8idi:Reg200 retries=1 in 0.067 sec
[D 16:36:44.015 signal:817] get(pv) pv=8idi:Reg200 retries=7 in 1.094 sec
[D 16:37:04.106 signal:817] get(pv) pv=8idi:Reg200 retries=7 in 1.119 sec

@tacaswell
Copy link

Do we have any understanding of why reads sometimes take several seconds?

Do we get any benefit from re-trying that we don't get by just setting the default timeout to 5s or so?

@tacaswell
Copy link

Thinking through the things that could be going wrong:

  • python main thread is locked up -> no, because we can retry
  • race condition in pyepics where it handles the book keeping (I sent out a read request, need to wait for it to come back) -> no, because we see retry counts > 2. If this were the problem it would need to be rare enough that it normally works that getting it 8 times in a row is very low probability
  • IOC falling over -> no, because we don't get channel death warnings
  • tcp dropping packets -> lets not go down that route until we have to
  • network is getting bogged down and delaying delivery -> maybe, but worrying. However, if this is the problem, retrying may just be making the problem worse
  • the IOC is bogged down / locked up -> I don't think the QoS on the protocol lets the IOCs chose to not service read requests (the way it is allowed to drop monitors) so presumably the original read will be serviced eventually and polling will only make the problem worse
  • we are not calling pend_event in pyepics and the messages are sitting in a libca buffer -> no because the timeout in pyepics comes from is a hot(ish) loop with a pyepics.ca.poll()
  • something is going wrong in libca which mean pend_event is not doing what I think it is -> maybe, but given how core this is to libca that seems unlikely.

Would it be possible to re-run the same test with caproto rather than pyepics? That will help determine what side of the client network stack the problem is on. caproto also has the hooks where we can track the timing of the read requests being serviced. We should be able to check if the original read is being serviced, just very slowly, or if is is being dropped on the floor.

It would be interesting to try hitting a different PV on the same IOC while one of them is timing out. Or to quickly enough get a different process (or to get a different channel) to try talking to it over. This wolud help sort out if this is a network issue (our packets are getting lost) or if the problem is in the IOC not severvicing any requests.

Can we get any telemetry from the IOC to see if it is stalling?

The next next step is wire shark.

@danielballan
Copy link

Switching to the ophyd caproto backend might raise some red-herring issues, as it could have problems of its own. I would suggest instead staying focused on the pyepics backend and using caproto-shark or Davidsaver's wireshark plugn to observe the network traffic from the outside.

@prjemian
Copy link
Collaborator Author

prjemian commented Jan 6, 2020

Do we get any benefit from re-trying that we don't get by just setting the default timeout to 5s or so?

Longer timeout will not resolve problem of dropped response. Ask the IOC again has worked for years.

These are not satisfying answers yet the years of success of this empirical method show one way to work past this problem.

The place to try a solution to this is locally (for the XPCS instrument), rather than affecting all ophyd users.

@prjemian
Copy link
Collaborator Author

prjemian commented Jan 6, 2020

About retries, see these comments. Is there any evidence on the console of a Virtual circuit unresponsive message, contemporary with the FailedStatus exception?

CA.Client.Exception...............................................
    Warning: "Virtual circuit unresponsive"
    Context: "ioc8idi1.xray.aps.anl.gov:5064"
    Source File: ../tcpiiu.cpp line 920
    Current Time: Sun Dec 22 2019 13:15:42.498961135
Traceback (most recent call last):
..................................................................```

@prjemian
Copy link
Collaborator Author

prjemian commented Jan 6, 2020

Do we have any understanding of why reads sometimes take several seconds?

Following this advice, is this additional, unexpected delay the result of a new PV name search, per what is written in the Channel Access Protocol Specification (https://epics.anl.gov/base/R3-16/0-docs/CAproto).

@prjemian
Copy link
Collaborator Author

This exception has not not occurred in recent testing of #127. Closing this issue, based on conclusions from: #124 (comment)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants