FailedStatus exception during AD_Acquire #114
This failed (but succeeded when we just ran it again). Example of an error that comes seemingly at random. Then it succeeds.
See similar issue (prjemian/ipython-vm7#1 (comment)). This has been discussed on the Nikea Slack. @tacaswell suggested:
The problem happens with motor moves, perhaps in short succession to each other. Distilling this to a short macro with an added sleep before each move (run in an offline development system):
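As a rough sketch (not the original macro), such a back-and-forth plan using bluesky plan stubs and a placeholder motor `m1` might look like:

```python
# Rough sketch only, not the original macro: move a motor back and
# forth, sleeping before each move.  `m1` is a placeholder EpicsMotor.
import bluesky.plan_stubs as bps

def back_and_forth(motor, pos1=0, pos2=1, cycles=1000, delay=0.1):
    """Ping-pong the motor between two positions with a short sleep."""
    for cycle in range(cycles):
        yield from bps.sleep(delay)       # added sleep before each move
        yield from bps.mv(motor, pos1)    # "ping"
        yield from bps.sleep(delay)
        yield from bps.mv(motor, pos2)    # "pong"

# usage in a session where RE and m1 already exist:
# RE(back_and_forth(m1))
```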
Observed this failure rate (number of back-and-forth iterations until the motor motion stalled):
@tacaswell also said:
Note: Ouch! The PV that failed here, …
Thus, the problem may not be in the ophyd support for epics_motor and positioner but perhaps in the EpicsSignal support, involving a status message.
If that is the case, then we can modify the […]
test system 2
Note: When the scan was stopped with …
Conclusion: the search for the stalled wait state may be a red herring and this issue may be different from prjemian/ipython-vm7#1 after all.
Puzzled how to proceed on this. Resigned to wait until we discover this problem during upcoming beam time.
I believe this …
For now, focus my comments into the ophyd …
Also, the "motor stalled at end of move" error is now posted to ophyd: bluesky/ophyd#782
This can be diagnosed outside of beam time. Moving to the next milestone.
If this is truly a problem with handling EpicsSignal, then we can test with one of the general-purpose numerical register PVs used by the beam line. They follow the pattern of …
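For illustration, a direct test of a plain EpicsSignal might look like this sketch (the PV name below is a placeholder, not the beam line's actual pattern):

```python
# Sketch: exercise a plain EpicsSignal (no motor record involved).
# "ioc:userCalc1" is a placeholder PV name, not the beam line's pattern.
from ophyd import EpicsSignal

register = EpicsSignal("ioc:userCalc1", name="register")
register.wait_for_connection()

for cycle in range(10_000):
    register.put(cycle % 100)   # write something
    value = register.get()      # read it back; this is where timeouts appear
```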
Found the …
Repeated, same thing during cycle 598, pong step. Convinced the problem is not limited to EpicsMotor.
In …, changing … to … results in this response, which shows that the timeout from PyEpics has been trapped.
The final version (which traps the Timeout from PyEpics and raises …) still ends with an exception traceback.
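One way to trap the PyEpics timeout and turn it into a TimeoutError, sketched here with a hypothetical helper (PyEpics' PV.get() returns None when the read does not complete within the timeout):

```python
# Hypothetical helper, not ophyd's actual code: convert a PyEpics read
# timeout (get() returning None) into a TimeoutError with a useful message.
import epics

def get_or_raise(pvname, timeout=1.0):
    pv = epics.get_pv(pvname)
    value = pv.get(timeout=timeout)
    if value is None:
        # pyepics signals a timed-out read by returning None
        raise TimeoutError(f"Failed to read {pvname} within {timeout} s")
    return value
```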
There is still code that is dropping the message content of this TimeoutError as it fails the status object. Need to find that next.
Still blogging here ... turning on debugging in issue114.py shows that the Timeout is received.
From …, we see here that the exception message is being ignored. It would be useful to pass it along (via …).
Changing the exception handling slightly to report the message now captures the message content.
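A sketch of exception handling that reports (rather than drops) the message, using a hypothetical wrapper around an ophyd signal's get():

```python
# Hypothetical wrapper, not the actual revision: log the TimeoutError's
# message before letting it propagate, so the reason is not lost.
import logging

logger = logging.getLogger(__name__)

def read_with_report(signal):
    try:
        return signal.get()
    except TimeoutError as exc:
        logger.error("read of %s failed: %s", signal.name, exc)
        raise
```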
@tacaswell: Is this an example of that …?
This is the code that produces the misleading report of …
My contention is that ophyd should take control of the timeout value and also implement retries. Start with a short timeout, then keep retrying (to read the PV) with progressively longer timeouts until the maximum waiting time has elapsed. Only then report a TimeoutError.
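As a sketch of that policy (illustrative names and defaults; `pv` is assumed to be a PyEpics PV object, whose get() returns None when the read times out):

```python
# Illustrative sketch of the proposed retry policy, not ophyd's code:
# retry the read with progressively longer timeouts and raise
# TimeoutError only after the maximum waiting time has elapsed.
import time

def get_with_retries(pv, first_timeout=0.1, max_wait=5.0, factor=2.0):
    deadline = time.monotonic() + max_wait
    timeout = first_timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(
                f"Failed to read {pv.pvname} within {max_wait} s"
            )
        value = pv.get(timeout=min(timeout, remaining))
        if value is not None:
            return value
        timeout *= factor   # progressively longer timeout on each retry
```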
@tacaswell, @klauer, @danielballan, @mrakitin: victory in sight! Looks messy, needs cleanup, but here is a code revision to get():
The retry handler caught the occasional timeout from PyEpics but eventually succeeded within the default wait period of 5 s. Here is an example:
and one with more retries:
The total retry time here was 1.02 s.
No failures in 10,000 cycles.
Next, need to test this on my troublesome VM.
Another test at 8-ID-I, 10,000 cycles through the ping_pong plan, yields these 7 successful retry sequences:
Do we have any understanding of why reads sometimes take several seconds? Do we get any benefit from retrying that we don't get by just setting the default timeout to 5 s or so?
Thinking through the things that could be going wrong:
- Would it be possible to re-run the same test with caproto rather than pyepics? That would help determine which side of the client network stack the problem is on. caproto also has the hooks where we can track the timing of the read requests being serviced. We should be able to check whether the original read is being serviced, just very slowly, or whether it is being dropped on the floor. (See the sketch after this list.)
- It would be interesting to try hitting a different PV on the same IOC while one of them is timing out, or to quickly get a different process (or a different channel) to try talking to it. This would help sort out whether this is a network issue (our packets are getting lost) or whether the problem is in the IOC not servicing any requests.
- Can we get any telemetry from the IOC to see if it is stalling?
- The next step is Wireshark.
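A sketch of how the same test could be pointed at ophyd's caproto control layer (assuming the set_cl() switch, called before any signals or devices are created):

```python
# Sketch, assuming ophyd's set_cl() control-layer switch: select the
# caproto client instead of pyepics, then build the devices as usual.
import ophyd
ophyd.set_cl("caproto")        # call before creating signals/devices

from ophyd import EpicsMotor
m1 = EpicsMotor("ioc:m1", name="m1")   # placeholder PV prefix
m1.wait_for_connection()
# ... then run the same back-and-forth test as before
```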
Switching to the ophyd caproto backend might raise some red-herring issues, as it could have problems of its own. I would suggest instead staying focused on the pyepics backend and using …
A longer timeout will not resolve the problem of a dropped response. Asking the IOC again has worked for years. These are not satisfying answers, yet the years of success of this empirical method show one way to work past this problem. The place to try a solution is locally (for the XPCS instrument), rather than affecting all ophyd users.
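If the fix stays local to the instrument, one shape it might take is a small subclass (illustrative only; the class name and retry count are assumptions):

```python
# Illustrative local workaround (instrument-level), not a change to ophyd:
# a signal subclass that retries a timed-out read a few times before
# re-raising the last TimeoutError.
from ophyd import EpicsSignal

class RetryEpicsSignal(EpicsSignal):
    def get(self, retries=3, **kwargs):
        last_exc = None
        for _ in range(retries):
            try:
                return super().get(**kwargs)
            except TimeoutError as exc:
                last_exc = exc     # remember why this attempt failed
        raise last_exc
```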
About retries, see these comments. Is there any evidence on the console of a …?
Following this advice: is this additional, unexpected delay the result of a new PV name search, per what is written in the Channel Access Protocol Specification (https://epics.anl.gov/base/R3-16/0-docs/CAproto)?
This exception has not occurred in recent testing of #127. Closing this issue, based on conclusions from #124 (comment).