This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

use bluesky, et al., for at least one user #46

Closed
prjemian opened this issue Sep 4, 2019 · 17 comments

Labels
enhancement New feature or request

Comments

@prjemian
Collaborator

prjemian commented Sep 4, 2019

Run user operations for at least one week with an expert user, without having to fall back to SPEC in the middle of the week.

@prjemian
Collaborator Author

prjemian commented Dec 6, 2019

We had a good 10 hours of work yesterday with bluesky for the XPCS instrument using the Lambda detector. Bluesky was used for most of the instrument preparation (such as alignment scans and the routine remote control of various equipment). I believe we were able to complete a few successful measurements for the user.

Part of the day was spent developing a user plan to measure at a list of temperatures (using the Lakeshore controller). This plan would automate an overnight sequence of measurements. Our work was plagued by many interruptions of the RunEngine's execution of our plan due to TimeoutError exceptions (such as #114) from various EpicsSignal components. These signals had been read successfully in previous steps (for example, one failure occurred at read number 47 of a long scan, meaning that at least 46 prior reads had succeeded). A common factor among the different EPICS PVs underlying these signals was not obvious. Different fields of EPICS records were involved (such as .EGU, .VAL, and .TDIR), as were different types of EPICS records (motor, ao, swait). Unfortunately, the console output was not preserved (next time, we'll save it).

Also, we experienced some failures to connect with one EPICS PV or another at the start of a bluesky session using ipython. The remedy has been to exit and restart. This usually worked. A few times, a different PV was the cause of a similar failure to connect.

Summary

We agree that the bluesky framework is not ready now for unattended operations at XPCS.

There are too many interruptions (Python exceptions that interrupt the bluesky RunEngine), and these interruptions appear to originate at a low level in the framework (somewhere in the handling of EpicsSignal objects). Resolving them requires the full attention of experts able to diagnose problems at that level. Since the XPCS instrument is now in a routine operations period and the beam line has working alternative software, the group decided to switch back to SPEC for the remainder of the user's beam time.

The exception traces are too deep in most cases. This is an opportunity to improve the exception handling in the RunEngine. When a problem is due to a user's plan, the exception is raised from a deep level in the bluesky framework and passed directly to the console. We expect such exceptions to be caught by the RunEngine and re-raised as a new PlanException focused on the user plan step that provoked the exception. The PlanException would mention the specific underlying exception tersely, without the verbose trace. To preserve valuable diagnostics, the RunEngine could log the deeper exception reports separately.
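For illustration, a minimal sketch of the kind of wrapping proposed here; PlanException does not exist in bluesky today, and the message-handler structure shown is an assumption, not the RunEngine's actual internals.

# Hypothetical sketch only -- bluesky provides no PlanException today.
# The idea: catch the low-level error while executing one plan message,
# then re-raise a short, plan-focused exception that chains the original
# (so the full details remain available for separate logging).

class PlanException(Exception):
    """Raised when a specific user plan step fails (hypothetical)."""

def run_one_message(msg, handler):
    """Execute one plan message, wrapping low-level failures (sketch)."""
    try:
        return handler(msg)
    except TimeoutError as exc:
        summary = f"{exc.__class__.__name__}: {exc}"
        # terse, user-facing report; the chained exception keeps the deep trace
        raise PlanException(f"plan step {msg!r} failed ({summary})") from exc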

TODO list

  • resolve the TimeoutError (such as FailedStatus exception during AD_Acquire #114) before the next scheduled operations
  • schedule XPCS beam time with bluesky operations as the top priority. Needs a friendly user to cooperate.
  • be clear before the beam time about which tasks are to be accomplished, so we do not spend beam time developing & testing new software
  • save all the console output when such exceptions arise

@danielballan

Thanks for the thorough write-up. It sounds like there are two categories of problem: (1) something is causing TimeoutErrors and interrupting scans, and (2) the deep tracebacks present a usability issue.

@kmpeters

kmpeters commented Dec 6, 2019

I've seen TimeoutError exceptions in fly-scan Python code using pyepics running on Windows at 34ID-E. The problem occurred infrequently (less often than once every few weeks). The PV that was being accessed at the time of the timeout changed each time. Sometimes the timeout was on a get and other times it was on a put. The PV that was being accessed when the exception occurred was always available from the command line immediately afterward. I wrote safer put and get functions that handled the timeout exception and printed info about the problem, and I haven't heard from users about the problem since.

I don't remember the python, pyepics3, and EPICS lib versions. I can look them up on the next shutdown day.
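For reference, a rough sketch of the kind of "safer" put/get wrappers described above, written against pyepics. The retry counts, delays, and messages are assumptions, not the 34ID-E code, and this sketch detects a timeout from the pyepics return value rather than from an exception.

# Sketch of timeout-tolerant wrappers around pyepics (not the 34ID-E code).
import time

import epics

def safe_caget(pvname, retries=3, delay=0.1, **kwargs):
    """caget() that reports and retries when a read times out."""
    for attempt in range(retries):
        value = epics.caget(pvname, **kwargs)
        if value is not None:  # pyepics returns None when the read fails
            return value
        print(f"caget timed out on {pvname} (attempt {attempt + 1} of {retries})")
        time.sleep(delay)
    return None

def safe_caput(pvname, value, retries=3, delay=0.1, **kwargs):
    """caput() that reports and retries when a write does not succeed."""
    for attempt in range(retries):
        status = epics.caput(pvname, value, **kwargs)
        if status == 1:  # pyepics returns 1 on success
            return status
        print(f"caput timed out on {pvname} (attempt {attempt + 1} of {retries})")
        time.sleep(delay)
    return status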

@prjemian
Collaborator Author

prjemian commented Dec 6, 2019

Thanks! This is valuable input; it helps us narrow down where to look next.

@mrakitin

mrakitin commented Dec 7, 2019

> Also, we experienced some failures to connect with one EPICS PV or another at the start of a bluesky session using ipython. The remedy has been to exit and restart. This usually worked. A few times, a different PV was the cause of a similar failure to connect.

We noticed that problem at a few "heavy" (in terms of the number of PVs) beamlines. I think we need to come up with a common solution with longer allowed connection times. A temporary, in-place fix can be similar to what we use in our CI testing, i.e. monkey-patching the timeout of the EpicsSignal: https://github.com/NSLS-II/profile-collection-ci/blob/726ebc6a618caadfdfc764579471b78e802804cf/azure-linux.yml#L145-L148:

import functools
import ophyd

# Raise the connection timeout for every EpicsSignalBase to 60 s by
# monkey-patching wait_for_connection with a new default timeout.
ophyd.signal.EpicsSignalBase.wait_for_connection = functools.partialmethod(
    ophyd.signal.EpicsSignalBase.wait_for_connection, timeout=60
)

In general, I noticed the timeout is set to different values in different places -- from 1 to 10 seconds. I think we should homogenize it to a standard value or, at least, make it configurable. From previous discussions (e.g., in the Nikea Slack and caproto/caproto#512) I learned that the pyepics timeout value is 5 seconds, which seems reasonable.

cc @danielballan @tacaswell

@mrakitin

mrakitin commented Dec 7, 2019

Also, regarding the timeouts while setting PVs, a potential solution and corresponding debug logs can be found in bluesky/ophyd#779 (heavy WIP!).

@prjemian
Collaborator Author

prjemian commented Feb 4, 2020

We will use bluesky this week with the BES Pilot Project team. That will satisfy this issue.

@prjemian
Collaborator Author

prjemian commented Feb 5, 2020

Add feature to Lambda to close the shutter once acquisition of frames is completed, even when processing is still needed by the CAM plugin.
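For context, one way such behavior could be sketched with ophyd is a subscription on the detector's acquire signal; the device and signal names below (lambdadet, shutter, cam.acquire) are assumptions for illustration, not the repository's implementation.

# Hypothetical sketch: close the shutter as soon as frame acquisition stops,
# even though the plugin pipeline may still be processing the frames.
# `lambdadet` and `shutter` are assumed, pre-existing ophyd devices.

def close_shutter_when_frames_done(value=None, old_value=None, **kwargs):
    """Subscription callback: cam.acquire drops from 1 to 0 when frames are done."""
    if old_value == 1 and value == 0:
        shutter.close()
        lambdadet.cam.acquire.clear_sub(close_shutter_when_frames_done)

lambdadet.cam.acquire.subscribe(close_shutter_when_frames_done)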

@prjemian
Collaborator Author

prjemian commented Feb 5, 2020

For the Lambda, control the shutter during the trigger() method. Wait a brief moment after opening the shutter (5 ms, might be shorter) before proceeding with the trigger; see the sketch below.
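A minimal sketch of what that could look like as an ophyd mixin; the class name, the assumed global `shutter` device, and the exact delay handling are illustrations only, not the 8-ID-I implementation.

# Hypothetical sketch: open the shutter inside trigger(), allow a short
# settling delay, then proceed with the normal exposure trigger.
import time

from ophyd.areadetector import SingleTrigger

SHUTTER_SETTLE_S = 0.005  # 5 ms; might be shorter

class ShutteredTriggerMixin(SingleTrigger):
    """SingleTrigger variant whose trigger() also opens an assumed `shutter` device."""

    def trigger(self):
        shutter.open()                 # assumed ophyd shutter with open()/close()
        time.sleep(SHUTTER_SETTLE_S)   # brief pause before starting the exposure
        return super().trigger()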

@prjemian
Collaborator Author

prjemian commented Feb 5, 2020

Rigaku detector:

  • removed completed subscription
  • trigger(): brief pause (100 ms, could be shorter) after starting acquisition

@prjemian
Collaborator Author

prjemian commented Feb 5, 2020

AD_Acquire() plan:

@prjemian
Collaborator Author

prjemian commented Feb 5, 2020

Lambda testing:

  • ensure certain constants are defined with current values
  • move detector stage to desired in-beam position

@prjemian
Collaborator Author

prjemian commented Feb 5, 2020

Rigaku testing:

  • update workflow information (strings)

@prjemian
Collaborator Author

prjemian commented Feb 7, 2020

Issue #146 might stop us from reaching this goal. We can continue to operate but we know we'll have to restart the EPICS IOC often.

@qzhang234
Collaborator

Created a new milestone for this issue:

https://github.com/aps-8id-dys/ipython-8idiuser/milestone/12

@qzhang234
Collaborator

Closing this issue now

@prjemian
Collaborator Author

I can see closing this issue since 8-ID-I can operate with Bluesky for a week without switching back to SPEC. Continuous operation for one week, without unplanned exceptions due to the software, is the next step.
