This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

use bluesky, et al., for at least one user #46

Closed
prjemian opened this issue Sep 4, 2019 · 17 comments

Labels
enhancement New feature or request

Comments

@prjemian
Collaborator

prjemian commented Sep 4, 2019

Run user operations for at least one week with an expert user, without having to fall back to SPEC in the middle of the week.

@prjemian
Collaborator Author

prjemian commented Dec 6, 2019

We had a good 10 hours of work yesterday with bluesky for the XPCS instrument using the Lambda detector. Bluesky was used for most of the instrument preparation (such as alignment scans and the routine remote control of various equipment). I believe we were able to complete a few successful measurements for the user.

Part of the day was spent developing a user plan to measure at a list of temperatures (using the Lakeshore controller). This plan would automate an overnight sequence of measurements. Our work was plagued by many interruptions of the RunEngine's execution of our plan due to TimeoutError exceptions (such as #114) from various EpicsSignal components. These signals had been read successfully in previous steps (for example, one failure occurred at read number 47 of a long scan, meaning that at least 46 prior reads had succeeded). A common factor among the different EPICS PVs underlying these signals was not obvious. Different fields of EPICS records were involved (such as .EGU, .VAL, and .TDIR), as were different types of EPICS records (motor, ao, swait). Unfortunately, the console output was not preserved (next time, we'll save it).

Also, we experienced some failures to connect with one EPICS PV or another at the start of a bluesky session using ipython. The remedy has been to exit and restart. This usually worked. A few times, a different PV was the cause of a similar failure to connect.

Summary

We agree that the bluesky framework is not ready now for unattended operations at XPCS.

There are too many interruptions (Python exceptions that interrupt the bluesky RunEngine), and these interruptions appear to originate at a low level in the framework (somewhere in the handling of EpicsSignal objects). Resolving them requires the full attention of experts able to diagnose problems at that level. Since the XPCS instrument is now in a routine operations period and the beam line has working alternative software, the group decided to switch back to SPEC for the remainder of the user's beam time.

The exception traces are too deep in most cases. This is an opportunity to improve the exception handling in the RunEngine. When a problem is due to a user's plan, the exception is raised from a deep level in the bluesky framework and passed directly to the console. We expect such exceptions to be caught by the RunEngine and re-raised as a new PlanException focused on the user plan step that provoked the exception. The PlanException would mention the specific underlying exception tersely, without the verbose trace. To preserve valuable diagnostics, the RunEngine could log the deeper exception reports separately.
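For illustration, a minimal sketch of the kind of wrapping proposed here; PlanException does not exist in bluesky today, and the message-handler structure shown is an assumption, not the RunEngine's actual internals.

# Hypothetical sketch only -- bluesky provides no PlanException today.
# The idea: catch the low-level error while executing one plan message,
# then re-raise a short, plan-focused exception that chains the original
# (so the full details remain available for separate logging).

class PlanException(Exception):
    """Raised when a specific user plan step fails (hypothetical)."""

def run_one_message(msg, handler):
    """Execute one plan message, wrapping low-level failures (sketch)."""
    try:
        return handler(msg)
    except TimeoutError as exc:
        summary = f"{exc.__class__.__name__}: {exc}"
        # terse, user-facing report; the chained exception keeps the deep trace
        raise PlanException(f"plan step {msg!r} failed ({summary})") from exc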

TODO list

  • resolve the TimeoutError (such as FailedStatus exception during AD_Acquire #114) before the next scheduled operations
  • schedule XPCS beam time with bluesky operations as the top priority. Needs a friendly user to cooperate.
  • be clear before the beam time about which tasks are to be accomplished, so we do not spend beam time developing & testing new software
  • save all the console output when such exceptions arise

@danielballan

Thanks for the thorough write-up. It sounds like there are two categories of problem: (1) something is causing TimeoutErrors and interrupting scans, and (2) the deep tracebacks present a usability issue.

@kmpeters

kmpeters commented Dec 6, 2019

I've seen TimeoutError exceptions in fly-scan Python code using pyepics running on Windows at 34ID-E. The problem occurred infrequently (less often than once every few weeks). The PV that was being accessed at the time of the timeout changed each time. Sometimes the timeout was on a get and other times it was on a put. The PV that was being accessed when the exception occurred was always available from the command line immediately afterward. I wrote safer put and get functions that handled the timeout exception and printed info about the problem, and I haven't heard from users about the problem since.

I don't remember the python, pyepics3, and EPICS lib versions. I can look them up on the next shutdown day.
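For reference, a rough sketch of the kind of "safer" put/get wrappers described above, written against pyepics. The retry counts, delays, and messages are assumptions, not the 34ID-E code, and this sketch detects a timeout from the pyepics return value rather than from an exception.

# Sketch of timeout-tolerant wrappers around pyepics (not the 34ID-E code).
import time

import epics

def safe_caget(pvname, retries=3, delay=0.1, **kwargs):
    """caget() that reports and retries when a read times out."""
    for attempt in range(retries):
        value = epics.caget(pvname, **kwargs)
        if value is not None:  # pyepics returns None when the read fails
            return value
        print(f"caget timed out on {pvname} (attempt {attempt + 1} of {retries})")
        time.sleep(delay)
    return None

def safe_caput(pvname, value, retries=3, delay=0.1, **kwargs):
    """caput() that reports and retries when a write does not succeed."""
    for attempt in range(retries):
        status = epics.caput(pvname, value, **kwargs)
        if status == 1:  # pyepics returns 1 on success
            return status
        print(f"caput timed out on {pvname} (attempt {attempt + 1} of {retries})")
        time.sleep(delay)
    return status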

@prjemian
Collaborator Author

prjemian commented Dec 6, 2019

Thanks! This is valuable input; it helps us narrow down where to look next.

@mrakitin

mrakitin commented Dec 7, 2019

> Also, we experienced some failures to connect with one EPICS PV or another at the start of a bluesky session using ipython. The remedy has been to exit and restart. This usually worked. A few times, a different PV was the cause of a similar failure to connect.

We noticed that problem at a few "heavy" (in terms of the number of PVs) beamlines. I think we need to come up with a common solution with longer allowed connection times. A temporary, in-place fix can be similar to what we use in our CI testing, i.e. monkey-patching the timeout of the EpicsSignal: https://github.com/NSLS-II/profile-collection-ci/blob/726ebc6a618caadfdfc764579471b78e802804cf/azure-linux.yml#L145-L148:

import functools
import ophyd

# Raise the connection timeout for every EpicsSignalBase to 60 s by
# monkey-patching wait_for_connection with a new default timeout.
ophyd.signal.EpicsSignalBase.wait_for_connection = functools.partialmethod(
    ophyd.signal.EpicsSignalBase.wait_for_connection, timeout=60
)

In general, I noticed the timeout is set to different values in different places -- from 1 to 10 seconds. I think we should homogenize it to a standard value or, at least, make it configurable. From previous discussions (e.g., in the Nikea Slack and caproto/caproto#512) I learned that the pyepics timeout value is 5 seconds, which seems reasonable.

cc @danielballan @tacaswell

@mrakitin

mrakitin commented Dec 7, 2019

Also, regarding the timeouts while setting PVs, a potential solution and corresponding debug logs can be found in bluesky/ophyd#779 (heavy WIP!).

@prjemian
Collaborator Author

prjemian commented Feb 4, 2020

We will use bluesky this week with the BES Pilot Project team. That will satisfy this issue.

@prjemian
Collaborator Author

prjemian commented Feb 5, 2020

Add feature to Lambda to close the shutter once acquisition of frames is completed, even when processing is still needed by the CAM plugin.
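For context, one way such behavior could be sketched with ophyd is a subscription on the detector's acquire signal; the device and signal names below (lambdadet, shutter, cam.acquire) are assumptions for illustration, not the repository's implementation.

# Hypothetical sketch: close the shutter as soon as frame acquisition stops,
# even though the plugin pipeline may still be processing the frames.
# `lambdadet` and `shutter` are assumed, pre-existing ophyd devices.

def close_shutter_when_frames_done(value=None, old_value=None, **kwargs):
    """Subscription callback: cam.acquire drops from 1 to 0 when frames are done."""
    if old_value == 1 and value == 0:
        shutter.close()
        lambdadet.cam.acquire.clear_sub(close_shutter_when_frames_done)

lambdadet.cam.acquire.subscribe(close_shutter_when_frames_done)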

@prjemian
Collaborator Author

prjemian commented Feb 5, 2020

For the Lambda, control the shutter during the trigger() method. Wait a brief moment after opening the shutter (5 ms, might be shorter) before proceeding with the trigger; see the sketch below.
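A minimal sketch of what that could look like as an ophyd mixin; the class name, the assumed global `shutter` device, and the exact delay handling are illustrations only, not the 8-ID-I implementation.

# Hypothetical sketch: open the shutter inside trigger(), allow a short
# settling delay, then proceed with the normal exposure trigger.
import time

from ophyd.areadetector import SingleTrigger

SHUTTER_SETTLE_S = 0.005  # 5 ms; might be shorter

class ShutteredTriggerMixin(SingleTrigger):
    """SingleTrigger variant whose trigger() also opens an assumed `shutter` device."""

    def trigger(self):
        shutter.open()                 # assumed ophyd shutter with open()/close()
        time.sleep(SHUTTER_SETTLE_S)   # brief pause before starting the exposure
        return super().trigger()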

@prjemian
Collaborator Author

prjemian commented Feb 5, 2020

Rigaku detector:

  • removed completed subscription
  • trigger(): brief pause (100 ms, could be shorter) after starting acquisition

@prjemian
Collaborator Author

prjemian commented Feb 5, 2020

AD_Acquire() plan:

@prjemian
Collaborator Author

prjemian commented Feb 5, 2020

Lambda testing:

  • ensure certain constants are defined with current values
  • move detector stage to desired in-beam position

@prjemian
Collaborator Author

prjemian commented Feb 5, 2020

Rigaku testing:

  • update workflow information (strings)

@prjemian
Collaborator Author

prjemian commented Feb 7, 2020

Issue #146 might stop us from reaching this goal. We can continue to operate but we know we'll have to restart the EPICS IOC often.

@qzhang234
Collaborator

Created a new milestone for this issue:

https://github.com/aps-8id-dys/ipython-8idiuser/milestone/12

@qzhang234
Collaborator

Closing this issue now

@prjemian
Collaborator Author

I can see closing this issue since 8-ID-I can operate with Bluesky for a week without switching back to SPEC. Continuous operation for one week, without unplanned exceptions due to the software, is the next step.
