This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Skipping jammed Rigaku measurement and restart run engine? #230

Closed
qzhang234 opened this issue Sep 27, 2020 · 9 comments · Fixed by #236
Labels
enhancement New feature or request

Comments

@qzhang234
Collaborator

I noticed the same error on the LabView panel when I ran XSPA with Spec. Guess the jamming we saw in #223 two weeks ago is not a Bluesky problem.

This raises a question: Is there a way to modify Bluesky so that it automatically moves on to the next measurement if XSPA does not respond for more than 60 s (each measurement should take no more than 5 s)?

@qzhang234 added the enhancement label on Sep 27, 2020
@qzhang234
Collaborator Author

One other thing I was thinking: is it possible to send an email from Bluesky when a scan crashes or hangs?

@prjemian
Collaborator

prjemian commented Sep 27, 2020

Yes. See EmailNotifications(). This example is now in the docs:

from apstools.utils import EmailNotifications

SENDER_EMAIL = "8idiuser@aps.anl.gov"
email_notices = EmailNotifications(SENDER_EMAIL)
email_notices.add_addresses(
    "joe.user@anl.gov",
    "instrument_team@aps.anl.gov",
    # others?
)

# then, when some condition occurs
if feedback_limits_approached:
    subject = "Feedback problem"
    message = "Feedback is very close to its limits."
    email_notices.send(subject, message)
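
To cover the "scan crashed" half of the question, one option is to wrap the scan in a small context manager that sends a notice when an exception escapes. This is only a sketch: `email_on_failure` is a hypothetical helper, not part of apstools, and `send` is assumed to be any callable taking `(subject, message)`, such as `email_notices.send` above.

```python
from contextlib import contextmanager


@contextmanager
def email_on_failure(send, subject="Bluesky scan failed"):
    """Call send(subject, message) if the wrapped block raises, then re-raise.

    Hypothetical helper: `send` is any callable(subject, message).
    """
    try:
        yield
    except Exception as exc:
        # Report the exception type and text, then let it propagate.
        send(subject, f"{type(exc).__name__}: {exc}")
        raise
```

Note that this only fires when an exception is actually raised. A scan that hangs silently raises nothing, so it still needs a timeout to turn the hang into an exception first. Ctrl+C (`KeyboardInterrupt`) is deliberately not reported, since it is not a subclass of `Exception`.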

@prjemian
Collaborator

> Is there a way to modify Bluesky so that it automatically moves on to the next measurement if XSPA does not respond for more than 60 s (each measurement should take no more than 5 s)?

If we can catch the timeout, for sure we can do this.

@prjemian
Collaborator

We might want to catch a sequence of n consecutive jams to make sure we do not retry a hopeless situation.
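
A rough sketch of that idea in plain Python, with no Bluesky calls: skip a measurement that times out, but give up once n jams happen in a row. The function name, the `(name, callable)` pairs, and the use of the built-in `TimeoutError` are all assumptions made for illustration; the exception a real detector timeout raises may be different.

```python
def run_skipping_jams(measurements, max_consecutive_jams=3):
    """Run each measurement, skipping jams, aborting after too many in a row.

    measurements : iterable of (name, zero-argument callable) pairs (hypothetical)
    Returns (completed_names, skipped_names).
    """
    completed, skipped = [], []
    consecutive = 0
    for name, measure in measurements:
        try:
            measure()
        except TimeoutError:
            skipped.append(name)
            consecutive += 1
            if consecutive >= max_consecutive_jams:
                # Several jams in a row: treat the detector as hopeless.
                raise RuntimeError(
                    f"{consecutive} consecutive jams; last was {name!r}"
                )
        else:
            completed.append(name)
            consecutive = 0  # any success resets the streak
    return completed, skipped
```

Resetting the counter on every success means isolated jams only cost one measurement each, while a detector that is stuck for good stops the run quickly.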

@qzhang234
Collaborator Author

qzhang234 commented Sep 27, 2020

Yes. I would say 3 retries would be enough.

This Rigaku timeout bug has occurred twice with Bluesky in #223 and once with Spec this week. It appears to be a recurring and reproducible problem. Nakaye doesn't know the source of the bug so we'll have to fix it from our end. Hopefully implementing the Bluesky re-throw will permanently fix this bug.

Also, now that I think about it, most of the jams or crashes when operating Rigaku/Bluesky can be fixed by simply pressing Ctrl+C and restarting the plan. Maybe this implementation is the last step towards our milestone of one week of continuous Bluesky user operation.

The beam will be down next Monday (09/28) at 8 am and won't come back until Thursday (10/01) at 8 am. This is a great opportunity, so let's get this done before the beam is back up. @prjemian Please let me know if there's anything I can help with.

Thanks!

@qzhang234
Collaborator Author

The same bug occurred again while running with Spec on 09/27, 11:35 pm. I'm therefore changing the label to 'high priority'.

It looks like our best chance is to run Bluesky for the week of 10/01 - 10/12 with the rethrow capability implemented.

@prjemian There's no beam till 10/01. Please advise on how we should start working on this. Thanks!

Screen Shot 2020-09-28 at 2 35 04 PM

Screen Shot 2020-09-28 at 2 32 13 PM

@prjemian prjemian mentioned this issue Sep 29, 2020
@prjemian
Collaborator

prjemian commented Sep 29, 2020

So, we want to implement a timeout around a call to yield from AD_Acquire().

If timeout, then:

  • wait for ~5 minutes
  • retry up to n times
  • if retries exhausted:
    • send email
    • detector is not responding so abort the scan.
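
The steps above could be sketched as a generator wrapper around the acquisition plan. This is not the code from the eventual fix; it is a minimal illustration under stated assumptions: `acquire_with_retries` and `DetectorNotResponding` are hypothetical names, the built-in `TimeoutError` stands in for whatever exception the detector timeout actually raises, and `notify` is any callable taking `(subject, message)`. It relies on exceptions from the inner plan surfacing at the `yield from`.

```python
import time


class DetectorNotResponding(RuntimeError):
    """Raised once every retry has timed out (hypothetical name)."""


def acquire_with_retries(acquire_plan, max_retries=3, wait_s=300.0,
                         notify=None, sleep=time.sleep):
    """Generator plan: run acquire_plan(), retrying when it times out.

    acquire_plan : zero-argument callable returning a fresh generator
    notify       : optional callable(subject, message)
    sleep        : injectable wait; a real Bluesky plan would more likely
                   yield a sleep message than block with time.sleep
    """
    for attempt in range(max_retries + 1):  # first try plus max_retries retries
        try:
            # A timeout raised inside the inner plan surfaces here.
            return (yield from acquire_plan())
        except TimeoutError as exc:
            if attempt == max_retries:
                if notify is not None:
                    notify("Detector not responding",
                           f"Gave up after {max_retries} retries: {exc}")
                # Re-raise as a distinct error so the caller aborts the scan.
                raise DetectorNotResponding(str(exc)) from exc
            sleep(wait_s)  # wait (~5 minutes by default) before retrying
```

Passing a factory rather than a generator matters here: a generator that has raised is finished, so each retry needs a fresh one, e.g. `lambda: AD_Acquire(...)` with whatever arguments the scan uses.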

@qzhang234
Collaborator Author

Just leaving a note that the LabView hang occurred again at 23:33 on 09/29.

Screen Shot 2020-09-30 at 12 41 01 PM

Screen Shot 2020-09-30 at 12 41 56 PM

@prjemian
Collaborator

prjemian commented Oct 5, 2020

Commit 2d2d672 should also handle the ReadTimeout problems affecting #233

3 participants