Discussion: capturing processed data #75

Open
danielballan opened this issue Jan 27, 2021 · 1 comment

Comments

@danielballan

Spinning off from this comment:

For the most part, we intend processed data to go in a separate BlueskyRun, which may reference the BlueskyRun(s) with the original data. There are several reasons to put this in a separate Run rather than an additional stream in the original Run.

  1. The data management for processed / derived / analyzed data may differ from that of raw data, for example in the rules about who can access it and how long it is retained in the system.
  2. For any given raw data set, there may be multiple processed / derived / analyzed data sets, and expressing this "one to many" relationship inside streams will get awkward.
  3. Many parts of the Bluesky infrastructure assume that once a BlueskyRun is complete (i.e. once the 'stop' document is emitted by the RunEngine) it will not change. This assumption simplifies a lot of things. Breaking it to add streams after the fact comes with a high complexity cost.

This is our working theory of how to capture analysis results in Databroker: https://blueskyproject.io/databroker/docs-rewrite-draft/how-to/store-analysis-results.html (Note: This link is to a preview of new Databroker documentation that is being evaluated by some users. It will be moved to https://blueskyproject.io/databroker/how-to/store-analysis-results.html.)

One could then imagine queries like "Show me all the processed results for Scan ID X," or "Given this processed result, find me the raw data."
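
As a rough sketch of how those two queries could work, here is a plain-Python illustration that does not use the Databroker API. It assumes each processed Run's 'start' document lists the uids of the raw Runs it was derived from under a hypothetical key, `raw_run_uids`. That key name, the `purpose` and `parameters` fields, and the helper functions are all made up for illustration, and an in-memory dict stands in for the catalog.

```python
import time
import uuid


def make_start(scan_id=None, **metadata):
    """Build a minimal 'start'-like document as a plain dict (illustrative only)."""
    doc = {"uid": str(uuid.uuid4()), "time": time.time(), **metadata}
    if scan_id is not None:
        doc["scan_id"] = scan_id
    return doc


# One raw Run, and two processed Runs derived from it with different parameters.
# 'raw_run_uids' is a hypothetical metadata key, not an established convention.
raw = make_start(scan_id=42, purpose="raw")
processed_a = make_start(
    purpose="processed", raw_run_uids=[raw["uid"]], parameters={"smoothing": 3}
)
processed_b = make_start(
    purpose="processed", raw_run_uids=[raw["uid"]], parameters={"smoothing": 5}
)

# A dict keyed on uid stands in for a real catalog.
catalog = {doc["uid"]: doc for doc in (raw, processed_a, processed_b)}


def processed_for_scan(catalog, scan_id):
    """'Show me all the processed results for Scan ID X.'"""
    raw_uids = {
        uid for uid, doc in catalog.items() if doc.get("scan_id") == scan_id
    }
    return [
        doc
        for doc in catalog.values()
        if raw_uids & set(doc.get("raw_run_uids", []))
    ]


def raw_for_processed(catalog, processed_uid):
    """'Given this processed result, find me the raw data.'"""
    return [catalog[uid] for uid in catalog[processed_uid].get("raw_run_uids", [])]
```

Because the reference lives in the processed Run's metadata, the raw Run never has to change after its 'stop' document, and any number of processed Runs can point back at the same raw Run.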

However, some analysis that can be done cheaply in real time during data acquisition, in a rote fashion that is highly unlikely to require re-processing with different parameters, might be done in the Ophyd/Bluesky layer as part of data acquisition and could be included in a stream in the original BlueskyRun. That particular case stays on the right side of points 1-3 above.

@gfabbris
Contributor

This is really nice, and I expect it to be very useful for our spectroscopy data processing (likely using larch). I will give it a try before I start asking a bunch of trivial stuff.

I suspect that the threshold for calling data processing "cheap" can be somewhat fuzzy. But would you call the data processing described in #42 cheap? Maybe it wasn't very clear there, but the reason I asked about it is that if we can add the processed XANES/XMCD to a new stream, then we could just plot that stream. Having this would also be useful to users.
