Add spectrum to beam object #158

phyy-nx · 2020-04-08T21:35:06Z

Adds these methods:
set_spectrum(flex.double energies, flex.double weights)
get_spectrum_energies()
get_spectrum_weights()
get_weighted_wavelength()

Two flex arrays, energies (eV) and weights (dimensionless), represent the spectrum as a scatter plot. This follows the conventions set up in nexusformat/definitions#717. The new function get_weighted_wavelength uses a simple weighted mean to determine a wavelength.

It is ok if the beam has a wavelength != get_weighted_wavelength. This represents a calibrated wavelength determined from processing the spectrum.

is_similar_to tests get_weighted_wavelength between the two beams, if a spectrum is present.

Includes tests.

phyy-nx · 2020-04-08T21:36:23Z

This is an alternate version of #157 that doesn't use class derivation. It still does serialize the spectrum to the json file which increases file sizes dramatically. This is a real problem, and some solutions are being discussed.

phyy-nx · 2020-04-14T21:09:02Z

Any issues? I'd like to merge this and close #157 tomorrow, if possible. Thanks!

graeme-winter · 2020-04-15T12:10:25Z

This is an alternate version of #157 that doesn't use class derivation. It still does serialize the spectrum to the json file which increases file sizes dramatically. This is a real problem, and some solutions are being discussed.

You have a proposal of how to fix this before merging the PR?

phyy-nx · 2020-04-15T15:58:05Z

TL;DR: we discussed several ways of fixing the serialization issue but nothing seems to be straightforward. The timings/file sizes are impacted but processing seems fast enough for us that we want to punt the issue for now.

Details: this implementation saves two big arrays of numbers for each shot. The .expt file sizes get a lot bigger. I ran some tests using SwissFEL master files for the JF16M (1400+ images). Some numbers:

No spectra:

master h5 file: 1.2MB
imported.expt: 2.0MB
import time: 18.7s
Read time*: 1.4s

With spectra:

master h5 file: 58MB
imported.expt: 144MB
import time: 41.8s
Read time*: 12.9s

(* time to de-serialize the 1400+ experiment list)

I do think I can save some space in the master h5 file by only storing the energy spectrum axis once, but that won't affect the imported.expt file size or the timings.

Unfortunately I can't come up with a way to fix this. I'd prefer not to serialize the spectra at all, presuming it to be immutable, but then the beam object would need a link back to the raw data, which is not currently a thing in dxtbx. I mucked around with this a bit, creating a custom NeXusBeam object in nexus.py that had a link back to the raw data and that link was serialized instead, but I gave up on it pretty quickly.

graeme-winter · 2020-04-16T17:27:21Z

So following the call we had a chat about this, and we have a few concerns. One is that this bakes in the nexus model of a spectrum to be the only way we can describe it, for example a Gaussian is not allowed. Also for a neutron spallation source we may want to have a very broad spectrum which could get cumbersome with this model

I wonder if a better model for this could be for a beam to have a spectrum as a separate object which conforms to some API but which could have multiple implementations including the one you have above. An alternative could be a simple Gaussian with 1% band pass say for multi layer optic systems

Thoughts?

graeme-winter · 2020-04-16T17:29:15Z

Obviously the null value could be a standard monochromatic beam

graeme-winter · 2020-04-16T17:34:17Z

For what it’s worth this will also extend conceptually for the beam shape as well, where that could be a simple model or a complex measured image. So I think revisiting the fundamental design here could help

I welcome alternative perspectives

phyy-nx · 2020-04-16T18:25:22Z

@graeme-winter I'm interested in this as an option. I do think we can punt it down the road a bit. It is very easy to write backwards compatibility into the deserialization methods if the object model changes.

phyy-nx · 2020-04-16T18:33:06Z

I have created a new branch, spectrum_beam2_usepickles, which greatly saves on file space and read time. See commit 47852f2. If agreeable, I'll add that commit to this pull request. The change saves the spectra to an external pickle file instead of the json ascii file. This is similar to the lookup.mask feature that already exists. (Note it does exacerbate dials/dials#1238.)

New stats:
With external spectra pickle:

master h5 file: 30MB <-- savings comes from now only storing 1 energy array
imported.expt: 2MB, + 18MB spectrum.pickle
import time: 31.4s
Read time*: 2.1s

These are much better numbers. Note, I couldn't use the external lookup code directly as that applies only to image sets.

graeme-winter · 2020-04-16T18:45:19Z

@graeme-winter I'm interested in this as an option. I do think we can punt it down the road a bit. It is very easy to write backwards compatibility into the deserialization methods if the object model changes.

Hows about make this an external object, which does it's own lookup via pickle as above - to build in the idea that it may evolve. If you embed it directly in beam it can never be removed... I have bad experiences here.

NXspectrum class or similar which implements spectrum, with a spectrum pointer on beam?

phyy-nx · 2020-04-16T22:57:41Z

Thinking more about this, I'm not totally on board with a separate object. I'm arriving at the conclusion that NeXus and dxtbx should be more tightly linked in our development efforts. In other words, if there is something missing that we need to describe in our experiments, we should consider the NeXus solution that will be used to represent the metadata, and match the dxtbx interfaces to that spec. If we need something in NeXus that isn't there to support a specific dxtbx interface that we need, then we should work on adding it to NeXus.

In this case, the provided examples are monchromatic source, source with 1D spectra, source with a Gaussian spectrum, source with a known beam profile, pink beam, and neutron (of note, there is a TOF neutron NeXus application definition). I think all these cases are supported in NeXus already (see for example the profile and incident_beam_size fields for the beam profile case, and incident_wavelength_spread for a gaussian spectra). These things are all properties on the NXbeam class, so preserving that interface in dxtbx makes sense to me. After all, a spectrum and a beam shape are different, and they are both something that a beam has.

If you still think that a beam object should produce a spectrum object, then what would the base class of the spectrum object be, such that other things could derive from it, such as a beam profile? Ideas:

BeamShape
BeamDetails
BeamParameters
BeamBucket
BeamExtras
BeamSpec

alyubimov · 2020-04-17T20:56:30Z

I've got a very naive question, from a chemist's perspective, about this:

No spectra:

master h5 file: 1.2MB

imported.expt: 2.0MB

import time: 18.7s

Read time*: 1.4s

With spectra:

master h5 file: 58MB

imported.expt: 144MB

import time: 41.8s

Read time*: 12.9s

(* time to de-serialize the 1400+ experiment list)

What-all is recorded in the "spectrum" that it requires this much memory? I would've thought a spectrum could be expressed as a 1-dimensional list: measurement start (e.g. 120 Angstrom), step size (e.g. 5 Angstrom), and then the list of counts. What else is there that requires 140+ MB?

phyy-nx · 2020-04-17T21:15:41Z

So it's a great question. These spectra are 2500 energy channels and 2500 'weights' per image. Here's an example spectra from SwissFEL (weights vs. eV):

In the imported json file, each number is represented in ASCII and seperated by things like newlines, spaces, and commas, that's what adds up to 144 MB. The new version which uses an external pickle file for the spectra, is more like 18MB, plus it only saves the energy channels once. Much more realistic for 1400*2500*double precision for the whole set.

alyubimov · 2020-04-17T21:25:28Z

Okay, the amount of information you're referring to is oceanically vast compared to the graph of a spectrum that you're showing, which to my naive eye looks like a 1D array of ~100 integers. Is there a reason to keep all the "things like newlines, spaces, and commas", as well as all the other stuff that adds up to 144 MB? Do you use - or anticipate using - any of that in any calculations?

phyy-nx · 2020-04-17T21:37:54Z

Rounding to int is possible, maybe even reasonable, but yah, it's 2500+ channels. Reducing the number of channels would require smoothing and that defeats the point of the analysis we want to do. And right, the pickle version just saves the flex arrays so it's as compact as a double can be, no other nonsense. Even so, the best approach would be links to the original data, which could be done if we moved to HDF5 (see #1238).

…

On Fri, Apr 17, 2020 at 2:25 PM Artem Lyubimov ***@***.***> wrote: Okay, the amount of information you're referring to is oceanically vast compared to the graph of a spectrum that you're showing, which to my naive eye looks like a 1D array of ~100 integers. Is there a reason to keep all the "things like newlines, spaces, and commas", as well as all the other stuff that adds up to 144 MB? Do you use - or anticipate using - any of that in any calculations? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#158 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADGY5SSSUGN6KVPQ5EHPH33RNDCNNANCNFSM4MEICSZA> .

graeme-winter · 2020-04-18T05:36:23Z

So yah maybe that is not so easy 🤔

dials/dials#1238 (comment)

phyy-nx · 2020-05-04T22:15:40Z

Ok, now that #151 is merged, I've changed the base of this PR to master, rebased on master, and I've added 69d1d56 to move the spectra into pickle files instead of the json text. This is again ready for review.

@graeme-winter if you have requests for a different object model could you restate them here?

dermen · 2020-05-05T15:54:45Z

Rather than use pickle how about use hdf5 ? I worry about py2/3 compatibility with pickle.

dagewa · 2020-05-05T17:13:49Z

There is also a security risk with pickle, which means it should ideally be avoided in cases like this when you are just writing data. See dials/dials#512 for details. This is one of the main reasons that we switched to using message pack for reflection tables.

phyy-nx · 2020-05-06T00:32:14Z

Ok, I've got a version using simple h5 files now commited. Thoughts?

phyy-nx · 2020-05-06T01:00:19Z

Stats update. Pickle version (as before):
master h5 file: 30MB
imported.expt: 2MB, + 18MB spectrum.pickle
import time: 31.4s
Read time: 2.1s

New h5 version:
master h5 file: 30MB
imported.expt: 2MB, + 29MB spectrum.h5
import time: 33.9s
Read time: ~2.5 s

graeme-winter · 2020-05-06T12:38:43Z

For what it's worth I am not hostile to the pickle solution, since this is

in line with what we do with masks
leaves a flag in place that there is something here which needs doing properly

Since we need to do $something properly anyway for masks maybe we could do those at the same time? Or at the very least not come up with two different solutions to one problem.

dagewa · 2020-05-06T12:43:18Z

The trouble with flags is that they're often just left waving in the breeze. Replacing pickle for masks is long overdue. Closing up on a year now: dials/dials#821

graeme-winter · 2020-05-06T12:45:51Z

@dagewa you are not wrong -> guess we need to define how we are going to store this data in HDF5 then (such that the spectra do not somehow collide with masks, and we don't find ourselves being painted into a corner, and we do not have to keep track of many auxiliary files) -> 🤔

phyy-nx · 2020-05-12T20:47:48Z

Bump. I'd really like to resolve this as the NeXus Gold Standard paper is in revisions again.

Adds these methods: set_spectrum(flex.double energies, flex.double weights) get_spectrum_energies() get_spectrum_weights() get_weighted_wavelength() Two flex arrays, energies (eV) and weights (dimensionless), represent the spectrum as a scatter plot. This follows the conventions set up in nexusformat/definitions#717. The new function get_weighted_wavelength uses a simple weighted mean to determine a wavelength. It is ok if the beam has a wavelength != get_weighted_wavelength. This represents a calibrated wavelength determined from processing the spectrum. is_similar_to tests get_weighted_wavelength between the two beams, if a spectrum is present. Includes tests.

- On serializing, create an external pickle file with the beam spectra and replace the spectra arrays in the dictionary with indices into the pickle file - On deserializing, read the external pickle file back and set the spectra appropiately - Change to_dict and from_dict to use flex.doubles instead of lists of floats. Not a primitive dictionary any longer but much faster to read/write

@nksauter

(thanks @nksauter)

phyy-nx · 2020-05-28T23:11:30Z

Rebased onto master.

phyy-nx · 2020-07-23T23:00:08Z

Hi folks, we want to merge this. We are needing it for our research, specifically for a DOE milestone due end of September. If there are objections, please sing out and let us know ASAP. We recognize that the ideal solution would resolve this and dials/dials#1238 at the same time, but we can't make the perfect the enemy of the good. Thanks.

Anthchirp · 2020-07-24T00:15:55Z

The last time the tests were run on this PR they failed. I've just updated the branch to see what the current status is.
Other discussions aside at the minimum the tests should pass for it to be merged.

graeme-winter · 2020-07-24T12:58:30Z

@phyy-nx I am in a difficult position here

I have raised concerns with this approach above e.g. one specific NeXus implementation is what we have to use for any spectral information rather than other approaches which may be computationally well justified e.g. for a single stable undulator peak from a pink beam source like VMXi. I am also not happy with "NeXus defines how you are allowed to write data analysis software" as a concept.

Has the serialisation issue discussed above been resolved?

Re: " It is very easy to write backwards compatibility into the deserialization methods if the object model changes." - every time we have been in this situation it has proven impossible to revisit, because it gets published somewhere and must never change after that. This will I expect happen here.

However I am in a really nasty situation here. None of the concerns I have raised have been addressed (as far as I am aware) however to not merge this is going to prevent DOE grant deliverables being delivered and I will be held responsible for this.

So, the only winning move is not to play.

rjgildea · 2020-07-24T13:15:02Z

Stats update. Pickle version (as before):
master h5 file: 30MB
imported.expt: 2MB, + 18MB spectrum.pickle
import time: 31.4s
Read time: 2.1s

New h5 version:
master h5 file: 30MB
imported.expt: 2MB, + 29MB spectrum.h5
import time: 33.9s
Read time: ~2.5 s

If I understand this correctly, the master.h5 file contains both the raw image data, and per-image beam spectra? And after importing stage, you have extracted the metadata, and an external data file containing only the per-image beam spectra? Yet these files are somehow (almost) as large as the master.h5 which contains the entire raw image data?

Perhaps I'm missing something, but I don't understand why it is necessary to extract and store the per-image beam spectrum in an external pickle/h5 file? Couldn't this be obtained from the master.h5 file on demand, analagously with how we access the raw image data via imageset.get_raw_data()?

rjgildea · 2020-07-24T13:29:35Z

Perhaps I'm missing something, but I don't understand why it is necessary to extract and store the per-image beam spectrum in an external pickle/h5 file? Couldn't this be obtained from the master.h5 file on demand, analagously with how we

Ah, I see you touched on this point right at the top of the PR:

I'd prefer not to serialize the spectra at all, presuming it to be immutable, but then the beam object would need a link back to the raw data, which is not currently a thing in dxtbx. I mucked around with this a bit, creating a custom NeXusBeam object in nexus.py that had a link back to the raw data and that link was serialized instead, but I gave up on it pretty quickly.

Perhaps this is something that could be explored in more detail, as this seems like it would be a much preferable solution, if we can find a way to make it work.

dagewa · 2020-07-24T13:50:29Z

Although experiment.beam does not link back to raw data, experiment.imageset does and both are simply attributes of an Experiment. I can understand resistance to making the core dxtbx Beam model rely on external data lookup, which may be why it feels like a separate object is required here. Having experiment.spectrum live alongside experiment.beam does not seem so bad to me. Another advantage of separation is that research work on sophisticated models like beam spectra can progress without much interference with the core library. In future we could consider merging an implementation into the core if it has proven to be standard, sort of like boost --> stl

graeme-winter · 2020-07-24T14:33:14Z

Although experiment.beam does not link back to raw data, experiment.imageset does and both are simply attributes of an Experiment. I can understand resistance to making the core dxtbx Beam model rely on external data lookup, which may be why it feels like a separate object is required here. Having experiment.spectrum live alongside experiment.beam does not seem so bad to me. Another advantage of separation is that research work on sophisticated models like beam spectra can progress without much interference with the core library. In future we could consider merging an implementation into the core if it has proven to be standard, sort of like boost --> stl

👍

I like the suggestion of experiment.spectrum a lot

phyy-nx · 2020-07-24T15:24:02Z

What's being asked here is can we avoid serializing the spectrum? Therefore we have a Fundamental Question: is the spectrum immutable? Can it never be refined in any way? If so, then it doesn't need to be serialized.

We currently have a use case where we think the spectrum energy channels may be wrong by a constant but unknown offset that we need to refine. Under this PR, we can trivially update the energy channels in our experiment list and serialize the result for downstream processing. If the spectrum is read-only then we need another approach, perhaps modifying the raw data itself. Now as it turns out, in NeXus this is easy, as it supports variants that allow us to update the metadata, so maybe that's ok. Does seem to defeat the purpose of having experimental models though.

phyy-nx force-pushed the spectrum_beam2 branch from 1564647 to dd955f5 Compare April 9, 2020 20:31

graeme-winter mentioned this pull request Apr 15, 2020

Add SpectrumBeam as derived class from Beam object #157

Closed

phyy-nx changed the base branch from multiwavelength_nexus to master May 4, 2020 21:55

phyy-nx force-pushed the spectrum_beam2 branch from dd955f5 to 6d96b4e Compare May 4, 2020 22:07

phyy-nx requested a review from graeme-winter May 4, 2020 22:16

phyy-nx added 5 commits May 28, 2020 16:09

Need an additional constructor to make pickling work.

e8d7eba

Rejigger as h5 instead of pickle

74d125e

Bugfix for experiment lists with 1 item only

4e02a48

(thanks @nksauter)

phyy-nx force-pushed the spectrum_beam2 branch from 36ba245 to 4e02a48 Compare May 28, 2020 23:10

Merge branch 'master' into spectrum_beam2

125390d

Get flex from somewhere else

674fa89

phyy-nx mentioned this pull request Jul 27, 2020

Add Spectrum as a read-only class obtainable from an imageset #201

Merged

phyy-nx closed this Jul 31, 2020

phyy-nx deleted the spectrum_beam2 branch July 31, 2020 18:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add spectrum to beam object #158

Add spectrum to beam object #158

phyy-nx commented Apr 8, 2020

phyy-nx commented Apr 8, 2020

phyy-nx commented Apr 14, 2020

graeme-winter commented Apr 15, 2020

phyy-nx commented Apr 15, 2020

graeme-winter commented Apr 16, 2020

graeme-winter commented Apr 16, 2020

graeme-winter commented Apr 16, 2020

phyy-nx commented Apr 16, 2020

phyy-nx commented Apr 16, 2020 •

edited

Loading

graeme-winter commented Apr 16, 2020

phyy-nx commented Apr 16, 2020

alyubimov commented Apr 17, 2020

phyy-nx commented Apr 17, 2020

alyubimov commented Apr 17, 2020

phyy-nx commented Apr 17, 2020 via email

graeme-winter commented Apr 18, 2020

phyy-nx commented May 4, 2020

dermen commented May 5, 2020

dagewa commented May 5, 2020

phyy-nx commented May 6, 2020

phyy-nx commented May 6, 2020

graeme-winter commented May 6, 2020 •

edited

Loading

dagewa commented May 6, 2020

graeme-winter commented May 6, 2020

phyy-nx commented May 12, 2020

phyy-nx commented May 28, 2020

phyy-nx commented Jul 23, 2020 •

edited

Loading

Anthchirp commented Jul 24, 2020

graeme-winter commented Jul 24, 2020 •

edited

Loading

rjgildea commented Jul 24, 2020

rjgildea commented Jul 24, 2020

dagewa commented Jul 24, 2020 •

edited

Loading

graeme-winter commented Jul 24, 2020

phyy-nx commented Jul 24, 2020

Add spectrum to beam object #158

Add spectrum to beam object #158

Conversation

phyy-nx commented Apr 8, 2020

phyy-nx commented Apr 8, 2020

phyy-nx commented Apr 14, 2020

graeme-winter commented Apr 15, 2020

phyy-nx commented Apr 15, 2020

graeme-winter commented Apr 16, 2020

graeme-winter commented Apr 16, 2020

graeme-winter commented Apr 16, 2020

phyy-nx commented Apr 16, 2020

phyy-nx commented Apr 16, 2020 • edited Loading

graeme-winter commented Apr 16, 2020

phyy-nx commented Apr 16, 2020

alyubimov commented Apr 17, 2020

phyy-nx commented Apr 17, 2020

alyubimov commented Apr 17, 2020

phyy-nx commented Apr 17, 2020 via email

graeme-winter commented Apr 18, 2020

phyy-nx commented May 4, 2020

dermen commented May 5, 2020

dagewa commented May 5, 2020

phyy-nx commented May 6, 2020

phyy-nx commented May 6, 2020

graeme-winter commented May 6, 2020 • edited Loading

dagewa commented May 6, 2020

graeme-winter commented May 6, 2020

phyy-nx commented May 12, 2020

phyy-nx commented May 28, 2020

phyy-nx commented Jul 23, 2020 • edited Loading

Anthchirp commented Jul 24, 2020

graeme-winter commented Jul 24, 2020 • edited Loading

rjgildea commented Jul 24, 2020

rjgildea commented Jul 24, 2020

dagewa commented Jul 24, 2020 • edited Loading

graeme-winter commented Jul 24, 2020

phyy-nx commented Jul 24, 2020

phyy-nx commented Apr 16, 2020 •

edited

Loading

graeme-winter commented May 6, 2020 •

edited

Loading

phyy-nx commented Jul 23, 2020 •

edited

Loading

graeme-winter commented Jul 24, 2020 •

edited

Loading

dagewa commented Jul 24, 2020 •

edited

Loading