Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add spectrum to beam object #158

Closed
wants to merge 7 commits into from
Closed

Add spectrum to beam object #158

wants to merge 7 commits into from

Conversation

phyy-nx
Copy link
Contributor

@phyy-nx phyy-nx commented Apr 8, 2020

Adds these methods:
set_spectrum(flex.double energies, flex.double weights)
get_spectrum_energies()
get_spectrum_weights()
get_weighted_wavelength()

Two flex arrays, energies (eV) and weights (dimensionless), represent the spectrum as a scatter plot. This follows the conventions set up in nexusformat/definitions#717. The new function get_weighted_wavelength uses a simple weighted mean to determine a wavelength.

It is ok if the beam has a wavelength != get_weighted_wavelength. This represents a calibrated wavelength determined from processing the spectrum.

is_similar_to tests get_weighted_wavelength between the two beams, if a spectrum is present.

Includes tests.

@phyy-nx
Copy link
Contributor Author

phyy-nx commented Apr 8, 2020

This is an alternate version of #157 that doesn't use class derivation. It still does serialize the spectrum to the json file which increases file sizes dramatically. This is a real problem, and some solutions are being discussed.

@phyy-nx
Copy link
Contributor Author

phyy-nx commented Apr 14, 2020

Any issues? I'd like to merge this and close #157 tomorrow, if possible. Thanks!

@graeme-winter
Copy link
Collaborator

This is an alternate version of #157 that doesn't use class derivation. It still does serialize the spectrum to the json file which increases file sizes dramatically. This is a real problem, and some solutions are being discussed.

You have a proposal of how to fix this before merging the PR?

@phyy-nx
Copy link
Contributor Author

phyy-nx commented Apr 15, 2020

TL;DR: we discussed several ways of fixing the serialization issue but nothing seems to be straightforward. The timings/file sizes are impacted but processing seems fast enough for us that we want to punt the issue for now.

Details: this implementation saves two big arrays of numbers for each shot. The .expt file sizes get a lot bigger. I ran some tests using SwissFEL master files for the JF16M (1400+ images). Some numbers:

No spectra:

  • master h5 file: 1.2MB
  • imported.expt: 2.0MB
  • import time: 18.7s
  • Read time*: 1.4s

With spectra:

  • master h5 file: 58MB
  • imported.expt: 144MB
  • import time: 41.8s
  • Read time*: 12.9s

(* time to de-serialize the 1400+ experiment list)

I do think I can save some space in the master h5 file by only storing the energy spectrum axis once, but that won't affect the imported.expt file size or the timings.

Unfortunately I can't come up with a way to fix this. I'd prefer not to serialize the spectra at all, presuming it to be immutable, but then the beam object would need a link back to the raw data, which is not currently a thing in dxtbx. I mucked around with this a bit, creating a custom NeXusBeam object in nexus.py that had a link back to the raw data and that link was serialized instead, but I gave up on it pretty quickly.

@graeme-winter
Copy link
Collaborator

So following the call we had a chat about this, and we have a few concerns. One is that this bakes in the nexus model of a spectrum to be the only way we can describe it, for example a Gaussian is not allowed. Also for a neutron spallation source we may want to have a very broad spectrum which could get cumbersome with this model

I wonder if a better model for this could be for a beam to have a spectrum as a separate object which conforms to some API but which could have multiple implementations including the one you have above. An alternative could be a simple Gaussian with 1% band pass say for multi layer optic systems

Thoughts?

@graeme-winter
Copy link
Collaborator

Obviously the null value could be a standard monochromatic beam

@graeme-winter
Copy link
Collaborator

For what it’s worth this will also extend conceptually for the beam shape as well, where that could be a simple model or a complex measured image. So I think revisiting the fundamental design here could help

I welcome alternative perspectives

@phyy-nx
Copy link
Contributor Author

phyy-nx commented Apr 16, 2020

@graeme-winter I'm interested in this as an option. I do think we can punt it down the road a bit. It is very easy to write backwards compatibility into the deserialization methods if the object model changes.

@phyy-nx
Copy link
Contributor Author

phyy-nx commented Apr 16, 2020

I have created a new branch, spectrum_beam2_usepickles, which greatly saves on file space and read time. See commit 47852f2. If agreeable, I'll add that commit to this pull request. The change saves the spectra to an external pickle file instead of the json ascii file. This is similar to the lookup.mask feature that already exists. (Note it does exacerbate dials/dials#1238.)

New stats:
With external spectra pickle:

master h5 file: 30MB <-- savings comes from now only storing 1 energy array
imported.expt: 2MB, + 18MB spectrum.pickle
import time: 31.4s
Read time*: 2.1s

These are much better numbers. Note, I couldn't use the external lookup code directly as that applies only to image sets.

@graeme-winter
Copy link
Collaborator

@graeme-winter I'm interested in this as an option. I do think we can punt it down the road a bit. It is very easy to write backwards compatibility into the deserialization methods if the object model changes.

Hows about make this an external object, which does it's own lookup via pickle as above - to build in the idea that it may evolve. If you embed it directly in beam it can never be removed... I have bad experiences here.

NXspectrum class or similar which implements spectrum, with a spectrum pointer on beam?

@phyy-nx
Copy link
Contributor Author

phyy-nx commented Apr 16, 2020

Thinking more about this, I'm not totally on board with a separate object. I'm arriving at the conclusion that NeXus and dxtbx should be more tightly linked in our development efforts. In other words, if there is something missing that we need to describe in our experiments, we should consider the NeXus solution that will be used to represent the metadata, and match the dxtbx interfaces to that spec. If we need something in NeXus that isn't there to support a specific dxtbx interface that we need, then we should work on adding it to NeXus.

In this case, the provided examples are monchromatic source, source with 1D spectra, source with a Gaussian spectrum, source with a known beam profile, pink beam, and neutron (of note, there is a TOF neutron NeXus application definition). I think all these cases are supported in NeXus already (see for example the profile and incident_beam_size fields for the beam profile case, and incident_wavelength_spread for a gaussian spectra). These things are all properties on the NXbeam class, so preserving that interface in dxtbx makes sense to me. After all, a spectrum and a beam shape are different, and they are both something that a beam has.

If you still think that a beam object should produce a spectrum object, then what would the base class of the spectrum object be, such that other things could derive from it, such as a beam profile? Ideas:

  • BeamShape
  • BeamDetails
  • BeamParameters
  • BeamBucket
  • BeamExtras
  • BeamSpec

@alyubimov
Copy link

I've got a very naive question, from a chemist's perspective, about this:

No spectra:

  • master h5 file: 1.2MB
  • imported.expt: 2.0MB
  • import time: 18.7s
  • Read time*: 1.4s

With spectra:

  • master h5 file: 58MB
  • imported.expt: 144MB
  • import time: 41.8s
  • Read time*: 12.9s

(* time to de-serialize the 1400+ experiment list)

What-all is recorded in the "spectrum" that it requires this much memory? I would've thought a spectrum could be expressed as a 1-dimensional list: measurement start (e.g. 120 Angstrom), step size (e.g. 5 Angstrom), and then the list of counts. What else is there that requires 140+ MB?

@phyy-nx
Copy link
Contributor Author

phyy-nx commented Apr 17, 2020

So it's a great question. These spectra are 2500 energy channels and 2500 'weights' per image. Here's an example spectra from SwissFEL (weights vs. eV):

image

In the imported json file, each number is represented in ASCII and seperated by things like newlines, spaces, and commas, that's what adds up to 144 MB. The new version which uses an external pickle file for the spectra, is more like 18MB, plus it only saves the energy channels once. Much more realistic for 1400*2500*double precision for the whole set.

@alyubimov
Copy link

Okay, the amount of information you're referring to is oceanically vast compared to the graph of a spectrum that you're showing, which to my naive eye looks like a 1D array of ~100 integers. Is there a reason to keep all the "things like newlines, spaces, and commas", as well as all the other stuff that adds up to 144 MB? Do you use - or anticipate using - any of that in any calculations?

@phyy-nx
Copy link
Contributor Author

phyy-nx commented Apr 17, 2020 via email

@graeme-winter
Copy link
Collaborator

So yah maybe that is not so easy 🤔

dials/dials#1238 (comment)

@phyy-nx phyy-nx changed the base branch from multiwavelength_nexus to master May 4, 2020 21:55
@phyy-nx
Copy link
Contributor Author

phyy-nx commented May 4, 2020

Ok, now that #151 is merged, I've changed the base of this PR to master, rebased on master, and I've added 69d1d56 to move the spectra into pickle files instead of the json text. This is again ready for review.

@graeme-winter if you have requests for a different object model could you restate them here?

@phyy-nx phyy-nx requested a review from graeme-winter May 4, 2020 22:16
@dermen
Copy link

dermen commented May 5, 2020

Rather than use pickle how about use hdf5 ? I worry about py2/3 compatibility with pickle.

@dagewa
Copy link
Member

dagewa commented May 5, 2020

There is also a security risk with pickle, which means it should ideally be avoided in cases like this when you are just writing data. See dials/dials#512 for details. This is one of the main reasons that we switched to using message pack for reflection tables.

@phyy-nx
Copy link
Contributor Author

phyy-nx commented May 6, 2020

Ok, I've got a version using simple h5 files now commited. Thoughts?

@phyy-nx
Copy link
Contributor Author

phyy-nx commented May 6, 2020

Stats update. Pickle version (as before):
master h5 file: 30MB
imported.expt: 2MB, + 18MB spectrum.pickle
import time: 31.4s
Read time: 2.1s

New h5 version:
master h5 file: 30MB
imported.expt: 2MB, + 29MB spectrum.h5
import time: 33.9s
Read time: ~2.5 s

@graeme-winter
Copy link
Collaborator

graeme-winter commented May 6, 2020

For what it's worth I am not hostile to the pickle solution, since this is

  • in line with what we do with masks
  • leaves a flag in place that there is something here which needs doing properly

Since we need to do $something properly anyway for masks maybe we could do those at the same time? Or at the very least not come up with two different solutions to one problem.

@dagewa
Copy link
Member

dagewa commented May 6, 2020

The trouble with flags is that they're often just left waving in the breeze. Replacing pickle for masks is long overdue. Closing up on a year now: dials/dials#821

@graeme-winter
Copy link
Collaborator

@dagewa you are not wrong -> guess we need to define how we are going to store this data in HDF5 then (such that the spectra do not somehow collide with masks, and we don't find ourselves being painted into a corner, and we do not have to keep track of many auxiliary files) -> 🤔

@phyy-nx
Copy link
Contributor Author

phyy-nx commented May 12, 2020

Bump. I'd really like to resolve this as the NeXus Gold Standard paper is in revisions again.

phyy-nx added 5 commits May 28, 2020 16:09
Adds these methods:
set_spectrum(flex.double energies, flex.double weights)
get_spectrum_energies()
get_spectrum_weights()
get_weighted_wavelength()

Two flex arrays, energies (eV) and weights (dimensionless), represent the spectrum as a scatter plot.  This follows the conventions set up in nexusformat/definitions#717. The new function get_weighted_wavelength uses a simple weighted mean to determine a wavelength.

It is ok if the beam has a wavelength != get_weighted_wavelength.  This represents a calibrated wavelength determined from processing the spectrum.

is_similar_to tests get_weighted_wavelength between the two beams, if a spectrum is present.

Includes tests.
- On serializing, create an external pickle file with the beam spectra and replace the spectra arrays in the dictionary with indices into the pickle file
- On deserializing, read the external pickle file back and set the spectra appropiately
- Change to_dict and from_dict to use flex.doubles instead of lists of floats. Not a primitive dictionary any longer but much faster to read/write
@phyy-nx
Copy link
Contributor Author

phyy-nx commented May 28, 2020

Rebased onto master.

@phyy-nx
Copy link
Contributor Author

phyy-nx commented Jul 23, 2020

Hi folks, we want to merge this. We are needing it for our research, specifically for a DOE milestone due end of September. If there are objections, please sing out and let us know ASAP. We recognize that the ideal solution would resolve this and dials/dials#1238 at the same time, but we can't make the perfect the enemy of the good. Thanks.

@Anthchirp
Copy link
Member

The last time the tests were run on this PR they failed. I've just updated the branch to see what the current status is.
Other discussions aside at the minimum the tests should pass for it to be merged.

@graeme-winter
Copy link
Collaborator

graeme-winter commented Jul 24, 2020

@phyy-nx I am in a difficult position here

I have raised concerns with this approach above e.g. one specific NeXus implementation is what we have to use for any spectral information rather than other approaches which may be computationally well justified e.g. for a single stable undulator peak from a pink beam source like VMXi. I am also not happy with "NeXus defines how you are allowed to write data analysis software" as a concept.

Has the serialisation issue discussed above been resolved?

Re: " It is very easy to write backwards compatibility into the deserialization methods if the object model changes." - every time we have been in this situation it has proven impossible to revisit, because it gets published somewhere and must never change after that. This will I expect happen here.

However I am in a really nasty situation here. None of the concerns I have raised have been addressed (as far as I am aware) however to not merge this is going to prevent DOE grant deliverables being delivered and I will be held responsible for this.

So, the only winning move is not to play.

@rjgildea
Copy link
Contributor

Stats update. Pickle version (as before):
master h5 file: 30MB
imported.expt: 2MB, + 18MB spectrum.pickle
import time: 31.4s
Read time: 2.1s

New h5 version:
master h5 file: 30MB
imported.expt: 2MB, + 29MB spectrum.h5
import time: 33.9s
Read time: ~2.5 s

If I understand this correctly, the master.h5 file contains both the raw image data, and per-image beam spectra? And after importing stage, you have extracted the metadata, and an external data file containing only the per-image beam spectra? Yet these files are somehow (almost) as large as the master.h5 which contains the entire raw image data?

Perhaps I'm missing something, but I don't understand why it is necessary to extract and store the per-image beam spectrum in an external pickle/h5 file? Couldn't this be obtained from the master.h5 file on demand, analagously with how we access the raw image data via imageset.get_raw_data()?

@rjgildea
Copy link
Contributor

Perhaps I'm missing something, but I don't understand why it is necessary to extract and store the per-image beam spectrum in an external pickle/h5 file? Couldn't this be obtained from the master.h5 file on demand, analagously with how we

Ah, I see you touched on this point right at the top of the PR:

I'd prefer not to serialize the spectra at all, presuming it to be immutable, but then the beam object would need a link back to the raw data, which is not currently a thing in dxtbx. I mucked around with this a bit, creating a custom NeXusBeam object in nexus.py that had a link back to the raw data and that link was serialized instead, but I gave up on it pretty quickly.

Perhaps this is something that could be explored in more detail, as this seems like it would be a much preferable solution, if we can find a way to make it work.

@dagewa
Copy link
Member

dagewa commented Jul 24, 2020

Although experiment.beam does not link back to raw data, experiment.imageset does and both are simply attributes of an Experiment. I can understand resistance to making the core dxtbx Beam model rely on external data lookup, which may be why it feels like a separate object is required here. Having experiment.spectrum live alongside experiment.beam does not seem so bad to me. Another advantage of separation is that research work on sophisticated models like beam spectra can progress without much interference with the core library. In future we could consider merging an implementation into the core if it has proven to be standard, sort of like boost --> stl

@graeme-winter
Copy link
Collaborator

Although experiment.beam does not link back to raw data, experiment.imageset does and both are simply attributes of an Experiment. I can understand resistance to making the core dxtbx Beam model rely on external data lookup, which may be why it feels like a separate object is required here. Having experiment.spectrum live alongside experiment.beam does not seem so bad to me. Another advantage of separation is that research work on sophisticated models like beam spectra can progress without much interference with the core library. In future we could consider merging an implementation into the core if it has proven to be standard, sort of like boost --> stl

👍

I like the suggestion of experiment.spectrum a lot

@phyy-nx
Copy link
Contributor Author

phyy-nx commented Jul 24, 2020

What's being asked here is can we avoid serializing the spectrum? Therefore we have a Fundamental Question: is the spectrum immutable? Can it never be refined in any way? If so, then it doesn't need to be serialized.

We currently have a use case where we think the spectrum energy channels may be wrong by a constant but unknown offset that we need to refine. Under this PR, we can trivially update the energy channels in our experiment list and serialize the result for downstream processing. If the spectrum is read-only then we need another approach, perhaps modifying the raw data itself. Now as it turns out, in NeXus this is easy, as it supports variants that allow us to update the metadata, so maybe that's ok. Does seem to defeat the purpose of having experimental models though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants