-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add spectrum to beam object #158
Conversation
This is an alternate version of #157 that doesn't use class derivation. It still does serialize the spectrum to the json file which increases file sizes dramatically. This is a real problem, and some solutions are being discussed. |
Any issues? I'd like to merge this and close #157 tomorrow, if possible. Thanks! |
You have a proposal of how to fix this before merging the PR? |
TL;DR: we discussed several ways of fixing the serialization issue but nothing seems to be straightforward. The timings/file sizes are impacted but processing seems fast enough for us that we want to punt the issue for now. Details: this implementation saves two big arrays of numbers for each shot. The .expt file sizes get a lot bigger. I ran some tests using SwissFEL master files for the JF16M (1400+ images). Some numbers: No spectra:
With spectra:
(* time to de-serialize the 1400+ experiment list) I do think I can save some space in the master h5 file by only storing the energy spectrum axis once, but that won't affect the imported.expt file size or the timings. Unfortunately I can't come up with a way to fix this. I'd prefer not to serialize the spectra at all, presuming it to be immutable, but then the beam object would need a link back to the raw data, which is not currently a thing in dxtbx. I mucked around with this a bit, creating a custom NeXusBeam object in nexus.py that had a link back to the raw data and that link was serialized instead, but I gave up on it pretty quickly. |
So following the call we had a chat about this, and we have a few concerns. One is that this bakes in the nexus model of a spectrum to be the only way we can describe it, for example a Gaussian is not allowed. Also for a neutron spallation source we may want to have a very broad spectrum which could get cumbersome with this model I wonder if a better model for this could be for a beam to have a spectrum as a separate object which conforms to some API but which could have multiple implementations including the one you have above. An alternative could be a simple Gaussian with 1% band pass say for multi layer optic systems Thoughts? |
Obviously the null value could be a standard monochromatic beam |
For what it’s worth this will also extend conceptually for the beam shape as well, where that could be a simple model or a complex measured image. So I think revisiting the fundamental design here could help I welcome alternative perspectives |
@graeme-winter I'm interested in this as an option. I do think we can punt it down the road a bit. It is very easy to write backwards compatibility into the deserialization methods if the object model changes. |
I have created a new branch, spectrum_beam2_usepickles, which greatly saves on file space and read time. See commit 47852f2. If agreeable, I'll add that commit to this pull request. The change saves the spectra to an external pickle file instead of the json ascii file. This is similar to the lookup.mask feature that already exists. (Note it does exacerbate dials/dials#1238.) New stats: master h5 file: 30MB <-- savings comes from now only storing 1 energy array These are much better numbers. Note, I couldn't use the external lookup code directly as that applies only to image sets. |
Hows about make this an external object, which does it's own lookup via pickle as above - to build in the idea that it may evolve. If you embed it directly in beam it can never be removed... I have bad experiences here. NXspectrum class or similar which implements spectrum, with a spectrum pointer on beam? |
Thinking more about this, I'm not totally on board with a separate object. I'm arriving at the conclusion that NeXus and dxtbx should be more tightly linked in our development efforts. In other words, if there is something missing that we need to describe in our experiments, we should consider the NeXus solution that will be used to represent the metadata, and match the dxtbx interfaces to that spec. If we need something in NeXus that isn't there to support a specific dxtbx interface that we need, then we should work on adding it to NeXus. In this case, the provided examples are monchromatic source, source with 1D spectra, source with a Gaussian spectrum, source with a known beam profile, pink beam, and neutron (of note, there is a TOF neutron NeXus application definition). I think all these cases are supported in NeXus already (see for example the profile and incident_beam_size fields for the beam profile case, and incident_wavelength_spread for a gaussian spectra). These things are all properties on the NXbeam class, so preserving that interface in dxtbx makes sense to me. After all, a spectrum and a beam shape are different, and they are both something that a beam has. If you still think that a beam object should produce a spectrum object, then what would the base class of the spectrum object be, such that other things could derive from it, such as a beam profile? Ideas:
|
I've got a very naive question, from a chemist's perspective, about this:
What-all is recorded in the "spectrum" that it requires this much memory? I would've thought a spectrum could be expressed as a 1-dimensional list: measurement start (e.g. 120 Angstrom), step size (e.g. 5 Angstrom), and then the list of counts. What else is there that requires 140+ MB? |
So it's a great question. These spectra are 2500 energy channels and 2500 'weights' per image. Here's an example spectra from SwissFEL (weights vs. eV): In the imported json file, each number is represented in ASCII and seperated by things like newlines, spaces, and commas, that's what adds up to 144 MB. The new version which uses an external pickle file for the spectra, is more like 18MB, plus it only saves the energy channels once. Much more realistic for 1400*2500*double precision for the whole set. |
Okay, the amount of information you're referring to is oceanically vast compared to the graph of a spectrum that you're showing, which to my naive eye looks like a 1D array of ~100 integers. Is there a reason to keep all the "things like newlines, spaces, and commas", as well as all the other stuff that adds up to 144 MB? Do you use - or anticipate using - any of that in any calculations? |
Rounding to int is possible, maybe even reasonable, but yah, it's 2500+
channels. Reducing the number of channels would require smoothing and that
defeats the point of the analysis we want to do.
And right, the pickle version just saves the flex arrays so it's as compact
as a double can be, no other nonsense. Even so, the best approach would be
links to the original data, which could be done if we moved to HDF5 (see
#1238).
…On Fri, Apr 17, 2020 at 2:25 PM Artem Lyubimov ***@***.***> wrote:
Okay, the amount of information you're referring to is oceanically vast
compared to the graph of a spectrum that you're showing, which to my naive
eye looks like a 1D array of ~100 integers. Is there a reason to keep all
the "things like newlines, spaces, and commas", as well as all the other
stuff that adds up to 144 MB? Do you use - or anticipate using - any of
that in any calculations?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#158 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADGY5SSSUGN6KVPQ5EHPH33RNDCNNANCNFSM4MEICSZA>
.
|
So yah maybe that is not so easy 🤔 |
Ok, now that #151 is merged, I've changed the base of this PR to master, rebased on master, and I've added 69d1d56 to move the spectra into pickle files instead of the json text. This is again ready for review. @graeme-winter if you have requests for a different object model could you restate them here? |
Rather than use pickle how about use hdf5 ? I worry about py2/3 compatibility with pickle. |
There is also a security risk with pickle, which means it should ideally be avoided in cases like this when you are just writing data. See dials/dials#512 for details. This is one of the main reasons that we switched to using message pack for reflection tables. |
Ok, I've got a version using simple h5 files now commited. Thoughts? |
Stats update. Pickle version (as before): New h5 version: |
For what it's worth I am not hostile to the pickle solution, since this is
Since we need to do $something properly anyway for masks maybe we could do those at the same time? Or at the very least not come up with two different solutions to one problem. |
The trouble with flags is that they're often just left waving in the breeze. Replacing pickle for masks is long overdue. Closing up on a year now: dials/dials#821 |
@dagewa you are not wrong -> guess we need to define how we are going to store this data in HDF5 then (such that the spectra do not somehow collide with masks, and we don't find ourselves being painted into a corner, and we do not have to keep track of many auxiliary files) -> 🤔 |
Bump. I'd really like to resolve this as the NeXus Gold Standard paper is in revisions again. |
Adds these methods: set_spectrum(flex.double energies, flex.double weights) get_spectrum_energies() get_spectrum_weights() get_weighted_wavelength() Two flex arrays, energies (eV) and weights (dimensionless), represent the spectrum as a scatter plot. This follows the conventions set up in nexusformat/definitions#717. The new function get_weighted_wavelength uses a simple weighted mean to determine a wavelength. It is ok if the beam has a wavelength != get_weighted_wavelength. This represents a calibrated wavelength determined from processing the spectrum. is_similar_to tests get_weighted_wavelength between the two beams, if a spectrum is present. Includes tests.
- On serializing, create an external pickle file with the beam spectra and replace the spectra arrays in the dictionary with indices into the pickle file - On deserializing, read the external pickle file back and set the spectra appropiately - Change to_dict and from_dict to use flex.doubles instead of lists of floats. Not a primitive dictionary any longer but much faster to read/write
Rebased onto master. |
Hi folks, we want to merge this. We are needing it for our research, specifically for a DOE milestone due end of September. If there are objections, please sing out and let us know ASAP. We recognize that the ideal solution would resolve this and dials/dials#1238 at the same time, but we can't make the perfect the enemy of the good. Thanks. |
The last time the tests were run on this PR they failed. I've just updated the branch to see what the current status is. |
@phyy-nx I am in a difficult position here I have raised concerns with this approach above e.g. one specific NeXus implementation is what we have to use for any spectral information rather than other approaches which may be computationally well justified e.g. for a single stable undulator peak from a pink beam source like VMXi. I am also not happy with "NeXus defines how you are allowed to write data analysis software" as a concept. Has the serialisation issue discussed above been resolved? Re: " It is very easy to write backwards compatibility into the deserialization methods if the object model changes." - every time we have been in this situation it has proven impossible to revisit, because it gets published somewhere and must never change after that. This will I expect happen here. However I am in a really nasty situation here. None of the concerns I have raised have been addressed (as far as I am aware) however to not merge this is going to prevent DOE grant deliverables being delivered and I will be held responsible for this. So, the only winning move is not to play. |
If I understand this correctly, the master.h5 file contains both the raw image data, and per-image beam spectra? And after importing stage, you have extracted the metadata, and an external data file containing only the per-image beam spectra? Yet these files are somehow (almost) as large as the master.h5 which contains the entire raw image data? Perhaps I'm missing something, but I don't understand why it is necessary to extract and store the per-image beam spectrum in an external pickle/h5 file? Couldn't this be obtained from the master.h5 file on demand, analagously with how we access the raw image data via |
Ah, I see you touched on this point right at the top of the PR:
Perhaps this is something that could be explored in more detail, as this seems like it would be a much preferable solution, if we can find a way to make it work. |
Although |
👍 I like the suggestion of |
What's being asked here is can we avoid serializing the spectrum? Therefore we have a Fundamental Question: is the spectrum immutable? Can it never be refined in any way? If so, then it doesn't need to be serialized. We currently have a use case where we think the spectrum energy channels may be wrong by a constant but unknown offset that we need to refine. Under this PR, we can trivially update the energy channels in our experiment list and serialize the result for downstream processing. If the spectrum is read-only then we need another approach, perhaps modifying the raw data itself. Now as it turns out, in NeXus this is easy, as it supports variants that allow us to update the metadata, so maybe that's ok. Does seem to defeat the purpose of having experimental models though. |
Adds these methods:
set_spectrum(flex.double energies, flex.double weights)
get_spectrum_energies()
get_spectrum_weights()
get_weighted_wavelength()
Two flex arrays, energies (eV) and weights (dimensionless), represent the spectrum as a scatter plot. This follows the conventions set up in nexusformat/definitions#717. The new function get_weighted_wavelength uses a simple weighted mean to determine a wavelength.
It is ok if the beam has a wavelength != get_weighted_wavelength. This represents a calibrated wavelength determined from processing the spectrum.
is_similar_to tests get_weighted_wavelength between the two beams, if a spectrum is present.
Includes tests.