-
Great discussion of the complexity of the analysis challenges. Given the flexibility needed to address the spectrum of questions people might have, I've always been a fan of the "node-red"-like visual programming concept for coupling models to optimizers and interpolators. This can be a pain to maintain too, though, and needs some serious work on robust dataclasses underpinning all the components.
-
Clearly I share some node-red sentiments (cf. reductus) 😉 Even if we don't make it explicit in the UI, we can still structure the calculation as a series of transforms that are chained together and fed to an execution engine. Our case is somewhat unusual in that information about q, Δq flows left while I(q) flows right, so we don't have a simple directed acyclic graph (DAG) like other systems. You will need use-case analysis when developing your graphical language; for example, your diagrams should be able to handle combined batch fitting. Without some care you will run into the usual problems of loops and conditionals, which are awkward in a LabVIEW-like visual programming environment. Cut-and-paste diagrams are even harder to maintain than cut-and-paste code.
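To make the "information flowing both ways" point concrete, here is a minimal, hypothetical Python sketch (the class and function names are illustrative, not the reductus or sasmodels API): the resolution node publishes the q_calc it needs from the model (flowing left), then maps the model's I(q_calc) back onto the measured q (flowing right) before it reaches the χ² node.

```python
import numpy as np

class PinholeResolution:
    """Toy Gaussian smearing node: q, Δq flow left; I(q) flows right."""
    def __init__(self, q, dq, nsigma=3, oversample=10):
        self.q, self.dq = np.asarray(q), np.asarray(dq)
        # The q values the model must be evaluated at (the request flowing left).
        lo = max((self.q - nsigma * self.dq).min(), 1e-6)
        hi = (self.q + nsigma * self.dq).max()
        self.q_calc = np.linspace(lo, hi, oversample * len(self.q))

    def apply(self, I_calc):
        # Smear I(q_calc) onto the measured q (the result flowing right).
        w = np.exp(-0.5 * ((self.q_calc - self.q[:, None]) / self.dq[:, None]) ** 2)
        w /= w.sum(axis=1, keepdims=True)
        return w @ I_calc

def chisq(I_meas, dI, I_theory):
    return np.sum(((I_meas - I_theory) / dI) ** 2)

# The execution engine wires the chain: ask the resolution node for q_calc,
# evaluate the model there, push the smeared theory into the χ² node.
q = np.linspace(0.01, 0.5, 50)
dq = 0.02 * q
res = PinholeResolution(q, dq)
I_calc = 1.0 / (1.0 + (res.q_calc / 0.1) ** 4)     # stand-in model
I_meas = 1.0 / (1.0 + (q / 0.1) ** 4)
print(chisq(I_meas, 0.05 * I_meas, res.apply(I_calc)))
```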
-
I see your Reductus (and I like it), and I raise you a universally applicable modular data correction pipeline, now also under development as a fast Python library. Like your interface for Reductus, the universal pipeline was implemented pretty well in DAWN as a single column of elements to apply, with a separate (Eclipse) UI around it with paraphernalia where you could select files, plot various things and other "schnickischnacki" (bells and whistles), as they say here.

But I understand the problem of LabVIEW and its tendency to result in horrible visual spaghetti code. I've seen beamlines during my time at SPring-8 which were run (poorly) using LabVIEW, and those LV scripts were absolutely unintelligible to everyone but the person who wrote them. I guess that's one way to get job security. On the other hand, set-ups like that might make complex data analysis, such as multimodal analyses, manageable. Like this structure we envisioned for a rejected EU proposal:

For such things, it's impossible to provision for all possible combinations in a traditional SasView-like GUI. That, in combination with the ever-increasing complexity of the SasView GUI for an ever broader spectrum of possible SAS analyses, convinces me that another, more modular and flexible approach is required.

One way to avoid complexities might be to separate the analysis into isolatable steps, so that you would set up an analysis pipeline for a single (set of) data and leave the optional batch handling and visualisations outside. With HDF5, each analysis could be stored alongside the dataset it analyzes (like we do for McSAS3 now), and visualisation of particular analyses can be done separately at a later stage. As you say, however, it needs specifying from the start exactly where the boundaries of each component are, otherwise it'll quickly end up as a mess.

(As a side note, I very recently started coding v3 of a data (re-)binning and merging tool, this time making heavy use of (attrs-defined) dataclasses. That really forces one to think at the most initial stages about what will travel where and how. Making the dataclasses took several days, but the actual functional part of the code is now so much cleaner. I can easily imagine such dataclasses being a foundation of modules for a UI, as they very strictly define the inputs and outputs; a rough sketch follows below.)

As you say, under the hood it can (and probably should) be as modular as possible, allowing for this chaining up. However, I don't understand what you mean by "q, Δq is flowing left". As far as I understand from your original post (OP), the q, I, their uncertainties and smearing matrices may have to be sliced or diced in different ways, but generally travel as a single object with associated methods and attributes, no?
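As a concrete illustration of the dataclass point above, here is a rough sketch of attrs-defined containers that pin down exactly what travels between pipeline modules. The class and field names are hypothetical, not the McSAS3 or SasView data structures.

```python
from typing import Optional

import attrs
import numpy as np

@attrs.define(frozen=True)
class ScatteringData1D:
    """What a 1D dataset carries from one pipeline step to the next."""
    q: np.ndarray             # momentum transfer, 1/Å
    intensity: np.ndarray     # I(q), 1/cm
    uncertainty: np.ndarray   # δI(q)
    dq: Optional[np.ndarray] = None   # optional resolution width

@attrs.define(frozen=True)
class AnalysisResult:
    """What an analysis step hands back, ready to be stored next to the data."""
    data: ScatteringData1D
    theory: np.ndarray
    parameters: dict

def rebin(data: ScatteringData1D, nbins: int) -> ScatteringData1D:
    """Toy rebinner: the signature alone documents the inputs and outputs."""
    edges = np.linspace(data.q.min(), data.q.max(), nbins + 1)
    idx = np.clip(np.digitize(data.q, edges) - 1, 0, nbins - 1)
    binned = lambda a: np.array([a[idx == k].mean() for k in range(nbins)])
    return ScatteringData1D(q=binned(data.q), intensity=binned(data.intensity),
                            uncertainty=binned(data.uncertainty))
```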
-
Github now allows diagrams:

```mermaid
graph LR;
    model --"I(qcalc)"-->resolution["G⊛I"]
    resolution--"I(q)"-->chisq["χ²"]
    data--"q,Δq,I,δI"-->chisq
    chisq--"q,Δq"-->resolution
    resolution--"qcalc"-->model
```
-
Note that background and scale are properties of the measurement (mostly) and not the model. That is, they are fittable parameters that are applied between G⊛I and χ² in the diagram above. This restructuring would remove the special code for managing scale and background in each model parameter table, and in particular for the composite models. Some transformations may have additional parameters, such as path length and sample thickness for multiple scattering.
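A minimal sketch of what that restructuring could look like, with scale and background applied as a measurement-level transform between the smeared theory and χ² (function names are illustrative, not the SasView API):

```python
import numpy as np

def measurement_transform(I_smeared, scale=1.0, background=0.0):
    """Per-measurement scale and flat background, applied after G⊛I."""
    return scale * I_smeared + background

def chisq(I_meas, dI, I_theory):
    return np.sum(((I_meas - I_theory) / dI) ** 2)

# A composite model such as A+B+C then needs no per-model scale/background:
# the components are summed first and a single measurement transform follows.
I_smeared = np.array([10.0, 5.0, 2.0])
I_meas = np.array([10.6, 5.4, 2.3])
dI = np.array([0.3, 0.2, 0.1])
print(chisq(I_meas, dI, measurement_transform(I_smeared, background=0.3)))
```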
-
I've got quite a bit of experience with flow-diagram-type interfaces, and I'm thinking more and more that it might be worth the trouble in SasView. I can knock up a basic one once I've ticked off a couple of other things on my list.
-
SasviewModel needs to go away and several sasmodels interfaces need to be updated.
At the heart of the theory calculation is a collection of Q calculation points. Each Q point in the measured/simulated data needs a basis of support selected from the calculation points in order to apply instrumental resolution and other transforms.
For simple 1D data where the point density is high and the sample is small, this is just the measured points plus some wings above and below so that resolution can be computed at the ends. If part of the data is masked, some points may need to be added in the middle. If the model shape is large there may be high-frequency components in the scattering, which means that we may need calculation points between the measured q points (oversampling).
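A hedged sketch of how such calculation points might be constructed for the simple pinhole case; the helper name and the choice of a regular grid are assumptions for illustration, not the sasmodels implementation.

```python
import numpy as np

def build_q_calc(q, dq, nsigma=3, oversample=1):
    """Measured q plus wings for resolution, on a grid dense enough to cover
    masked gaps and (with oversample > 1) high-frequency structure."""
    q, dq = np.asarray(q), np.asarray(dq)
    step = np.diff(q).min() / oversample
    lo = max(q[0] - nsigma * dq[0], step)   # low-q wing, kept positive
    hi = q[-1] + nsigma * dq[-1]            # high-q wing
    return np.arange(lo, hi + step, step)

q = np.linspace(0.01, 0.5, 50)
dq = 0.05 * q
print(len(build_q_calc(q, dq, oversample=4)))   # denser than the measured q
```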
For USANS with slit geometry the resolution from the measured points can pull from very high q, with weights distinct from pinhole SANS. For oriented samples the theory needs to be computed over (qx, qy) then integrated over the slit geometry. With some cleverness, angular dispersion in φ and perhaps θ could be incorporated into the resolution function rather than calculating it within the theory function but that probably isn't worth the additional code complexity.
Multiple scattering is computed with the fast Fourier transform, which requires regularly spaced points in (qx, qy). For 1D measurements this is followed by circular integration to recover I(q). The q points have to be dense enough to interpolate onto the calculated q values required for the measured q, Δq resolution. If Δq is roughly constant then the resolution could be calculated quickly with the Fourier transform. (Question: can we calculate 1-D multiple scattering effects by scaling a SESANS model and transforming back?)
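A rough sketch of the FFT route: k-fold self-convolutions of the 2D pattern followed by circular averaging. The Poisson weighting over scattering orders and the centering details are simplifying assumptions for illustration; this is not the sasmodels multiple-scattering code.

```python
import math
import numpy as np

def multiple_scattering_2d(I_2d, p, nmax=5):
    """Weighted sum of k-fold self-convolutions of I(qx,qy), computed via FFT.
    Ignores the padding needed to suppress circular wrap-around."""
    P = I_2d / I_2d.sum()                   # treat the pattern as a density
    F = np.fft.fft2(np.fft.ifftshift(P))    # put q=0 at the corner for the FFT
    weights = np.array([p**k * math.exp(-p) / math.factorial(k)
                        for k in range(1, nmax + 1)])
    weights /= weights.sum()
    out = np.zeros_like(P)
    for k, w in zip(range(1, nmax + 1), weights):
        out += w * np.real(np.fft.fftshift(np.fft.ifft2(F ** k)))
    return out * I_2d.sum()

def circular_average(I_2d, qx, qy, q_edges):
    """Integrate the 2D pattern back down to I(q) on the given q bins."""
    Qx, Qy = np.meshgrid(qx, qy)
    idx = np.digitize(np.hypot(Qx, Qy).ravel(), q_edges) - 1
    flat = I_2d.ravel()
    return np.array([flat[idx == k].mean() if np.any(idx == k) else 0.0
                     for k in range(len(q_edges) - 1)])
```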
For SESANS, a set of log-spaced q values is used as input to the Hankel transform when computing G(ξ) at the computed correlation lengths ξ. Resolution should be applied to G(ξ) due to effects such as wavelength dispersion and angular divergence, so the computed ξ may require points beyond the measured ξ.
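For reference, a small sketch of the Hankel-transform step on a log-spaced q grid; normalization and the resolution smearing in ξ are omitted, and this is not the sasmodels SESANS transform.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.special import j0

def hankel(xi, q, I_q):
    """G(ξ) ∝ ∫ J0(qξ) I(q) q dq, evaluated by the trapezoid rule."""
    kernel = j0(q[None, :] * np.asarray(xi)[:, None]) * I_q * q
    return trapezoid(kernel, q, axis=1)

q = np.logspace(-4, 0, 400)               # log-spaced q, 1/Å
I_q = 1.0 / (1.0 + (q / 0.01) ** 4)       # stand-in I(q)
xi = np.linspace(10.0, 2000.0, 20)        # correlation lengths ξ, Å
print(hankel(xi, q, I_q)[:5])
```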
Once you have the theory function you can subtract it from the data to compute the residuals. For simultaneous fitting you want to do this for multiple models after applying parameter constraints.
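A tiny sketch of that step for simultaneous fitting: a constraint function maps the shared fit parameters onto each model's parameters before the per-dataset residuals are concatenated (all names illustrative).

```python
import numpy as np

def residuals(theory_fn, params, q, I_meas, dI):
    return (I_meas - theory_fn(q, **params)) / dI

def simultaneous_residuals(shared, datasets, constraints):
    """datasets: list of (theory_fn, q, I, dI); constraints[i] maps the shared
    parameter dict onto the i-th model's parameters."""
    return np.concatenate([
        residuals(fn, constraints[i](shared), q, I, dI)
        for i, (fn, q, I, dI) in enumerate(datasets)
    ])

# e.g. two datasets sharing a radius but with independent scales:
constraints = [
    lambda s: {"radius": s["radius"], "scale": s["scale_1"]},
    lambda s: {"radius": s["radius"], "scale": s["scale_2"]},
]
```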
When running SasView the user will sometimes want to see intermediate results from these transformations. This is provided by a calculation tree in the user interface. In the simplest case the tree contains the q, Δq, I(q) data, the theory at q_calc, and the output of the resolution function. For multiple scattering there may be intermediate values of interest, such as the 2D pattern before integration back down to 1D and before any resolution calculation. Similarly for SESANS. For composite models this will include the individual components, such as A, B and C for A+B+C, or the structure factor in P@S.
Additional artifacts are available such as the profile shape for the onion model or the distributions for polydisperse parameters. Computed values such as the effective radius used for the structure factor calculation need to be returned. Within the models there may be values of interest such as the particle volume computed from the shape parameters.
This suggests an architecture where each calculation component can decorate a return structure with its name, its artifacts, and the return structures of its internal components (sketched below). That is:

- the onion model attaches q_calc, I(q_calc), radius_effective, the profile σ(r), the parameter weight distributions, and the weighted average form volume;
- the hardsphere model attaches q_calc and S(q_calc);
- the structure factor model attaches the computed volume fraction and P@S;
- the multiple scattering calculator attaches qx, qy and I(qx,qy) after multiple scattering, as well as q_calc′ and I(q_calc′);
- the pinhole resolution function attaches q, I(q), measured I(q) and measured Δq;
- the log likelihood calculator attaches the residuals and the equivalent normalized χ²;
- the simultaneous modeler attaches the constrained parameter values used for each model and the combined nllf.
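One possible shape for such a return structure, sketched with a plain dataclass; the class name and the `attach` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class CalcResult:
    """Node in the calculation tree: a component's name, its artifacts, and
    the results of the components it wraps."""
    name: str
    artifacts: Dict[str, Any] = field(default_factory=dict)
    children: List["CalcResult"] = field(default_factory=list)

    def attach(self, **artifacts):
        self.artifacts.update(artifacts)
        return self

# e.g. a pinhole resolution component wrapping an onion model component:
model = CalcResult("onion").attach(q_calc=[], I_calc=[], radius_effective=30.0)
resolution = CalcResult("pinhole_resolution", children=[model]).attach(q=[], I_smeared=[])
```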
Moving into fitting, the optimizer attaches the above structures for the best-fit model as well as artifacts from the fitter, such as the convergence plot and the parameter distributions; the batch fitter attaches the results for each dataset in the series, along with a table of control values and fitted parameter values for each dataset. Trends such as concentration vs. radius can then be modeled using the parameter uncertainty from the individual fits. If you believe your models, then the entire batch could be fitted simultaneously, with the parameterized trend model constraining the parameter value at each concentration.
Modular architectures are great for rapid development and long-term maintenance, but they also have disadvantages. For example, the intermediate artifacts are not needed for fitting, only for final display; either they need to be cheap to manage, or we need one interface for fits and another for retrieving artifacts. Instead of predetermining q_calc, we may want to generate it adaptively depending on the sample parameters or the derivative of the theory function. Since the entire calculation amounts to a high-dimensional integral, we may find that it is much more efficient to use Monte Carlo integration across parameter distributions and Q values, but this completely destroys modularity. More advanced structure factor calculations will require integrating P·S over the size distribution rather than computing P and S separately and then combining the results, further destroying modularity by pushing P and S into the same calculation kernel.
The current calculation is parallel along q, but for 1D models and modern GPUs the number of cores is much larger than the number of q points. We could unroll the loops in the models (over size and over angle) to make better use of the card, possibly with pseudo Monte Carlo techniques to reduce computation (there are more efficient ways to sample a multidimensional integral than using a dense grid). Most 1D models include loops over θ and maybe φ and ψ which can be unrolled. The models could be restructured so that this is done generically, with the bonus that we can increase the sampling density in θ, φ for elongated shapes. Some models, such as pringle, have an additional loop for an internal integral; the onion model and the like integrate over a series of shells. It'll probably be too hard to access this level of parallelism.
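As an illustration of the orientation unrolling, here is a sketch that replaces the dense θ grid with low-discrepancy (Sobol) samples, reading the "pseudo Monte Carlo" remark as quasi-Monte Carlo sampling; the cylinder kernel is a simplified stand-in, not the sasmodels kernel. Because the orientation average becomes one flat loop over sample points, it maps naturally onto GPU threads.

```python
import numpy as np
from scipy.special import j1
from scipy.stats import qmc

def cylinder_kernel(q, radius, length, theta):
    """Simplified oriented cylinder intensity (unnormalized stand-in)."""
    qr = q * radius * np.sin(theta)
    ql = q * length * np.cos(theta) / 2.0
    bessel = np.where(qr > 0, 2.0 * j1(qr) / np.where(qr > 0, qr, 1.0), 1.0)
    sinc = np.sinc(ql / np.pi)              # sin(ql)/ql
    return (bessel * sinc) ** 2

def oriented_average(q, radius, length, n=1024):
    # Sobol points mapped to orientations uniform in cos θ (flat loop over n).
    u = qmc.Sobol(d=1, scramble=True).random(n)[:, 0]
    theta = np.arccos(1.0 - u)
    kernel = cylinder_kernel(q[None, :], radius, length, theta[:, None])
    return kernel.mean(axis=0)

q = np.linspace(0.005, 0.5, 100)
print(oriented_average(q, radius=20.0, length=400.0)[:5])
```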
Unrolling the integrals on the GPU will speed up individual model evaluations. An alternative approach would be to allow multiple parameter sets to be computed in parallel. This may require tighter integration with the fitting engine, further destroying modularity. Or maybe it will just work to have multiple threads share the same GPU. Individual calculations will still take the same time but population fitters should be much faster.
Moving the resolution functions onto the GPU can provide further performance improvements, especially if we can leave the intermediate data on the card between the steps of the calculation. Similarly, we can calculate the residuals from the theory and data on the GPU so during the fit we only need to transfer the parameter set at the start of the calculation and the nllf at the end. We could potentially move the optimizer onto the GPU but with diminishing returns.
We might try implementing an nllf calculator for a SANS model in torch, using cupy to build the individual kernels from the C models. The advantage of torch is that you can use your modular design to construct your computational workflow and the torch infrastructure will manage the execution on the GPU and/or CPU.
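A rough sketch of what such a torch-based nllf calculator might look like. A simple sphere form factor written directly in torch stands in for a cupy-built C kernel, and the resolution and χ² steps stay on the same device; names and structure are illustrative only.

```python
import torch

def sphere_iq(q, radius, scale=1.0, background=0.0):
    """Stand-in form factor written in torch instead of a compiled C kernel."""
    qr = q * radius
    amp = 3.0 * (torch.sin(qr) - qr * torch.cos(qr)) / qr ** 3
    return scale * amp ** 2 + background

def gaussian_resolution(q_meas, dq, q_calc):
    """Row-normalized Gaussian smearing matrix, built on the same device."""
    w = torch.exp(-0.5 * ((q_calc[None, :] - q_meas[:, None]) / dq[:, None]) ** 2)
    return w / w.sum(dim=1, keepdim=True)

def nllf(params, q_meas, dq, I_meas, dI, q_calc):
    W = gaussian_resolution(q_meas, dq, q_calc)   # intermediate stays on device
    I_theory = W @ sphere_iq(q_calc, **params)
    return 0.5 * torch.sum(((I_meas - I_theory) / dI) ** 2)

device = "cuda" if torch.cuda.is_available() else "cpu"
q = torch.linspace(0.01, 0.5, 100, device=device)
dq = 0.02 * q
q_calc = torch.linspace(0.005, 0.6, 500, device=device)
I_meas = sphere_iq(q, radius=torch.tensor(40.0, device=device)) + 0.01
dI = 0.05 * I_meas
params = {"radius": torch.tensor(42.0, device=device, requires_grad=True)}
loss = nllf(params, q, dq, I_meas, dI, q_calc)
loss.backward()          # autograd gives d(nllf)/d(radius) for free
print(loss.item(), params["radius"].grad)
```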
Related issues:
#839
SasView/sasmodels#516
SasView/sasmodels#272
SasView/sasmodels#269
#1213
SasView/sasdata#17