Argument subsample does not work for a CoregPipeline #428

Closed
rhugonnet opened this issue Sep 2, 2023 · 4 comments · Fixed by #436

@rhugonnet

Could also solve #137 at the same time...

rhugonnet commented Sep 5, 2023

Actually, this is quite limiting: the old Deramp used to have an implicit subsample (of 500,000 samples) inside the class itself, and now that is not possible, so it takes ages.

rhugonnet commented Sep 6, 2023

Alright, trying to brainstorm how to make the subsample functionality future-proof before implementing new changes!

What do we have now?

Main problem: for many methods, we need the spatial dimension OR spatial elevation information of the inputs to run the coreg. For example: deriving slope/aspect in NuthKaab, deriving map coordinates in Deramp, deriving along-track coordinates in DirectionalBias, etc... But this is only known once you call the subclass.
Additionally, we currently don't provide any way of having a custom subsample for each step of a pipeline: #137.
Finally, we need to control the related random sampling: #243.

Ideas for solution

We could combine all of the following changes:

  1. Add an optional subsample argument for every Coreg subclass in __init__() that will mainly be used for pipelines (otherwise it can be passed to fit()). This would solve #137 (Allow custom subsampling for each CoregPipeline step separately).
  2. Call fit() instead of _fit_func in CoregPipeline.fit(), to re-do the subsampling step for every pipeline step: https://github.com/GlacioHack/xdem/blob/main/xdem/coreg/base.py#L1439. We'll need to ensure this won't duplicate other processing steps (i.e., all other steps are skipped if the inputs are already in the proper array/transform/crs format).
  3. Pass a boolean mask of the subsample instead of masking values as NaNs: lightweight, and it preserves information. By default, we could always subsample on the "reference", so that it's fully reproducible (and add this to the doc, because it can make a big difference).
  4. Perform the actual subsampling within each subclass (NuthKaab, Deramp, etc.), maybe by consistently calling a non-public parent Coreg method (a toy sketch of how points 1–4 could fit together follows this list).
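A minimal, self-contained toy sketch of how points 1–4 could fit together (plain numpy with simplified stand-ins for the real classes; the VerticalShift step, the mask helper and the pipeline logic below are illustrative, not the actual xdem implementation):

```python
import numpy as np


class Coreg:
    def __init__(self, subsample=None):
        # Point 1: the subsample size is stored at instantiation, mainly for pipelines
        self.subsample = subsample

    def fit(self, ref, tba, random_state=None):
        # Points 2-3: fit() builds a boolean subsample mask over valid pixels...
        valid = np.isfinite(ref) & np.isfinite(tba)
        mask = self._get_subsample_mask(valid, random_state)
        # ...and hands arrays + mask to the subclass (point 4)
        self._fit_func(ref, tba, mask)
        return self

    def _get_subsample_mask(self, valid, random_state=None):
        rng = np.random.default_rng(random_state)
        idx = np.flatnonzero(valid)
        if self.subsample is not None and self.subsample < idx.size:
            idx = rng.choice(idx, self.subsample, replace=False)
        mask = np.zeros(valid.shape, dtype=bool)
        mask.flat[idx] = True
        return mask

    def _fit_func(self, ref, tba, mask):
        raise NotImplementedError


class VerticalShift(Coreg):
    def _fit_func(self, ref, tba, mask):
        # Point 4: the subclass does the actual (here trivial) fit on the subsampled values
        self.shift = np.median(ref[mask] - tba[mask])


class CoregPipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit(self, ref, tba, random_state=None):
        tba = tba.copy()
        for step in self.steps:
            # Point 2: call fit() (not _fit_func) so each step redoes its own subsampling
            step.fit(ref, tba, random_state)
            tba = tba + step.shift  # toy stand-in for calling the step's apply()
        return self


# Two steps with different subsample sizes, as requested in #137
rng = np.random.default_rng(0)
ref = rng.normal(1000.0, 50.0, (500, 500))
tba = ref - 2.5
pipeline = CoregPipeline([VerticalShift(subsample=10_000), VerticalShift(subsample=500)])
pipeline.fit(ref, tba, random_state=42)
print([round(step.shift, 3) for step in pipeline.steps])  # [2.5, 0.0]
```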

But: with this workflow, the "data necessary for the coreg" would always be processed/subsampled by code written in the subclass method itself, instead of by a centralized pre-processing function of Coreg... It could be OK, but I'm not fully sure it's ideal. Right now we always read all the data at once, but with an Xarray accessor we'd have to read/combine data chunks by calling operations in the subclass directly, and repeat that code over all the subclasses that would need to accept da.array objects in addition to np.array.
(Because co-registration methods usually require regression or optimization, which cannot be "combined" from several subruns, all the data to be optimized has to be loaded. This means that the main way to combine raster chunking with Coreg would be through subsampling: reading a random subsample per chunk, never loading the entire Raster in memory, and eventually bringing only the full subsample back into memory.)
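As a rough illustration of that chunked-subsampling idea (assuming dask is installed; this is not existing xdem code), a lazy boolean mask can select roughly n_sub pixels chunk by chunk, so only the selected values are ever materialized:

```python
import dask.array as da

dem = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))  # stand-in for a large DEM
n_sub = 100_000

# Lazy boolean mask: each pixel is kept with probability n_sub / dem.size,
# so the expected subsample size is n_sub, without ever loading the full raster.
mask = da.random.random(dem.shape, chunks=dem.chunks) < n_sub / dem.size

subsample = dem[mask].compute()  # only the selected values end up in memory
print(subsample.size)  # approximately n_sub
```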

Thinking more about it, I think it's fine, as long as we organize our functions nicely! 😄
In short, the structure would be:

  • Coreg.__init__ can be passed a subsample and random_state argument, stored and available for all steps,
  • Coreg.fit() runs checks on the input data, pre-processes it only if necessary (this needs to be skipped when called by CoregPipeline) and passes either a np.array or a da.array, along with a np.array boolean mask, to _fit_func(),
  • CoregSubclass._fit_func() gets the np.array or da.array and derives additional variables (e.g., slope, coordinates, angle), potentially in a lazy chunked way (overlap is necessary for terrain attributes), then performs the subsampling, potentially also in a lazy chunked way, and passes 1D vectors of subsampled data to a fit_subclass function (e.g., fit_nuthkaab) that lives outside of the classes,
  • The fit_subclass function optimizes the fit with the input vectors and returns the optimized coregistration parameters (see the sketch after this list)!
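A short sketch of the last two bullets, i.e. the split between a subclass _fit_func() that derives variables and applies the boolean mask, and a module-level fit function working on 1D vectors (fit_nuthkaab below is a named placeholder returning a dummy summary, not the real Nuth & Kääb optimization):

```python
import numpy as np


def fit_nuthkaab(dh, slope, aspect):
    # Placeholder for the real optimization on 1D subsampled vectors
    # (the actual method fits dh/tan(slope) against aspect to get x/y/z shifts)
    return {"median_dh": float(np.median(dh)), "n_samples": int(dh.size)}


class NuthKaab:
    def __init__(self, subsample=None):
        self.subsample = subsample

    def _fit_func(self, ref, tba, mask):
        # Derive the auxiliary variables (slope, aspect); with dask this could be lazy/chunked
        gy, gx = np.gradient(ref)
        slope = np.sqrt(gx**2 + gy**2)
        aspect = np.arctan2(-gx, gy)
        dh = ref - tba
        # Subsample with the boolean mask and pass 1D vectors to the external fit function
        self._meta = fit_nuthkaab(dh[mask], slope[mask], aspect[mask])


# Toy usage: the mask would normally be built by Coreg.fit()
rng = np.random.default_rng(42)
ref = rng.normal(1000.0, 50.0, (200, 200))
tba = ref - 3.0
mask = rng.random(ref.shape) < 0.05  # ~5 % random subsample
step = NuthKaab(subsample=2000)
step._fit_func(ref, tba, mask)
print(step._meta)
```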

Any thoughts @erikmannerfelt @adehecq?

adehecq commented Sep 7, 2023

Mmmh, difficult task!
Some thoughts regarding your suggestion:

  • point 1: ok
  • point 2: I don't understand why _fit_func cannot be used anymore...
  • point 3: Ok for the mask. Or maybe list of indices to be used?
  • point 4: could each subclass not just have a subsampling method that would be called during fit?

Something else I'm thinking about: for some methods, it makes sense to randomly subsample the points (e.g. Tilt, Deramp, VerticalShift...); for others, it would make more sense to simply downgrade the resolution (e.g. NuthKaab). Maybe we could get away with two generic subsampling methods, one that randomly samples and one that downgrades the resolution?
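To illustrate the two generic strategies side by side (plain numpy, purely illustrative, not xdem code):

```python
import numpy as np

rng = np.random.default_rng(42)
dem = rng.normal(1000.0, 50.0, (1024, 1024))

# 1) Random subsampling: keep n random pixels (suits Tilt/Deramp/VerticalShift-type fits)
n = 10_000
flat_idx = rng.choice(dem.size, n, replace=False)
random_sample = dem.ravel()[flat_idx]

# 2) Resolution downgrade: block-average by a factor (suits grid-based methods like NuthKaab)
factor = 4
coarse = dem.reshape(dem.shape[0] // factor, factor, dem.shape[1] // factor, factor).mean(axis=(1, 3))

print(random_sample.shape, coarse.shape)  # (10000,) (256, 256)
```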

rhugonnet commented Sep 7, 2023

Yes, not easy!
Good points, some more thoughts:

  • Point 2: It would be so that fit() can re-perform the subsampling (which can change for every method) + the method attribution (P-P, R-P or R-R) consistently before sending the data to _fit_func, without having to re-write that logic in every _fit_func (for subsampling) or in CoregPipeline (for method attribution). Right now we call apply() in CoregPipeline for the same reason! This actually also answers Point 4.
  • Point 3: True, the vector of indices is probably shorter in length, but the integer data type can make it larger in memory than a full boolean array (int64 indices cost 8 bytes each versus 1 byte per pixel for the mask, so the mask wins once more than about one-eighth of the pixels are sampled; quick comparison below). Additionally, a list of indices would not be easily chunkable by Dask in relation to the DEMs, while we could derive a boolean mask in an out-of-memory way fairly easily...
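For reference, the back-of-the-envelope numbers behind that break-even point (assuming int64 indices and a 1-byte boolean mask):

```python
import numpy as np

n_pixels = 10_000 * 10_000   # a 10,000 x 10,000 DEM
n_selected = 500_000         # subsample size

mask_mb = np.dtype(bool).itemsize * n_pixels / 1e6       # 100 MB, independent of n_selected
index_mb = np.dtype(np.int64).itemsize * n_selected / 1e6  # 4 MB for 5e5 indices

print(mask_mb, index_mb)  # the mask only wins once n_selected > n_pixels / 8
```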

Good point on the downsampling; actually, I know a lot of users who ran into memory problems just because of that. There should be logic in the pre-processing that always derives the fit from the lowest-resolution data by default. I'll open an issue so we don't forget!
