Argument subsample does not work for a CoregPipeline #428

Closed
rhugonnet opened this issue Sep 2, 2023 · 4 comments · Fixed by #436

@rhugonnet

Could also solve #137 at the same time...

rhugonnet commented Sep 5, 2023

Actually, this is quite limiting: the old Deramp used to have an implicit subsample (of 500,000 samples) inside the class itself, and now that is not possible, so it takes ages.

rhugonnet commented Sep 6, 2023

Alright, trying to brainstorm how to make the subsample functionality future-proof before implementing new changes!

What do we have now?

Main problem: for many methods, we need the spatial dimension OR spatial elevation information of the inputs to run the coreg. For example: deriving slope/aspect in NuthKaab, deriving map coordinates in Deramp, deriving along-track coordinates in DirectionalBias, etc... But this is only known once you call the subclass.
Additionally, we currently don't provide any way of having a custom subsample for each step of a pipeline: #137.
Finally, we need to control the related random sampling: #243.

Ideas for solution

We could combine all of the following changes:

  1. Add an optional subsample argument for every Coreg subclass in __init__() that will mainly be used for pipelines (otherwise it can be passed to fit()). This would solve #137 (Allow custom subsampling for each CoregPipeline step separately).
  2. Call fit() instead of _fit_func in CoregPipeline.fit(), to re-do the subsampling step for every pipeline step: https://github.com/GlacioHack/xdem/blob/main/xdem/coreg/base.py#L1439. We'll need to ensure this won't duplicate other processing steps (i.e., all other steps are skipped if the inputs are already in the proper array/transform/crs format).
  3. Pass a boolean mask of the subsample instead of masking values as NaNs: lightweight, and it preserves information. By default, we could always subsample on the "reference", so that it's fully reproducible (and add this to the doc, because it can make a big difference).
  4. Perform the actual subsampling within each subclass (NuthKaab, Deramp, etc.), maybe by consistently calling a non-public parent Coreg method (a toy sketch of how points 1–4 could fit together follows this list).
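A minimal, self-contained toy sketch of how points 1–4 could fit together (plain numpy with simplified stand-ins for the real classes; the VerticalShift step, the mask helper and the pipeline logic below are illustrative, not the actual xdem implementation):

```python
import numpy as np


class Coreg:
    def __init__(self, subsample=None):
        # Point 1: the subsample size is stored at instantiation, mainly for pipelines
        self.subsample = subsample

    def fit(self, ref, tba, random_state=None):
        # Points 2-3: fit() builds a boolean subsample mask over valid pixels...
        valid = np.isfinite(ref) & np.isfinite(tba)
        mask = self._get_subsample_mask(valid, random_state)
        # ...and hands arrays + mask to the subclass (point 4)
        self._fit_func(ref, tba, mask)
        return self

    def _get_subsample_mask(self, valid, random_state=None):
        rng = np.random.default_rng(random_state)
        idx = np.flatnonzero(valid)
        if self.subsample is not None and self.subsample < idx.size:
            idx = rng.choice(idx, self.subsample, replace=False)
        mask = np.zeros(valid.shape, dtype=bool)
        mask.flat[idx] = True
        return mask

    def _fit_func(self, ref, tba, mask):
        raise NotImplementedError


class VerticalShift(Coreg):
    def _fit_func(self, ref, tba, mask):
        # Point 4: the subclass does the actual (here trivial) fit on the subsampled values
        self.shift = np.median(ref[mask] - tba[mask])


class CoregPipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit(self, ref, tba, random_state=None):
        tba = tba.copy()
        for step in self.steps:
            # Point 2: call fit() (not _fit_func) so each step redoes its own subsampling
            step.fit(ref, tba, random_state)
            tba = tba + step.shift  # toy stand-in for calling the step's apply()
        return self


# Two steps with different subsample sizes, as requested in #137
rng = np.random.default_rng(0)
ref = rng.normal(1000.0, 50.0, (500, 500))
tba = ref - 2.5
pipeline = CoregPipeline([VerticalShift(subsample=10_000), VerticalShift(subsample=500)])
pipeline.fit(ref, tba, random_state=42)
print([round(step.shift, 3) for step in pipeline.steps])  # [2.5, 0.0]
```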

But: with this workflow, the "data necessary for the coreg" would always be processed/subsampled by code written in the subclass method itself, instead of by a centralized pre-processing function of Coreg... It could be OK, but I'm not fully sure it's ideal. Right now we always read all the data at once, but with an Xarray accessor we'd have to read/combine data chunks by calling operations in the subclass directly, and repeat that code over all the subclasses that would need to accept da.array objects in addition to np.array.
(Because co-registration methods usually require regression or optimization, which cannot be "combined" from several subruns, all the data to be optimized has to be loaded. This means that the main way to combine raster chunking with Coreg would be through subsampling: reading a random subsample per chunk, never loading the entire Raster in memory, and eventually bringing only the full subsample back into memory.)
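As a rough illustration of that chunked-subsampling idea (assuming dask is installed; this is not existing xdem code), a lazy boolean mask can select roughly n_sub pixels chunk by chunk, so only the selected values are ever materialized:

```python
import dask.array as da

dem = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))  # stand-in for a large DEM
n_sub = 100_000

# Lazy boolean mask: each pixel is kept with probability n_sub / dem.size,
# so the expected subsample size is n_sub, without ever loading the full raster.
mask = da.random.random(dem.shape, chunks=dem.chunks) < n_sub / dem.size

subsample = dem[mask].compute()  # only the selected values end up in memory
print(subsample.size)  # approximately n_sub
```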

Thinking more about it, I think it's fine, as long as we organize our functions nicely! 😄
In short, the structure would be:

  • Coreg.__init__ can be passed a subsample and random_state argument, stored and available for all steps,
  • Coreg.fit() runs checks on the input data, pre-processes it only if necessary (this needs to be skipped when called by CoregPipeline) and passes either a np.array or a da.array, along with a np.array boolean mask, to _fit_func(),
  • CoregSubclass._fit_func() gets the np.array or da.array and derives additional variables (e.g., slope, coordinates, angle), potentially in a lazy chunked way (overlap is necessary for terrain attributes), then performs the subsampling, potentially also in a lazy chunked way, and passes 1D vectors of subsampled data to a fit_subclass function (e.g., fit_nuthkaab) that lives outside of the classes,
  • The fit_subclass function optimizes the fit with the input vectors and returns the optimized coregistration parameters (see the sketch after this list)!
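A short sketch of the last two bullets, i.e. the split between a subclass _fit_func() that derives variables and applies the boolean mask, and a module-level fit function working on 1D vectors (fit_nuthkaab below is a named placeholder returning a dummy summary, not the real Nuth & Kääb optimization):

```python
import numpy as np


def fit_nuthkaab(dh, slope, aspect):
    # Placeholder for the real optimization on 1D subsampled vectors
    # (the actual method fits dh/tan(slope) against aspect to get x/y/z shifts)
    return {"median_dh": float(np.median(dh)), "n_samples": int(dh.size)}


class NuthKaab:
    def __init__(self, subsample=None):
        self.subsample = subsample

    def _fit_func(self, ref, tba, mask):
        # Derive the auxiliary variables (slope, aspect); with dask this could be lazy/chunked
        gy, gx = np.gradient(ref)
        slope = np.sqrt(gx**2 + gy**2)
        aspect = np.arctan2(-gx, gy)
        dh = ref - tba
        # Subsample with the boolean mask and pass 1D vectors to the external fit function
        self._meta = fit_nuthkaab(dh[mask], slope[mask], aspect[mask])


# Toy usage: the mask would normally be built by Coreg.fit()
rng = np.random.default_rng(42)
ref = rng.normal(1000.0, 50.0, (200, 200))
tba = ref - 3.0
mask = rng.random(ref.shape) < 0.05  # ~5 % random subsample
step = NuthKaab(subsample=2000)
step._fit_func(ref, tba, mask)
print(step._meta)
```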

Any thoughts @erikmannerfelt @adehecq?

adehecq commented Sep 7, 2023

Mmmh, difficult task!
Some thoughts regarding your suggestion:

  • point 1: ok
  • point 2: I don't understand why _fit_func cannot be used anymore...
  • point 3: Ok for the mask. Or maybe list of indices to be used?
  • point 4: could each subclass not just have a subsampling method that would be called during fit?

Something else I'm thinking about: for some methods, it makes sense to randomly subsample the points (e.g. Tilt, Deramp, VerticalShift...); for others, it would make more sense to simply downgrade the resolution (e.g. NuthKaab). Maybe we could get away with two generic subsampling methods, one that randomly samples and one that downgrades the resolution?
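To illustrate the two generic strategies side by side (plain numpy, purely illustrative, not xdem code):

```python
import numpy as np

rng = np.random.default_rng(42)
dem = rng.normal(1000.0, 50.0, (1024, 1024))

# 1) Random subsampling: keep n random pixels (suits Tilt/Deramp/VerticalShift-type fits)
n = 10_000
flat_idx = rng.choice(dem.size, n, replace=False)
random_sample = dem.ravel()[flat_idx]

# 2) Resolution downgrade: block-average by a factor (suits grid-based methods like NuthKaab)
factor = 4
coarse = dem.reshape(dem.shape[0] // factor, factor, dem.shape[1] // factor, factor).mean(axis=(1, 3))

print(random_sample.shape, coarse.shape)  # (10000,) (256, 256)
```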

rhugonnet commented Sep 7, 2023

Yes, not easy!
Good points, some more thoughts:

  • Point 2: It would be so that fit() can re-perform the subsampling (which can change for every method) + the method attribution (P-P, R-P or R-R) consistently before sending the data to _fit_func, without having to re-write that logic in every _fit_func (for subsampling) or in CoregPipeline (for method attribution). Right now we call apply() in CoregPipeline for the same reason! This actually also answers Point 4.
  • Point 3: True, the vector of indices is probably shorter in length, but the integer data type can make it larger in memory than a full boolean array (int64 indices cost 8 bytes each versus 1 byte per pixel for the mask, so the mask wins once more than about one-eighth of the pixels are sampled; quick comparison below). Additionally, a list of indices would not be easily chunkable by Dask in relation to the DEMs, while we could derive a boolean mask in an out-of-memory way fairly easily...
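For reference, the back-of-the-envelope numbers behind that break-even point (assuming int64 indices and a 1-byte boolean mask):

```python
import numpy as np

n_pixels = 10_000 * 10_000   # a 10,000 x 10,000 DEM
n_selected = 500_000         # subsample size

mask_mb = np.dtype(bool).itemsize * n_pixels / 1e6       # 100 MB, independent of n_selected
index_mb = np.dtype(np.int64).itemsize * n_selected / 1e6  # 4 MB for 5e5 indices

print(mask_mb, index_mb)  # the mask only wins once n_selected > n_pixels / 8
```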

Good point on the downsampling; actually, I know a lot of users who ran into memory problems just because of that. There should be logic in the pre-processing that always derives the fit from the lowest-resolution data by default. I'll open an issue so we don't forget!
