Flatten / Reshape data cube dimensions #308
So, the input could be:
In my opinion the result should have a shape of MxN, resulting from a combination of the dimensions available in the datacube. Practical example:
If the input data also has the time dimension, we need to allow a result like:
Anyway, we will lose the information necessary for reshaping the output of the machine learning algorithm, so maybe we will also need another process to reshape the output, or a more general reshape process that allows flattening the data but also reconstructing it following a sample datacube (the data before flattening, for instance).
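The flatten-then-reconstruct idea described above can be sketched in plain NumPy (a minimal illustration, not an actual openEO process; the cube shape and the "template" variable are assumptions for the example):

```python
import numpy as np

# Hypothetical cube with dimensions (x, y, band) and shape (M, N, B)
cube = np.arange(4 * 5 * 2, dtype=float).reshape(4, 5, 2)

# Flatten to a 2-D matrix: one row per pixel, one column per band
flat = cube.reshape(-1, cube.shape[-1])
print(flat.shape)  # (20, 2)

# The flattened matrix alone no longer carries the original layout.
# Keeping a "sample" cube (the data before flattening) provides the
# shape information needed to reconstruct it afterwards.
template = cube  # stands in for the datacube before flattening
restored = flat.reshape(template.shape)
assert np.array_equal(restored, cube)
```

The same pattern underlies xarray's `stack`/`unstack`, where the dimension metadata plays the role of the template.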
Porting over a discussion from #306 - posted by @jdries:
We've just checked the process description closely and we believe that the behavior is not covered by the process description of
It is possible that I've misunderstood how you envision the flattening approach through apply_dimension, and it would be good to look at an example. What I found in UC3 used for cc @edzer - How does a user flatten in stars? (edit: st_redimension)
Looking at the example above, we also typically solve that one without flattening. The random_forest_inference callback then simply gets the 2 band values per pixel and timestep, and predicts the NDVI. Also our more complex cases based on deep learning work like that; no flattening and reshaping is needed. The big problem lies more with training models, because that is a 'global' operation that cannot be split up using the callback approach. On the other hand, the point sampling through aggregate_spatial does solve the problem of 'flattening' the spatial dimensions.
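The per-pixel callback approach described above can be sketched with NumPy (the cube shape and the NDVI-style predictor are illustrative stand-ins, not the actual random_forest_inference model):

```python
import numpy as np

# Hypothetical cube: (x, y, time, band) with two bands, e.g. red and NIR
cube = np.random.default_rng(0).random((4, 5, 3, 2))

def predict_ndvi(bands):
    """Illustrative stand-in for a per-pixel model: receives the 2 band
    values for one pixel/timestep and returns a single prediction."""
    red, nir = bands
    return (nir - red) / (nir + red)

# Apply the callback along the band dimension: no flattening is needed,
# and the (x, y, time) layout is preserved in the result.
result = np.apply_along_axis(predict_ndvi, -1, cube)
print(result.shape)  # (4, 5, 3)
```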
In my opinion, we would need a reshape process that does raster-cube to vector-cube and vice versa (or two separate processes). When the input is a raster-cube it flattens the data to a vector-cube:

- Input: (x,y,time,band) with shape (M,N,T,B) = (10,20,100,2)
- Input: (x,y,time) with shape (M,N,T) = (10,20,100)
- Input: (x,y,band) with shape (M,N,B) = (10,20,2)

When the input is a vector-cube it reshapes the data to a raster-cube given a target cube. Using combinations of apply_dimension and reduce_dimension might be difficult to understand for someone with a machine learning background and, as we have seen, it does not cover all the possible scenarios.
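A minimal sketch of what such raster-cube to vector-cube flattening could look like, using NumPy as an illustration with the (10,20,100,2) shape from the example above (not an actual openEO process; the coordinate bookkeeping is an assumption about what a vector-cube row would carry):

```python
import numpy as np

M, N, T, B = 10, 20, 100, 2
cube = np.arange(M * N * T * B, dtype=float).reshape(M, N, T, B)

# Flatten (x, y) into rows and (time, band) into columns:
# one row per pixel, T*B feature columns
table = cube.reshape(M * N, T * B)
print(table.shape)  # (200, 200)

# Keep the (x, y) index of each row so the flattened rows can still be
# mapped back to their spatial location, as in a vector-cube
xy = np.indices((M, N)).reshape(2, -1).T  # shape (M*N, 2)

# Reshaping back to the raster-cube, given the target cube's shape
restored = table.reshape(cube.shape)
assert np.array_equal(restored, cube)
```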
Could it be that we have some confusion on the 'vector-cube' concept?
I tried to clarify some of this in #68. |
That's also fine, but to train an ML model we do need this vector-cube, or however we want to call it, to have just 2 dimensions. So it could also have the structure that you mentioned, where each row also has the (x,y) or polygon property that generated it, but we still need a process that reshapes the data back and forth.
Not sure if I agree. To train an ML model (and also for inference), we need to provide a matrix to the model, where the shape of that matrix indeed depends on the model.
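As an illustration of how the matrix shape depends on the model, the same cube can be flattened in different ways depending on what the model treats as a sample (a NumPy sketch with illustrative shapes, not tied to any specific model):

```python
import numpy as np

# Hypothetical cube: (x, y, time, band) = (4, 5, 3, 2)
cube = np.zeros((4, 5, 3, 2))

# A model that sees one pixel's full time series as a sample:
per_pixel = cube.reshape(4 * 5, 3 * 2)        # (20, 6)

# A model that sees each pixel/timestep independently:
per_observation = cube.reshape(4 * 5 * 3, 2)  # (60, 2)

print(per_pixel.shape, per_observation.shape)
```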
My biggest problems with the reshaping proposal:
The main argument for reusing the existing processes is simply that we have them already, and we have to teach our users how to work with them anyway. I agree that these are not the most simple processes, but for EO researchers that have ambitions to use machine learning, and probably deep learning as well, this should be well within their skillset.
There are sometimes use cases that need to "flatten" (or stack, in xarray/pandas terms) data cube dimensions. Right now, VITO uses apply_dimension + target_bands as a workaround, but that may not be fully covered by the specification.
We need to check whether we really want to use that approach long-term; it is a bit weird to use a const operation as a callback.
A better approach could be to actually define a new process.
This is already required by multiple use cases: SRR2 UC3, SRR3 UC8
It has already been discussed as part of two other issues at least: