Padding arrays with NaNs #22

TomNicholas · 2024-03-11T15:35:28Z

We want to be able to pad arrays with NaN values (or with the zarr array's fill_value), because some kerchunk use cases involve padding data (e.g. satellite swaths that produce slightly differently shaped data each day, such as NASA SWOT).


TomNicholas · 2024-03-19T16:00:33Z

We should probably be able to make the xarray .pad method work, at least with very specific arguments, e.g. ds.pad(x=1, mode='constant'). We would then intercept the arguments and re-interpret the padding operation for however the underlying manifest actually works.
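For reference, here is a minimal in-memory sketch of what a constant-mode pad of width 1 means for ordinary data. It uses plain NumPy as a stand-in (an assumption for illustration; it is not how a virtual, manifest-backed array would implement the operation), and it assumes the pad value is NaN, which requires a floating-point dtype.

```python
import numpy as np

# In-memory stand-in for one variable with a single dimension "x".
data = np.array([1.0, 2.0, 3.0])

# Pad one element on each side of the axis with NaN (constant mode).
padded = np.pad(data, pad_width=1, mode="constant", constant_values=np.nan)

# The original values are untouched; only the two new edge elements are NaN.
n_new = int(np.isnan(padded).sum())
```

The manifest-level version would have to produce the same logical result without ever loading `data`.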

dcherian · 2024-03-19T16:35:35Z

satellite swathes that produce slightly differently shaped data

How will this work, since each chunk will require a different padding (presumably)? Are you going to add a new codec?

For example consider three arrays with shapes (1,1), (1,2), (1,3), stacked along axis=0.

TomNicholas · 2024-03-19T16:41:12Z

This use case was @ayushnag's, who used kerchunk somehow to add some NaNs to the data (he had a poster at Ocean Sciences but I forgot the details). You raise a good question @dcherian, so I'm curious what exactly the final zarr arrays ended up looking like in the SWOT case.

I guess that at the very least padding the array with new empty chunks of the same size should be possible, though that might not cover many use cases. Once we have variable-sized chunks the padding would become more general.

ayushnag · 2024-03-19T16:46:35Z

I did use kerchunk for SWOT, but I wasn't able to pad the granules, so I only used the ones with the most common swath length. This was for a demo though, and ideally there would be some way to pad with NaNs.

TomNicholas · 2024-03-19T16:53:18Z

For example consider three arrays with shapes (1,1), (1,2), (1,3), stacked along axis=0.

To be concrete, if the chunksize for all these three arrays is (1,1), then empty(/nonexistent) chunks can be padded to each array in turn no problem such that the shapes match and the stacking becomes possible. I agree that if the chunksizes are actually (1,1), (1,2), (1,3) then it's a lot harder.
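A rough sketch of that easy case follows. The manifest format here (a dict mapping chunk-grid indices to reference strings) and the function names `pad_manifest` and `stack_along_axis0` are hypothetical, chosen only for illustration. It also assumes, as a simplification, that every shape is a whole number of chunks. Padding adds no entries at all: a chunk missing from the manifest is simply read back as the fill value.

```python
def pad_manifest(manifest, shape, chunks, target_shape):
    """'Pad' a virtual array by growing its shape; no chunk entries are added.

    manifest: dict mapping chunk-grid index tuples to chunk references
              (hypothetical format, for illustration only).
    """
    for old, new, c in zip(shape, target_shape, chunks):
        if new < old:
            raise ValueError("target shape must not be smaller than the current shape")
        if old % c != 0 or (new - old) % c != 0:
            raise ValueError("padding must be a whole number of chunks")
    return dict(manifest), tuple(target_shape)


def stack_along_axis0(padded, chunks):
    """Concatenate (manifest, shape) pairs along axis 0 by offsetting chunk indices."""
    combined, offset = {}, 0
    trailing = padded[0][1][1:]
    for manifest, shape in padded:
        if shape[1:] != trailing:
            raise ValueError("shapes must match on every axis except axis 0")
        for (i, *rest), ref in manifest.items():
            combined[(i + offset, *rest)] = ref
        offset += shape[0] // chunks[0]
    return combined, (offset * chunks[0], *trailing)


# Three arrays of shape (1,1), (1,2), (1,3), all with chunks (1,1).
chunks = (1, 1)
arrays = [
    ({(0, 0): "a#0"}, (1, 1)),
    ({(0, 0): "b#0", (0, 1): "b#1"}, (1, 2)),
    ({(0, 0): "c#0", (0, 1): "c#1", (0, 2): "c#2"}, (1, 3)),
]
padded = [pad_manifest(m, s, chunks, (1, 3)) for m, s in arrays]
combined, shape = stack_along_axis0(padded, chunks)
```

Here `shape` is `(3, 3)` while `combined` holds only the six real chunks; the three absent keys are the padded region.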

dcherian · 2024-03-19T16:57:14Z

I agree that if the chunksizes are actually (1,1), (1,2), (1,3) then it's a lot harder.

IIRC this is a good model for the swath problem. I think the (only?) solution is a new codec since we need to transform the chunk itself. Something like:

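import numpy as np

# chunk: a zarr.Array holding one undersized chunk; new_shape: the regular chunk shape
out = np.full(new_shape, fill_value=chunk.fill_value, dtype=chunk.dtype)
# Select the corner of `out` that the real data occupies
slicer = tuple(slice(size) for size in chunk.shape)
out[slicer] = chunk[...]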

TomNicholas · 2024-03-19T17:11:41Z

think the (only?) solution

I guess in this case even variable-length chunks wouldn't save us because this would actually require "ragged" chunks (i.e. no longer a rectilinear array of chunks)...

IIRC this is a good model for the swath problem. I think the (only?) solution is a new codec since we need to transform the chunk itself. Something like:

I don't really understand what that code snippet is meant to do, but my understanding of your suggestion is to add a codec which records the new padded shape of the chunk. Then, when the chunk is read, indexing into the unpadded part indexes into the normal zarr array, while indexing into the padded part just returns the fill_value. But isn't the problem with that idea that you would need only certain chunks to have that codec, whereas codecs are set on a per-array basis, not a per-chunk basis?

dcherian · 2024-03-19T17:50:36Z

I guess in this case even variable-length chunks wouldn't save us because this would actually require "ragged" chunks

Yes exactly.

my understanding of your suggestion is to add a codec which records the new padded shape of the chunk

The "new padded shape" is just the regularized chunk shape; it should be inferred from the array metadata.

when the chunk is read if you're indexing into the unpadded part it indexes into the normal zarr array, but if you're indexing into the padded part it just returns the fillvalue.

At read time, all chunks are processed to a regular chunk shape so the rest of Zarr does not care about what it is actually indexing.
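As a sketch of that read-time step, the function below embeds a decoded, undersized chunk in the corner of a block of the regular chunk shape. It is plain NumPy rather than an actual Zarr codec, and the name `pad_chunk` is hypothetical. It assumes a floating-point dtype so that NaN is a valid fill value.

```python
import numpy as np


def pad_chunk(chunk, regular_shape, fill_value=np.nan):
    """Return a block of regular_shape with `chunk` in its leading corner."""
    chunk = np.asarray(chunk)
    if chunk.ndim != len(regular_shape) or any(
        size > regular for size, regular in zip(chunk.shape, regular_shape)
    ):
        raise ValueError("chunk does not fit inside the regular chunk shape")
    # Start from a block filled entirely with the fill value...
    out = np.full(regular_shape, fill_value, dtype=chunk.dtype)
    # ...then overwrite the corner that the real data occupies.
    out[tuple(slice(size) for size in chunk.shape)] = chunk
    return out


# The three ragged chunks from the example: shapes (1,1), (1,2), (1,3).
ragged = [np.ones((1, 1)), np.ones((1, 2)), np.ones((1, 3))]
regular = [pad_chunk(c, (1, 3)) for c in ragged]
stacked = np.concatenate(regular, axis=0)
```

After this step every chunk has shape `(1, 3)`, so `stacked` is an ordinary `(3, 3)` array and nothing downstream needs to know the chunks were ragged.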

This was referenced Mar 14, 2024:

Approach to Null value handling #32 (Open)

In-memory representation of chunks: array instead of a dict? #33 (Closed)

[WIP] Structured array for manifest #39 (Closed)
