Templating and deterministic maps #7

rabernat · 2020-11-19T16:21:41Z

In yesterday's meeting, we discussed the widespread desire for this spec to handle deterministic, evenly spaced chunks within a file, rather than explicitly enumerating each chunk and its offset. We would need some sort of support for templates and symbolic expressions. For a 1D file, it could be something like

"{i}", ["s3://bucket/path/file.nc", "3200 * {i}", 3200]

or

"{i}", ["s3://bucket/path/file.nc", "{itemsize} * {i}", 3200]

where {itemsize} would be filled in with 3200.
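To make the 1D case concrete, here is a minimal sketch of how a reader might expand such a template back into explicit entries. The function name `expand_1d` and the `n_chunks` parameter are hypothetical (the template above does not say how many chunks exist), and the `[url, offset, size]` layout is taken from the examples above.

```python
def expand_1d(url, itemsize, n_chunks, key_template="{i}"):
    """Expand a 1D template into explicit {key: [url, offset, size]} entries.

    Assumes every chunk is `itemsize` bytes long and chunks are stored
    back to back, so chunk i starts at byte itemsize * i.
    """
    refs = {}
    for i in range(n_chunks):
        key = key_template.format(i=i)
        refs[key] = [url, itemsize * i, itemsize]
    return refs


refs_1d = expand_1d("s3://bucket/path/file.nc", itemsize=3200, n_chunks=4)
```

With these inputs, key "3" maps to an offset of 9600 bytes and a length of 3200 bytes.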

For ND it would be harder because you need to know the array shape:

"{j}.{i}", ["s3://bucket/path/file.nc", "{itemsize} * ({j} * Nx + {i})", 3200]

I can't think of a clever way around that.
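As a rough illustration of why the shape is needed, the sketch below expands the ND template for a 2D, row-major layout. It assumes `Nx` is the number of chunks along the fastest-varying axis and that the chunk grid shape is supplied alongside the template; `expand_2d` and its `shape` argument are hypothetical names, not part of any spec.

```python
from itertools import product


def expand_2d(url, itemsize, shape):
    """Expand "{j}.{i}" keys for a row-major grid of equally sized chunks.

    `shape` is (Ny, Nx), the number of chunks along each axis. The offset
    follows the expression from the issue: itemsize * (j * Nx + i).
    """
    ny, nx = shape
    refs = {}
    for j, i in product(range(ny), range(nx)):
        refs[f"{j}.{i}"] = [url, itemsize * (j * nx + i), itemsize]
    return refs


refs_2d = expand_2d("s3://bucket/path/file.nc", itemsize=3200, shape=(2, 3))
```

Without `shape`, the linear index `j * Nx + i` cannot be computed, which is the difficulty described above.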

Ideas? @manzt? @joshmoore?

martindurant · 2020-11-19T16:39:00Z

In sketching this out, we may also want to consider:

- key names that are regexes, which are well defined
- expressions that are jinja-like, including support for functions
- expressions that are simple, like "itemsize * i" (actually, if we know itemsize, I would replace it before saving)
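For the "simple expressions" option, one possible approach is to evaluate a restricted arithmetic grammar rather than calling `eval` on untrusted reference files. The sketch below is only an assumption about how that could look: `eval_expr` is a hypothetical helper that uses the standard-library `ast` module and accepts integer literals, named variables, and `+`, `-`, `*`, `//`.

```python
import ast
import operator

# Only these binary operators are allowed in an expression.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
}


def eval_expr(expr, variables):
    """Evaluate a simple integer expression such as "itemsize * i"."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.Name):
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        # Function calls, attribute access, etc. are rejected.
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expr, mode="eval"))


offset = eval_expr("itemsize * (j * Nx + i)", {"itemsize": 3200, "j": 2, "Nx": 10, "i": 3})
```

A jinja-like syntax with functions would need a larger whitelist; this sketch deliberately rejects calls.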

I wonder, is there an overlap with the discussion about coordinates specified programmatically in xarray?

To what extent does the putative option of binary storage for offsets (i.e., not JSON, perhaps zarr itself) mitigate the problem, thanks to efficient compression of integers?

Sorry if that's too many thoughts for the limits of this issue.

rabernat · 2020-11-19T16:42:54Z

Given the complexity, we might also consider punting on this and releasing an initial spec that requires explicit keys.

martindurant · 2020-11-19T16:45:13Z

A spec can be added to more easily than removed from ...

joshmoore · 2021-02-03T12:06:30Z

Martin suggested I add the feedback I received from the @bioformats team here. I asked in a general way whether anything was needed in the spec to maximize the number of formats that we could support. There's hope that some files in some of the proprietary formats will be readable, but taking just the most prevalent format (TIFF), there are a large number of edge cases that would require a callback function of some form. So a fourth argument:

(name, templated-size, templated-offset, processing-function)

since data that is interleaved, reversed, sample packed, etc. might need reversing, striding, etc. Details can be found in TiffParser.getSamples. Obviously, this is in no way a MUST (all for KISS) but it would be interesting to hear if a minimal library of such functions would help in other domains as well.
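To illustrate what a minimal library of such functions might look like, here is a small sketch. Everything in it is hypothetical: the `PROCESSORS` registry, the function names, and the idea of naming the processing function by string are assumptions for illustration, not anything defined by the spec or by Bio-Formats.

```python
def reverse_bytes(buf):
    """Return the chunk with its byte order reversed."""
    return buf[::-1]


def deinterleave(buf, samples=3):
    """Turn interleaved samples (e.g. RGBRGB...) into planar order (RR...GG...BB...)."""
    return b"".join(buf[s::samples] for s in range(samples))


# Hypothetical registry mapping a name stored in the reference to a callable.
PROCESSORS = {"reverse": reverse_bytes, "deinterleave": deinterleave}


def read_chunk(data, offset, size, processor=None):
    """Slice a chunk out of `data` and optionally post-process it."""
    raw = data[offset:offset + size]
    return PROCESSORS[processor](raw) if processor else raw


planar = read_chunk(b"xxRGBRGB", 2, 6, "deinterleave")
```

Referring to functions by name from a fixed registry, rather than embedding code in the reference file, keeps the references declarative, in line with the KISS goal mentioned above.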

cc: @melissalinkert @dgault @sbesson @cgohlke @manzt

martindurant mentioned this issue Mar 2, 2021

Spec version 1 #17

Merged

martindurant closed this as completed Apr 13, 2021

lmmx mentioned this issue Aug 17, 2021

Contributing a CSV module [RE: dask dataframe read_csv] #66

Open
