Add array or regex to data #23

Open
mmaelicke opened this issue Aug 20, 2024 · 2 comments

@mmaelicke
Member

In the data section, there may be cases in which an input dataset is associated not with a single file but with a list of files.
In these cases we can either:

  1. allow an array similar to the parameters
  2. allow regular expressions as an attribute inside the data section

In both cases, the specs cannot handle an arbitrary number of datasets. When that is needed, the developer has to fall back to specifying a directory in the parameters instead of the data section.
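To make the two options concrete, here is a minimal sketch of how each could be interpreted. The entry layout and the keys `path` and `pattern` are assumptions made for illustration; they are not part of the current specification.

```python
import re

# Hypothetical data-section entries; the keys "path" and "pattern" are
# illustrative assumptions, not part of the current specification.
array_entry = {"path": ["in/chunk_001.nc", "in/chunk_002.nc"]}
regex_entry = {"path": "in/", "pattern": r"chunk_\d{3}\.nc"}


def files_from_array(entry):
    """Option 1: the spec lists every file explicitly."""
    return list(entry["path"])


def files_from_regex(entry, available):
    """Option 2: keep the available files whose name matches the pattern."""
    prefix = entry["path"]
    rx = re.compile(entry["pattern"])
    return sorted(
        f for f in available
        if f.startswith(prefix) and rx.fullmatch(f[len(prefix):])
    )
```

With the array, the number of files is fixed when the spec is written. With the regular expression, the number of files can vary, but they still all belong to one dataset.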

An example for the multi-file case: a tool takes a netCDF dataset that is chunked into many files.

An example for the multi-dataset (multi-file) case: an aggregator or viewer tool takes a folder as input that contains data folders, similar to what the data loader creates.
I would argue that this is an edge case and that tools can usually specify the data they need.

I would prefer setting, e.g., a multi=True flag on a data spec, which effectively allows wildcards in the path.
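A minimal sketch of what this could mean for whoever resolves the data section, assuming a data entry is a plain dict with a `path` key and an optional `multi` key. The function name `resolve_data` and the decision to reject wildcards when `multi` is not set are assumptions for illustration, not an agreed design.

```python
import glob
from pathlib import Path


def resolve_data(entry: dict, base_dir: str = ".") -> list[str]:
    """Return the file paths a (hypothetical) data entry points to."""
    path = entry["path"]
    if entry.get("multi", False):
        # Wildcards are expanded only when the entry opts in via multi.
        return sorted(glob.glob(str(Path(base_dir) / path)))
    if any(ch in path for ch in "*?["):
        # Assumption: without multi, a wildcard is treated as a spec error.
        raise ValueError(f"wildcard in path requires multi=True: {path}")
    return [str(Path(base_dir) / path)]
```

For example, `resolve_data({"path": "in/precipitation/*.nc", "multi": True})` would return every matching file under the current directory, while the same path without `multi` raises an error.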

@Ash-Manoj @AlexDo1 do you have any comments on this? I am not entirely sure how to implement it, so comments are welcome.

@AlexDo1
Contributor

AlexDo1 commented Aug 20, 2024

Hm, good question.

I like the multi flag, as it states quite clearly that there can be more than one data file. Always allowing wildcards could be confusing, as it would not be clear from the specification whether multiple data files are allowed.

At the moment I'm also in favor of allowing wildcards then, as this makes it possible to be stricter in defining the file names (e.g. in/precipitation/precipitation_*.nc for in/precipitation/precipitation_2011.nc, in/precipitation/precipitation_2012.nc, in/precipitation/precipitation_2013.nc).
But the wildcard would also allow taking everything inside a folder as input data, even when the file names are not that structured, e.g. in/data/* for in/data/air_temperature.nc, in/data/discharge.csv, in/data/catchment.geojson (it would probably be a bad implementation to have that as input data, but I think it demonstrates what I mean).

So I like the flexibility of the wildcard together with the clarity of the multi flag.
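The difference between the strict and the loose pattern can be sketched with Python's standard `fnmatch` module. The helper `select` is hypothetical, and real wildcard handling in the specs may use different matching rules.

```python
from fnmatch import fnmatchcase

files = [
    "in/precipitation/precipitation_2011.nc",
    "in/precipitation/precipitation_2012.nc",
    "in/data/air_temperature.nc",
    "in/data/discharge.csv",
    "in/data/catchment.geojson",
]


def select(pattern, candidates):
    """Keep the candidates that match a shell-style wildcard pattern."""
    return [f for f in candidates if fnmatchcase(f, pattern)]


# Strict: only the yearly precipitation files match.
strict = select("in/precipitation/precipitation_*.nc", files)
# Loose: everything in the folder matches, regardless of file type.
loose = select("in/data/*", files)
```

Note that `fnmatchcase` lets `*` match across `/` as well, so a loose pattern like `in/data/*` would also pick up files in subfolders.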

@Ash-Manoj
Contributor

I also like the flag idea. We could test this on the catflow generator tool, where I think multiple TIFF files have to be read in as input.


3 participants