Add validation feature #17

Open
pont-us opened this issue Mar 9, 2021 · 7 comments
Assignees
Labels
enhancement New feature or request in progress Started working on this

Comments

@pont-us
Member

pont-us commented Mar 9, 2021

It would be useful to have an output validation option for nc2zarr, along the lines of

nc2zarr --config myconfig.yaml --validate

which would go through the input files specified in the configuration and make sure, for each data value corresponding to a variable specified in the configuration, that a matching data value exists in the configured output Zarr.

This would of course be potentially very time-consuming for large datasets; --structure-only (just check that the structure's as expected) and --sparse (validate some small but representative subset of the data) may also be useful.
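To make the --sparse idea a bit more concrete, here is a minimal sketch of one way to pick a small set of index positions to compare. The function name `sparse_indices` and its `per_dim` parameter are hypothetical and not part of nc2zarr. The sketch assumes that evenly spaced positions along each dimension, always including both ends, are representative enough.

```python
from itertools import product


def sparse_indices(shape, per_dim=3):
    """Return an iterator of index tuples sampling an array of this shape.

    Hypothetical helper, not part of nc2zarr. Along each dimension it
    picks up to `per_dim` evenly spaced positions, always including the
    first and last index, and combines them as a Cartesian product.
    """
    if per_dim < 2:
        raise ValueError("per_dim must be at least 2")
    axes = []
    for size in shape:
        if size <= per_dim:
            # Small dimension: check every index.
            positions = list(range(size))
        else:
            step = (size - 1) / (per_dim - 1)
            # A set removes duplicates that rounding might produce.
            positions = sorted({round(i * step) for i in range(per_dim)})
        axes.append(positions)
    return product(*axes)
```

Each yielded tuple can then be used to index the same position in a source variable and in the corresponding output variable, so the number of comparisons grows with `per_dim` rather than with the size of the dataset.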

@pont-us pont-us added the enhancement New feature or request label Mar 9, 2021
@forman
Member

forman commented Mar 10, 2021

I agree, this is very useful, but I disagree with adding another handful of CLI options.

I suggest verifying that the output is readable through xr.open_dataset(path, **open_kwargs). We can configure this as a new top-level entry in the config. As a second step, we could later support assertions on the dataset content:

verify:
  open_kwargs:
    engine: "zarr"
    decode_cf: true
  assertions:
    attrs:
      ...
    data_vars:
      ...
    coord_vars:
      ...
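As a rough illustration of the first step, the `open_kwargs` from such a `verify` entry could be passed straight to `xr.open_dataset`. This is only a sketch: `verify_output` and its `opener` parameter are hypothetical names, not nc2zarr's API, and the `assertions` part is deliberately left out.

```python
def verify_output(path, verify_config=None, opener=None):
    """Try to open the converted dataset and return (ok, message).

    Hypothetical helper. `verify_config` mirrors the proposed top-level
    `verify` entry; only its `open_kwargs` are used here.
    """
    if opener is None:
        # Imported lazily so that a different opener can be injected.
        import xarray as xr
        opener = xr.open_dataset
    open_kwargs = dict((verify_config or {}).get("open_kwargs") or {})
    try:
        dataset = opener(path, **open_kwargs)
    except Exception as error:
        # Any failure to open the output counts as a failed verification.
        return False, f"cannot open {path}: {error}"
    close = getattr(dataset, "close", None)
    if callable(close):
        close()
    return True, f"opened {path} with {open_kwargs}"
```

Returning a status and message instead of raising is one possible design choice; it would let the caller decide whether a failed verification should abort the run or only be logged.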

@forman
Member

forman commented Mar 10, 2021

Note, I'd call the feature "verify" rather than "validate", because "validate" would imply also analysing the dataset's values, which I feel is out of scope for nc2zarr.

@forman
Member

forman commented Mar 10, 2021

Another note:

  • a --verify CLI flag could be used to perform a verification with xr.open_dataset(path) with defaults, even without a verify entry in the config;
  • a --verify-off flag could be used to skip verification, even with a verify entry in the config.
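The interplay between the two flags and the config entry could be resolved roughly as follows. This is a sketch only: it uses argparse purely for illustration and makes no claim about how nc2zarr's CLI is actually built. The function names are hypothetical, and making the two flags mutually exclusive is an assumption.

```python
import argparse


def build_parser():
    """Build an illustrative parser with the two proposed flags."""
    parser = argparse.ArgumentParser(prog="nc2zarr")
    parser.add_argument("--config")
    # Assumption: passing both flags at once is treated as an error.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--verify", action="store_true")
    group.add_argument("--verify-off", action="store_true")
    return parser


def should_verify(args, config):
    """Decide whether to verify, given parsed args and the config dict."""
    if args.verify_off:
        # --verify-off wins even if the config has a verify entry.
        return False
    # --verify forces verification; otherwise the config decides.
    return args.verify or "verify" in config
```

For example, `should_verify(build_parser().parse_args(["--verify-off"]), {"verify": {}})` would evaluate to `False`, while the same config without any flag would enable verification.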

@pont-us
Member Author

pont-us commented Mar 10, 2021

a --verify CLI flag could be used to perform a verification with xr.open_dataset(path) with defaults, even without a verify entry in the config;

This is what I had in mind, at least initially: verification just using the existing configuration for input, by checking that every specified input value (or some representative sample) is also present in the output. Of course that doesn't rule out adding a more elaborate, configurable verification facility later.

On reflection, I agree about calling it "verify". "Validate" probably implies something a bit deeper than just checking input.var[lat, lon, t] == output.var[lat, lon, t] for all variables and co-ordinates.
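A minimal sketch of that kind of check, assuming both the source and the converted output have already been opened as xarray datasets that fit comfortably in memory. `find_mismatches` is a hypothetical name, and a real implementation would presumably compare chunk by chunk rather than whole variables at once.

```python
import xarray as xr


def find_mismatches(source: xr.Dataset, output: xr.Dataset):
    """Return a list of messages for variables that differ.

    Hypothetical helper. Iterates over data variables and co-ordinates
    of the source and compares each with its counterpart in the output.
    """
    mismatches = []
    for name in source.variables:
        if name not in output.variables:
            mismatches.append(f"{name}: missing in output")
        elif not source[name].equals(output[name]):
            # DataArray.equals treats NaNs in the same positions as equal.
            mismatches.append(f"{name}: values differ")
    return mismatches
```

An empty list would mean every source variable was found in the output with equal values, dimensions and co-ordinates; attributes are not compared in this sketch.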

@forman forman self-assigned this Mar 10, 2021
@forman
Member

forman commented Mar 10, 2021

Hi @pont-us, I've started an implementation; you may already have a look.

@forman forman added the in progress Started working on this label Mar 11, 2021
@pont-us
Member Author

pont-us commented Jun 21, 2021

Branch containing implementation: https://github.com/bcdev/nc2zarr/tree/forman-17-verify_dataset

@pont-us
Member Author

pont-us commented Jun 21, 2021

Probably not relevant to the main verification / validation implementation, but just for reference: I've committed a small standalone validation script, which I'm using to validate the converted Zarrs against selected source NetCDFs for the next deliverable.
