
Add validate command #46

Closed
jqnatividad opened this issue Sep 23, 2021 · 18 comments
Assignees
Labels
enhancement New feature or request. Once marked with this label, it's in the backlog.
Milestone

Comments

@jqnatividad
Collaborator

Checks a CSV against a jsonschema file.

The jsonschema file can be located on the filesystem or a URL.
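The file-or-URL dispatch can be sketched with a scheme-prefix check (a hypothetical helper for illustration; qsv's actual argument handling may differ):

```rust
// Hedged sketch: treat the <json-schema> argument as a URL when it has an
// http(s) scheme, and as a filesystem path otherwise. `is_url` is a
// hypothetical helper, not qsv's actual implementation.
fn is_url(schema_arg: &str) -> bool {
    schema_arg.starts_with("http://") || schema_arg.starts_with("https://")
}
```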

@jqnatividad jqnatividad added the enhancement New feature or request. Once marked with this label, its in the backlog. label Sep 23, 2021
@jqnatividad jqnatividad self-assigned this Sep 23, 2021
@jqnatividad jqnatividad added this to the 0.17.0 milestone Oct 3, 2021
@jqnatividad jqnatividad modified the milestones: 0.17.0, 0.18.0 Oct 13, 2021
@jqnatividad jqnatividad changed the title Add "validate" command Add validate command Oct 13, 2021
@mhuang74
Contributor

Some good ideas for reference:

theodi/shared#142

@mhuang74
Contributor

@jqnatividad Please let me know if you have some thoughts on the primary use cases.

  • identify all schema/value errors in CSV and fix them
  • quickly determine if CSV data matches expected schema
  • generate schema from CSV

Here's a first pass at the docopt:

Validate CSV data with JSON Schema.

Usage:
    qsv validate [options] [<input>] <json-schema>

validate options:
    -c, --new-column <name>    Put error(s) in a new column instead.
    --fail-fast                Stops on first error.
    --generate-schema          Generate JSON Schema from data.


Common options:
    -h, --help                 Display this message
    -o, --output <file>        Write output to <file> instead of stdout.
    -n, --no-headers           When set, the first row will not be interpreted
                               as headers. Namely, it will be sorted with the rest
                               of the rows. Otherwise, the first row will always
                               appear as the header row in the output.
    -d, --delimiter <arg>      The field delimiter for reading CSV data.
                               Must be a single character. (default: ,)
    -q, --quiet                Don't show progress bars.

@jqnatividad
Collaborator Author

jqnatividad commented Jan 10, 2022

Hi @mhuang74,

Awesome! I'm keen to see validate implemented.

The primary use cases are the ones you listed, but I think we should break it up into three commands.

validate - 1) identify all schema/value errors and 2) determine if CSV data matches the expected schema, with two modes:

  1. Given a jsonschema file - validate using the schema. The schema can be a file or a URL where the jsonschema can be found. It produces three outputs:
  • A valid-data CSV file. All records passing validation (a -valid.csv suffix is added automatically if the --valid option is not specified).
  • An invalid CSV file. All records failing validation (an -invalid.csv suffix is added automatically if the --invalid option is not specified).
  • A validation report. A human-readable, but also machine-readable, file that can be used by another command to fix invalid rows. (See fix-data below.)

    csvlint.io is a good reference for this (and the other ODI ideas you pointed out - Support validating a CSV file against a simple JSON schema theodi/shared#142).

    It will also have a --fail-fast mode that skips producing these three outputs for this potentially expensive operation and just gives a non-zero exit code (perhaps the row number of the first invalid record?).
  2. Without a jsonschema - it just validates that the CSV is valid per the RFC 4180 standard.

  For both modes, a zero exit code means the CSV file is valid.

jsonschema or schema - generate schema from CSV. I was thinking we should leverage the output of stats and optionally frequency to determine the data types and the range/domain of each field.

fix-data (to be consistent with fix-lengths) - for the rows failing validation, an -invalid.csv with some error metadata columns is produced by validate. This command allows the corrections to be processed at scale (e.g. invalid values set to NULL/None/empty or a given value for a field; invalid values set to a default value or a default error value (per schema?)), with safe default "corrections" applied if corrections are not explicitly specified.

Thoughts?
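The valid/invalid split described above can be sketched as a simple partition (std-only; `record_is_valid` is a stand-in for real per-record schema validation, not qsv's actual code):

```rust
// Stand-in predicate for per-record schema validation.
fn record_is_valid(record: &str) -> bool {
    !record.is_empty()
}

// Partition records into the contents of the -valid.csv and -invalid.csv
// outputs described above, preserving record order within each output.
fn split_records(records: Vec<String>) -> (Vec<String>, Vec<String>) {
    records.into_iter().partition(|r| record_is_valid(r))
}
```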

@jqnatividad jqnatividad pinned this issue Jan 10, 2022
@jqnatividad
Collaborator Author

jqnatividad commented Jan 10, 2022

As for the docopt format, it looks good, but we can tackle that once we agree on the general approach.

However, with that said, I think this is where multi-threading can really make a big difference: we can break the validation into chunks, since each row can be validated independently, so we should include the --jobs option.
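The chunked, row-independent validation idea can be sketched with std threads (a thread pool like rayon would likely be used in practice; `row_is_valid` is a stand-in for schema validation):

```rust
use std::thread;

// Stand-in for per-row schema validation; each row can be checked
// independently of the others.
fn row_is_valid(row: &str) -> bool {
    !row.contains("bad")
}

// Split rows into up to `jobs` chunks and validate each chunk on its own
// thread, preserving row order in the combined result.
fn validate_parallel(rows: Vec<String>, jobs: usize) -> Vec<bool> {
    let jobs = jobs.max(1);
    let chunk_size = ((rows.len() + jobs - 1) / jobs).max(1);
    let mut handles = Vec::new();
    for chunk in rows.chunks(chunk_size) {
        let chunk = chunk.to_vec();
        handles.push(thread::spawn(move || {
            chunk.iter().map(|r| row_is_valid(r)).collect::<Vec<bool>>()
        }));
    }
    // Joining in spawn order keeps results aligned with input order.
    handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
}
```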

@mhuang74
Contributor

mhuang74 commented Jan 19, 2022

@jqnatividad here's my POC for the validate command using the jsonschema-rs crate.

#145

Example run documented in Validate.md.

One problem that I need help with: the progress bar panics if the CSV file has an unequal number of columns (util.rs:117). I am not sure how to refactor this. Any thoughts?

$ cat invalid.csv
header1,header2
1,2,3
foo,bar
quux,baz,foobar
hello, world

$ RUST_BACKTRACE=1 ./target/debug/qsv validate invalid.csv
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error(UnequalLengths { pos: Some(Position { byte: 16, line: 2, record: 1 }), expected_len: 2, len: 3 })', src/util.rs:117:53
stack backtrace:
   0: rust_begin_unwind
             at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/std/src/panicking.rs:498:5
   1: core::panicking::panic_fmt
             at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/core/src/panicking.rs:107:14
   2: core::result::unwrap_failed
             at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/core/src/result.rs:1613:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/core/src/result.rs:1295:23
   4: qsv::util::count_rows
             at ./src/util.rs:117:19
   5: qsv::cmd::validate::run
             at ./src/cmd/validate.rs:88:28
   6: qsv::Command::run
             at ./src/main.rs:327:34
   7: qsv::main
             at ./src/main.rs:177:28
   8: core::ops::function::FnOnce::call_once
             at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/core/src/ops/function.rs:227:5

It works fine when the progress bar is quieted:

$ RUST_BACKTRACE=1 ./target/debug/qsv validate invalid.csv --quiet
CSV error: record 1 (line: 2, byte: 16): found record with 3 fields, but the previous record has 2 fields
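The consistency check that trips here can be sketched std-only (naive comma split, no quote handling, for illustration only; the real code uses the csv crate):

```rust
// Naive RFC 4180-style consistency check: every record must have the same
// number of fields as the header. No quote/escape handling -- a sketch of
// the check, not the csv crate's actual parser.
fn check_field_counts(data: &str) -> Result<(), String> {
    let mut lines = data.lines();
    let expected = match lines.next() {
        Some(header) => header.split(',').count(),
        None => return Ok(()),
    };
    for (i, line) in lines.enumerate() {
        let found = line.split(',').count();
        if found != expected {
            return Err(format!(
                "record {} (line {}): found record with {} fields, but the previous record has {} fields",
                i + 1,
                i + 2,
                found,
                expected
            ));
        }
    }
    Ok(())
}
```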

@jqnatividad
Collaborator Author

jqnatividad commented Jan 19, 2022

@mhuang74 thanks for this!

The problem is that the progress bar does a count à la qsv count, and since the file is invalid, it panics when a record's length doesn't match the header's.

The fix is relatively simple: just set flexible to true when setting up the ReaderBuilder in Config.

And while we're on the progress bar, you may want to use the new update_cache_info! macro.

I'll add comments to the PR for both changes...
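The effect of the fix can be sketched std-only: a row counter that tolerates unequal field counts instead of unwrap()-ing a CSV error (the real fix just sets flexible(true) on the csv crate's ReaderBuilder via Config):

```rust
// "Flexible" row count: counts non-empty data rows regardless of per-row
// field counts, so a ragged file like invalid.csv above doesn't panic the
// progress bar setup. Skips the header line. Illustration only.
fn count_rows_flexible(data: &str) -> usize {
    data.lines().skip(1).filter(|l| !l.trim().is_empty()).count()
}
```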

@mhuang74
Contributor

Thanks @jqnatividad. The above problem is resolved, and the simple CSV-check integration tests are passing now.

I need to add another test to properly check for output files.

Any suggestions on how to check for output files?

@jqnatividad
Collaborator Author

Great!

As for testing the output files, having a sample CSV and a corresponding, relatively complex jsonschema (the PublicToilets schema on csvlint seems interesting and suitably complex), with expected valid and invalid rows, should be good enough.
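One way to sketch the output-file assertion (std-only; the helper and the row-count convention are illustrative, not qsv's actual test harness):

```rust
use std::fs;
use std::path::Path;

// Count non-empty data rows in an output file written by validate, so an
// integration test can assert the expected number of valid/invalid records.
fn output_row_count(path: &Path) -> std::io::Result<usize> {
    let contents = fs::read_to_string(path)?;
    Ok(contents.lines().filter(|l| !l.trim().is_empty()).count())
}
```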

@mhuang74
Contributor

mhuang74 commented Jan 21, 2022

@jqnatividad I need some help understanding the standards around JSON schema, and whether the choice of the jsonschema crate is the right approach. It appears to work for the dependent project cjval, which validates CityJSON files, but it doesn't work for the Public Toilet schema.

It looks like the Public Toilets json schema conforms to JSON Table Schema rather than the JSON Schema standard that the jsonschema-rs crate assumes.

I attempted to run the public toilet schema through jsonschema-rs for validation, and it did not pick up and apply the constraints.

Specifically, the "keywordLocation" and "instanceLocation" are all blank.

What jsonschema-rs outputs with the example People schema from the JSON Schema site:

[src/cmd/validate.rs:143] &validation_result = Object({
    "valid": Bool(
        false,
    ),
    "errors": Array([
        Object({
            "keywordLocation": String(
                "/properties/age/type",
            ),
            "instanceLocation": String(
                "/age",
            ),
            "absoluteKeywordLocation": String(
                "https://example.com/properties/age/type",
            ),
            "error": String(
                "\"21\" is not of type \"integer\"",
            ),
        }),
    ]),
})

A snippet of what jsonschema-rs outputs with the Public Toilet schema:

[src/cmd/validate.rs:143] &validation_result = Object({
    "valid": Bool(
        true,
    ),
    "annotations": Array([
        Object({
            "keywordLocation": String(
                "",
            ),
            "instanceLocation": String(
                "",
            ),
            "annotations": Object({
                "fields": Array([
                    Object({
                        "name": String(
                            "ExtractDate",
                        ),
                        "title": String(
                            "Extract Date",
                        ),
                        "description": String(
                            "The date that the data was last extracted from its source database or manually updated",
                        ),
                        "constraints": Object({
                            "required": String(
                                "true",
                            ),
                            "type": String(
                                "http://www.w3.org/2001/XMLSchema#string",
                            ),
                            "pattern": String(
                                "(^([0-2]\\d|[3][0-1])\\/([0]\\d|[1][0-2])\\/[1-2]\\d{3}((\\s([0-1]\\d|[2][0-3]):[0-5]\\d(:[0-5]\\d)?)?)|)$",
                            ),
                        }),
                    }),


@jqnatividad
Collaborator Author

jqnatividad commented Jan 21, 2022

@mhuang74 let's just create a test schema & a corresponding CSV we can use.

jsonschema seems to be the most mature library we can leverage so let's stick with it (not to mention the same author wrote the infers-jsonschema crate that we can leverage for the schema command).

Although JSON Table Schema is a specialized standard for tabular data (read: CSVs), there seems to be no other crate we can leverage for this non-trivial validation task. But since JSON Table Schema is a subset of the IETF JSON Schema standard, it should be good enough.

What about creating a sample schema using the sample NYC 311 data that's used in qsv's benchmark script?

It has different data types and opportunities for exercising validation tasks that jsonschema can do; it's a good showcase example; and we can baseline validate's performance with the benchmark.

@mhuang74
Contributor

mhuang74 commented Jan 22, 2022

Thanks @jqnatividad. They all require hand-crafting a schema file, so I stuck with the Public Toilets dataset.

Integration test uses real data and illustrates invalid row showing up in output.invalid file.

@jqnatividad
Collaborator Author

Public toilet it is! :)

Lemme know when you want me to merge the PR. Convert it to draft in the meantime?

@mhuang74
Contributor

Thanks @jqnatividad. I think it's ready for beta release. What do you think?

@jqnatividad
Collaborator Author

Yep! Prepping for 0.3.0 release... :)

Did some minor tweaks though - #148

@mhuang74
Contributor

Thanks for the fixups @jqnatividad. Good catch on the progress bar. And I just realized that named parameters are supported.

@jqnatividad
Collaborator Author

@mhuang74, our MSRV is 1.58.1, so might as well use the latest features.

And I just tweaked validate further to create the error report as a jsonl file - easier to digest for both humans and machines 😏, with the ability to convert it to CSV too with the jsonl command - #149
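A minimal sketch of the JSONL shape (one JSON object per line, hand-formatted with std only; the field names here are illustrative, not validate's actual report schema, and real code would use a JSON library):

```rust
// Format one error-report line as a JSON object. Field names (`row`,
// `field`, `error`) are assumptions for illustration; qsv's actual
// report fields may differ, and values are not JSON-escaped here.
fn error_report_line(row: usize, field: &str, error: &str) -> String {
    format!("{{\"row\":{},\"field\":\"{}\",\"error\":\"{}\"}}", row, field, error)
}
```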

@jqnatividad
Collaborator Author

jqnatividad commented Jan 23, 2022

@mhuang74 and 0.3.0 featuring validate is released and we can now close this issue!

However, I had to limit cross-compilation for now, so binaries for certain platforms are not published (c54a2f2).

Good news is that I managed to trace the problem to jsonschema, and the fix has been merged (Stranger6667/jsonschema#336), and we'll be able to publish binaries for those platforms again in the next release.

Thanks heaps again for your contributions!!!

Are you up to taking on schema (#60)? 😉 It need not use stats and frequency for the first release; it can just use infers-jsonschema...

@mhuang74
Contributor

mhuang74 commented Jan 24, 2022

Thanks for all your help with tidying up the code @jqnatividad.

One more bug-fix PR for validate: #151

Sure, this has been great for my Rust learning. Happy to pick up schema (#60) next.

@jqnatividad jqnatividad unpinned this issue Jan 27, 2022