
Add validate command #46

Closed
jqnatividad opened this issue Sep 23, 2021 · 18 comments
Assignees
Labels
enhancement New feature or request. Once marked with this label, it's in the backlog.
Milestone

Comments

@jqnatividad
Collaborator

Checks a CSV against a jsonschema file.

The jsonschema file can be located on the filesystem or a URL.
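The file-or-URL dispatch can be sketched with a scheme-prefix check (a hypothetical helper for illustration; qsv's actual argument handling may differ):

```rust
// Hedged sketch: treat the <json-schema> argument as a URL when it has an
// http(s) scheme, and as a filesystem path otherwise. `is_url` is a
// hypothetical helper, not qsv's actual implementation.
fn is_url(schema_arg: &str) -> bool {
    schema_arg.starts_with("http://") || schema_arg.starts_with("https://")
}
```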

@jqnatividad jqnatividad added the enhancement New feature or request. Once marked with this label, its in the backlog. label Sep 23, 2021
@jqnatividad jqnatividad self-assigned this Sep 23, 2021
@jqnatividad jqnatividad added this to the 0.17.0 milestone Oct 3, 2021
@jqnatividad jqnatividad modified the milestones: 0.17.0, 0.18.0 Oct 13, 2021
@jqnatividad jqnatividad changed the title Add "validate" command Add validate command Oct 13, 2021
@mhuang74
Contributor

Some good ideas for reference:

theodi/shared#142

@mhuang74
Contributor

@jqnatividad Please let me know if you have some thoughts on the primary use cases.

  • identify all schema/value errors in CSV and fix them
  • quickly determine if CSV data matches expected schema
  • generate schema from CSV

Here's a first pass at the docopt:

Validate CSV data with JSON Schema.

Usage:
    qsv validate [options] [<input>] <json-schema>

validate options:
    -c, --new-column <name>    Put error(s) in a new column instead.
    --fail-fast                Stops on first error.
    --generate-schema          Generate JSON Schema from data.


Common options:
    -h, --help                 Display this message
    -o, --output <file>        Write output to <file> instead of stdout.
    -n, --no-headers           When set, the first row will not be interpreted
                               as headers. Namely, it will be sorted with the rest
                               of the rows. Otherwise, the first row will always
                               appear as the header row in the output.
    -d, --delimiter <arg>      The field delimiter for reading CSV data.
                               Must be a single character. (default: ,)
    -q, --quiet                Don't show progress bars.

@jqnatividad
Collaborator Author

jqnatividad commented Jan 10, 2022

Hi @mhuang74,

Awesome! I'm keen to see validate implemented.

The primary use cases are the ones you listed, but I think we should break it up into three commands.

validate - 1) identify all schema/value errors and 2) determine if CSV data matches the expected schema, with two modes:

  1. Given a jsonschema file - validate using the schema. The schema can be a file or a URL where the jsonschema can be found. It produces three outputs:
  • A valid-data CSV file. All records passing validation (a -valid.csv suffix is added automatically if the --valid option is not specified).
  • An invalid CSV file. All records failing validation (an -invalid.csv suffix is added automatically if the --invalid option is not specified).
  • A validation report. A human-readable, but also machine-readable, file that can be used by another command to fix invalid rows. (See fix-data below.)

    csvlint.io is a good reference for this (and the other ODI ideas you pointed out - Support validating a CSV file against a simple JSON schema theodi/shared#142).

    It will also have a --fail-fast mode that skips producing these three outputs for this potentially expensive operation and just gives a non-zero exit code (perhaps the row number of the first invalid record?).
  2. Without a jsonschema - it just validates that the CSV is valid per the RFC 4180 standard.

  For both modes, a zero exit code means the CSV file is valid.

jsonschema or schema - generate schema from CSV. I was thinking we should leverage the output of stats and optionally frequency to determine the data types and the range/domain of each field.

fix-data (to be consistent with fix-lengths) - for the rows failing validation, an -invalid.csv with some error metadata columns is produced by validate. This command allows the corrections to be processed at scale (e.g. invalid values set to NULL/None/empty or a given value for a field; invalid values set to a default value or a default error value (per schema?)), with safe default "corrections" applied if corrections are not explicitly specified.

Thoughts?
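The valid/invalid split described above can be sketched as a simple partition (std-only; `record_is_valid` is a stand-in for real per-record schema validation, not qsv's actual code):

```rust
// Stand-in predicate for per-record schema validation.
fn record_is_valid(record: &str) -> bool {
    !record.is_empty()
}

// Partition records into the contents of the -valid.csv and -invalid.csv
// outputs described above, preserving record order within each output.
fn split_records(records: Vec<String>) -> (Vec<String>, Vec<String>) {
    records.into_iter().partition(|r| record_is_valid(r))
}
```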

@jqnatividad jqnatividad pinned this issue Jan 10, 2022
@jqnatividad
Collaborator Author

jqnatividad commented Jan 10, 2022

As for the docopt format, it looks good, but we can tackle that once we agree on the general approach.

However, with that said, I think this is where multi-threading can really make a big difference: we can break the validation into chunks, since each row can be validated independently, so we should include the --jobs option.
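The chunked, row-independent validation idea can be sketched with std threads (a thread pool like rayon would likely be used in practice; `row_is_valid` is a stand-in for schema validation):

```rust
use std::thread;

// Stand-in for per-row schema validation; each row can be checked
// independently of the others.
fn row_is_valid(row: &str) -> bool {
    !row.contains("bad")
}

// Split rows into up to `jobs` chunks and validate each chunk on its own
// thread, preserving row order in the combined result.
fn validate_parallel(rows: Vec<String>, jobs: usize) -> Vec<bool> {
    let jobs = jobs.max(1);
    let chunk_size = ((rows.len() + jobs - 1) / jobs).max(1);
    let mut handles = Vec::new();
    for chunk in rows.chunks(chunk_size) {
        let chunk = chunk.to_vec();
        handles.push(thread::spawn(move || {
            chunk.iter().map(|r| row_is_valid(r)).collect::<Vec<bool>>()
        }));
    }
    // Joining in spawn order keeps results aligned with input order.
    handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
}
```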

@mhuang74
Contributor

mhuang74 commented Jan 19, 2022

@jqnatividad here's my POC for the validate command using the jsonschema-rs crate.

#145

Example run documented in Validate.md.

One problem that I need help with: the progress bar panics if the CSV file has an unequal number of columns (util.rs:117). I am not sure how to refactor this. Any thoughts?

$ cat invalid.csv
header1,header2
1,2,3
foo,bar
quux,baz,foobar
hello, world

$ RUST_BACKTRACE=1 ./target/debug/qsv validate invalid.csv
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error(UnequalLengths { pos: Some(Position { byte: 16, line: 2, record: 1 }), expected_len: 2, len: 3 })', src/util.rs:117:53
stack backtrace:
   0: rust_begin_unwind
             at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/std/src/panicking.rs:498:5
   1: core::panicking::panic_fmt
             at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/core/src/panicking.rs:107:14
   2: core::result::unwrap_failed
             at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/core/src/result.rs:1613:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/core/src/result.rs:1295:23
   4: qsv::util::count_rows
             at ./src/util.rs:117:19
   5: qsv::cmd::validate::run
             at ./src/cmd/validate.rs:88:28
   6: qsv::Command::run
             at ./src/main.rs:327:34
   7: qsv::main
             at ./src/main.rs:177:28
   8: core::ops::function::FnOnce::call_once
             at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/core/src/ops/function.rs:227:5

It works fine when the progress bar is quieted:

$ RUST_BACKTRACE=1 ./target/debug/qsv validate invalid.csv --quiet
CSV error: record 1 (line: 2, byte: 16): found record with 3 fields, but the previous record has 2 fields
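The consistency check that trips here can be sketched std-only (naive comma split, no quote handling, for illustration only; the real code uses the csv crate):

```rust
// Naive RFC 4180-style consistency check: every record must have the same
// number of fields as the header. No quote/escape handling -- a sketch of
// the check, not the csv crate's actual parser.
fn check_field_counts(data: &str) -> Result<(), String> {
    let mut lines = data.lines();
    let expected = match lines.next() {
        Some(header) => header.split(',').count(),
        None => return Ok(()),
    };
    for (i, line) in lines.enumerate() {
        let found = line.split(',').count();
        if found != expected {
            return Err(format!(
                "record {} (line {}): found record with {} fields, but the previous record has {} fields",
                i + 1,
                i + 2,
                found,
                expected
            ));
        }
    }
    Ok(())
}
```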

@jqnatividad
Collaborator Author

jqnatividad commented Jan 19, 2022

@mhuang74 thanks for this!

The problem is that the progress bar does a count à la qsv count, and since the file is invalid, it panics when a record's length doesn't match the header's.

The fix is relatively simple: just set flexible to true when setting up the ReaderBuilder in Config.

And while we're on the progress bar, you may want to use the new update_cache_info! macro.

I'll add comments to the PR for both changes...
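The effect of the fix can be sketched std-only: a row counter that tolerates unequal field counts instead of unwrap()-ing a CSV error (the real fix just sets flexible(true) on the csv crate's ReaderBuilder via Config):

```rust
// "Flexible" row count: counts non-empty data rows regardless of per-row
// field counts, so a ragged file like invalid.csv above doesn't panic the
// progress bar setup. Skips the header line. Illustration only.
fn count_rows_flexible(data: &str) -> usize {
    data.lines().skip(1).filter(|l| !l.trim().is_empty()).count()
}
```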

@mhuang74
Contributor

Thanks @jqnatividad. The above problem is resolved, and the simple CSV-check integration tests are passing now.

I need to add another test to properly check for output files.

Any suggestions on how to check for output files?

@jqnatividad
Collaborator Author

Great!

As for testing the output files, having a sample CSV and a corresponding, relatively complex jsonschema (the PublicToilets schema on csvlint seems interesting and suitably complex), with expected valid and invalid rows, should be good enough.
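One way to sketch the output-file assertion (std-only; the helper and the row-count convention are illustrative, not qsv's actual test harness):

```rust
use std::fs;
use std::path::Path;

// Count non-empty data rows in an output file written by validate, so an
// integration test can assert the expected number of valid/invalid records.
fn output_row_count(path: &Path) -> std::io::Result<usize> {
    let contents = fs::read_to_string(path)?;
    Ok(contents.lines().filter(|l| !l.trim().is_empty()).count())
}
```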

@mhuang74
Contributor

mhuang74 commented Jan 21, 2022

@jqnatividad I need some help understanding the standards around JSON schema, and whether the choice of the jsonschema crate is the right approach. It appears to work for the dependent project cjval, which validates CityJSON files, but it doesn't work for the Public Toilet schema.

It looks like the Public Toilets json schema conforms to JSON Table Schema rather than the JSON Schema standard that the jsonschema-rs crate assumes.

I attempted to run the public toilet schema through jsonschema-rs for validation, and it did not pick up and apply the constraints.

Specifically, the "keywordLocation" and "instanceLocation" are all blank.

What jsonschema-rs outputs with the example People schema from the JSON Schema site:

[src/cmd/validate.rs:143] &validation_result = Object({
    "valid": Bool(
        false,
    ),
    "errors": Array([
        Object({
            "keywordLocation": String(
                "/properties/age/type",
            ),
            "instanceLocation": String(
                "/age",
            ),
            "absoluteKeywordLocation": String(
                "https://example.com/properties/age/type",
            ),
            "error": String(
                "\"21\" is not of type \"integer\"",
            ),
        }),
    ]),
})

A snippet of what jsonschema-rs outputs with the Public Toilet schema:

[src/cmd/validate.rs:143] &validation_result = Object({
    "valid": Bool(
        true,
    ),
    "annotations": Array([
        Object({
            "keywordLocation": String(
                "",
            ),
            "instanceLocation": String(
                "",
            ),
            "annotations": Object({
                "fields": Array([
                    Object({
                        "name": String(
                            "ExtractDate",
                        ),
                        "title": String(
                            "Extract Date",
                        ),
                        "description": String(
                            "The date that the data was last extracted from its source database or manually updated",
                        ),
                        "constraints": Object({
                            "required": String(
                                "true",
                            ),
                            "type": String(
                                "http://www.w3.org/2001/XMLSchema#string",
                            ),
                            "pattern": String(
                                "(^([0-2]\\d|[3][0-1])\\/([0]\\d|[1][0-2])\\/[1-2]\\d{3}((\\s([0-1]\\d|[2][0-3]):[0-5]\\d(:[0-5]\\d)?)?)|)$",
                            ),
                        }),
                    }),


@jqnatividad
Collaborator Author

jqnatividad commented Jan 21, 2022

@mhuang74 let's just create a test schema & a corresponding CSV we can use.

jsonschema seems to be the most mature library we can leverage so let's stick with it (not to mention the same author wrote the infers-jsonschema crate that we can leverage for the schema command).

Although JSON Table Schema is a specialized standard for tabular data (read: CSVs), there seems to be no other crate we can leverage for this non-trivial validation task. But since JSON Table Schema is a subset of the IETF JSON Schema standard, it should be good enough.

What about creating a sample schema using the sample NYC 311 data that's used in qsv's benchmark script?

It has different data types and opportunities for exercising validation tasks that jsonschema can do; it's a good showcase example; and we can baseline validate's performance with the benchmark.

@mhuang74
Contributor

mhuang74 commented Jan 22, 2022

Thanks @jqnatividad. They all require hand-crafting a schema file, so I stuck with the Public Toilets dataset.

Integration test uses real data and illustrates invalid row showing up in output.invalid file.

@jqnatividad
Collaborator Author

Public toilet it is! :)

Lemme know when you want me to merge the PR. Convert it to draft in the meantime?

@mhuang74
Contributor

Thanks @jqnatividad. I think it's ready for beta release. What do you think?

@jqnatividad
Collaborator Author

Yep! Prepping for 0.3.0 release... :)

Did some minor tweaks though - #148

@mhuang74
Contributor

Thanks for the fixups @jqnatividad. Good catch on the progress bar. And I just realized that named parameters are supported.

@jqnatividad
Collaborator Author

@mhuang74, our MSRV is 1.58.1, so might as well use the latest features.

And I just tweaked validate further to create the error report as a jsonl file - easier to digest for both humans and machines 😏, with the ability to convert it to CSV too with the jsonl command - #149
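A minimal sketch of the JSONL shape (one JSON object per line, hand-formatted with std only; the field names here are illustrative, not validate's actual report schema, and real code would use a JSON library):

```rust
// Format one error-report line as a JSON object. Field names (`row`,
// `field`, `error`) are assumptions for illustration; qsv's actual
// report fields may differ, and values are not JSON-escaped here.
fn error_report_line(row: usize, field: &str, error: &str) -> String {
    format!("{{\"row\":{},\"field\":\"{}\",\"error\":\"{}\"}}", row, field, error)
}
```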

@jqnatividad
Collaborator Author

jqnatividad commented Jan 23, 2022

@mhuang74 and 0.3.0 featuring validate is released and we can now close this issue!

However, I had to limit cross-compilation for now, so binaries for certain platforms are not published (c54a2f2).

Good news is that I managed to trace the problem to jsonschema, and the fix has been merged (Stranger6667/jsonschema#336), and we'll be able to publish binaries for those platforms again in the next release.

Thanks heaps again for your contributions!!!

Are you up to taking on schema (#60)? 😉 It need not use stats and frequency for the first release; it can just use infers-jsonschema...

@mhuang74
Contributor

mhuang74 commented Jan 24, 2022

Thanks for all your help with tidying up the code @jqnatividad.

One more bug-fix PR for validate: #151

Sure, this has been great for my Rust learning. Happy to pick up schema (#60) next.

@jqnatividad jqnatividad unpinned this issue Jan 27, 2022