Create schema command #60

Closed
jqnatividad opened this issue Oct 2, 2021 · 38 comments

@jqnatividad (Collaborator)

stats does a great job of not only computing descriptive statistics for a CSV, it also infers each column's data type.
frequency compiles a frequency table.

The schema command will use the output of stats, and optionally frequency (to specify the valid range of a field), to create a JSON Schema file that can be used with the validate command (#46) to validate a CSV against the generated schema.

With the addition of the schema and validate combo, qsv can be used in a more bullet-proof automated data pipeline that fails gracefully when there are data quality issues:

  • use schema to create a JSON Schema from a representative CSV file for a feed
  • adjust the schema to fine-tune the validation rules
  • use validate at the beginning of a data pipeline and fail gracefully when validate fails (see the sketch after this list)
  • for extra large files, use sample to validate against a sample
  • or alternatively, partition the CSV to break down the pipeline into smaller jobs
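
For folks skimming this issue, here is a minimal sketch of what that fail-gracefully step could look like when driven from Rust. It only assumes qsv is on the PATH and uses the validate invocation discussed later in this thread; file names are illustrative.

use std::process::Command;

fn main() {
    // Validate the incoming CSV against a previously generated schema
    // before any downstream step runs (file names are illustrative).
    let status = Command::new("qsv")
        .args(["validate", "incoming.csv", "incoming.csv.schema.json"])
        .status()
        .expect("failed to launch qsv");

    if !status.success() {
        // Fail gracefully: stop the pipeline instead of processing bad data.
        eprintln!("validation failed; aborting pipeline");
        std::process::exit(1);
    }

    // ...continue with the rest of the pipeline...
}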
@jqnatividad jqnatividad added the enhancement New feature or request. Once marked with this label, it's in the backlog. label Oct 2, 2021
@jqnatividad jqnatividad added this to the 0.17.0 milestone Oct 2, 2021
@jqnatividad jqnatividad modified the milestones: 0.17.0, 0.18.0 Oct 13, 2021
@github-actions

Stale issue message

@jqnatividad (Collaborator, Author) commented Jan 10, 2022

schema is part of the validate suite of commands - #46

@jqnatividad (Collaborator, Author)

Hi @mhuang74, I'm assigning this to you as you're on a roll! 😉

For the first cut, we don't need to use stats and frequency to add the range/domain/data type to the inferred schema. Have you decided which library to use? https://github.com/Stranger6667/infers-jsonschema seems to be the best one as it's written by the same author as jsonschema-rs.

@mhuang74 (Contributor)

Thanks @jqnatividad. Actually, stats and frequency are still the right approach. I will try to leverage existing code by making them public.

infers-jsonschema requires properly typed JSON in the first place, and constructing that still requires type inference based on parsing and counting.

@mhuang74 mhuang74 mentioned this issue Feb 2, 2022
@jqnatividad jqnatividad added the WIP work in progress label Feb 2, 2022
@mhuang74 (Contributor) commented Feb 2, 2022

Hi @jqnatividad, thanks for merging the POC. The first cut only supports one out of the many validation vocabularies that JSON Schema supports.

To support more in a manner that would be useful and practical, perhaps each could be turned into an option that requires specific column selection, and the schema command can fill in the validation constraints based on data values.

qsv schema mydata.csv --enum col3 --maxLength col5

Thoughts?

6.1. Validation Keywords for Any Instance Type

6.1.1. type
6.1.2. enum
6.1.3. const

6.2. Validation Keywords for Numeric Instances (number and integer)

6.2.1. multipleOf
6.2.2. maximum
6.2.3. exclusiveMaximum
6.2.4. minimum
6.2.5. exclusiveMinimum

6.3. Validation Keywords for Strings

6.3.1. maxLength
6.3.2. minLength
6.3.3. pattern

@mhuang74 (Contributor) commented Feb 2, 2022

@jqnatividad for the pattern validation keyword, perhaps it would be interesting to use grex to generate a regex based on the CSV values?
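
Something along these lines, perhaps — a minimal sketch assuming grex's RegExpBuilder API, with a hypothetical sample of column values:

use grex::RegExpBuilder;

fn main() {
    // Hypothetical sample of values taken from one CSV column.
    let samples = ["07/07/2014", "01/12/2015", "23/03/2016 00:00"];

    // Let grex derive a regex that matches these value shapes; the result
    // could then be emitted as the column's `pattern` constraint.
    let pattern: String = RegExpBuilder::from(&samples).build();
    println!("{}", pattern);
}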

@jqnatividad (Collaborator, Author)

stats and frequency create a lot of descriptive statistics we can readily leverage for these validation keywords. We can directly map:

  • stats type to type
  • jschema const if the stats cardinality of a field is 1
  • jschema enum if the cardinality of a field is below a certain threshold (and then we can get the list of enums from frequency)
  • stats min/max to minimum/maximum
  • stats min/max_length to maxLength/minLength

and grex would be perfect for pattern!

We can even have pre-baked, optimized regexes for certain date formats (ISO 8601/RFC 3339) and support the spec-defined date formats.
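
To make the mapping above concrete, here is a rough, hypothetical sketch using serde_json. The field names follow the stats output discussed in this thread; the enum threshold, struct, and helper names are made up for illustration and are not the actual qsv code.

use serde_json::{json, Value};

// Hypothetical summary of one column, as produced by `stats` and `frequency`.
struct FieldStats {
    typ: String,             // "String", "Integer", "Float", ...
    min: Option<f64>,        // numeric min/max, when the column is numeric
    max: Option<f64>,
    min_length: Option<u64>,
    max_length: Option<u64>,
    cardinality: u64,
    top_values: Vec<String>, // from `frequency`, used for enum/const
}

fn field_schema(stats: &FieldStats, enum_threshold: u64) -> Value {
    let mut schema = json!({ "type": map_type(&stats.typ) });
    let obj = schema.as_object_mut().unwrap();

    if stats.cardinality == 1 {
        // A single distinct value maps to `const`.
        obj.insert("const".into(), json!(stats.top_values.first()));
    } else if stats.cardinality <= enum_threshold {
        // Low cardinality maps to an `enum` taken from the frequency table.
        obj.insert("enum".into(), json!(stats.top_values));
    }
    if let (Some(min), Some(max)) = (stats.min, stats.max) {
        obj.insert("minimum".into(), json!(min));
        obj.insert("maximum".into(), json!(max));
    }
    if let (Some(min_len), Some(max_len)) = (stats.min_length, stats.max_length) {
        obj.insert("minLength".into(), json!(min_len));
        obj.insert("maxLength".into(), json!(max_len));
    }
    schema
}

fn map_type(stats_type: &str) -> &'static str {
    match stats_type {
        "Integer" => "integer",
        "Float" => "number",
        _ => "string",
    }
}

fn main() {
    // Roughly modelled on the Incident Zip column discussed later in the thread.
    let zip = FieldStats {
        typ: "String".into(),
        min: None,
        max: None,
        min_length: Some(0),
        max_length: Some(10),
        cardinality: 535,
        top_values: vec![],
    };
    println!("{}", field_schema(&zip, 30));
}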

This really makes qsv's schema/validate combo very compelling!

And yes, within reason, having options to fine-tune these settings would be great, but at the end of the day, the user can always manually tweak the resulting jsonschema, so having sensible defaults would be OK.

Then again, docopt makes it easy to expose these defaults to the user, so I'll leave it to you.

Awesome work as usual @mhuang74 ! Really excited by the POC even in its present state.

@jqnatividad (Collaborator, Author) commented Feb 3, 2022

Even multipleOf can be inferred using stats.

For an integer data type, if min * cardinality = max and rowcount = cardinality, then you can set multipleOf to min.
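
As a concrete illustration: a column holding 5, 10, 15, 20 has min = 5, max = 20, cardinality = 4 = rowcount, and min × cardinality = max, so multipleOf: 5 would be suggested. A minimal sketch of that check (a heuristic only, and the helper name is hypothetical):

/// Heuristic described above: for an integer column, suggest
/// `multipleOf: min` when min * cardinality == max and
/// rowcount == cardinality. It is a hint, not a proof -- the
/// generated schema can always be hand-edited afterwards.
fn infer_multiple_of(min: i64, max: i64, cardinality: u64, rowcount: u64) -> Option<i64> {
    if cardinality == rowcount && min.checked_mul(cardinality as i64) == Some(max) {
        Some(min)
    } else {
        None
    }
}

fn main() {
    // Example: a column holding 5, 10, 15, 20.
    assert_eq!(infer_multiple_of(5, 20, 4, 4), Some(5));
    assert_eq!(infer_multiple_of(5, 21, 4, 4), None);
}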

@mhuang74 (Contributor) commented Feb 6, 2022

@jqnatividad Here's another iteration of schema which leverages cmd::stats. Please let me know if the integration approach looks okay.

#158

Also, here's the current docopt. Please note:

  • I kept the original flow and made calling stats optional. If there is no need to add value constraints, then it should run much quicker without invoking cmd::stats.
  • generating pattern would be expensive, so I added an option to limit it to selected columns
  • took out multipleOf because it seems people usually already know whether data should be in multiples of 10, 100, 1000, etc. Calculating the greatest common divisor for data validation seems like an edge case?
Generate JSON Schema from CSV data.

This command generates reference JSON Schema (Draft 7) from CSV data, 
including simple validation rules based on data statistics.
Please verify and adjust the Type, Value Constraints, and Required Fields as appropriate.

Example output file from `mydata.csv`. If piped from stdin, then filename is `stdin.csv`.

* mydata.csv.schema.json

Usage:
    qsv schema [options] [<input>]

Schema options:
    --value-constraints        Add value constraints based on CSV data (enum, min, max, minLength, maxLength)
    --enum-threshold NUM       Cardinality threshold for adding enum constraints [default: 30]
    --pattern-columns <args>   Select columns to add pattern constraints [default: none]

@jqnatividad (Collaborator, Author) commented Feb 6, 2022

Excellent work as usual @mhuang74 !

I like how you do a full round-tripping test with schema and validate, and how you organized the integration tests to use a shared CSV.

One thing that just occurred to me: given how large and verbose the JSON Schema and CSV files are for testing schema and validate, we may want to actually save them as discrete files and load them during the integration tests. The good thing about this approach is that the same test files can be used for demos. WDYT?

As for docopt, I agree with your choices, though you may want to also set const when cardinality = 1.

Regardless, I'm merging #158 just in time for the weekly release. 😄

jqnatividad added a commit that referenced this issue Feb 6, 2022
`schema`: add value constraints via stats. Continue iterating on #60 ...
@mhuang74 (Contributor) commented Feb 6, 2022

Thanks @jqnatividad. One little PR before heading to bed. Wanted to make the command description clearer.

#159

@jqnatividad (Collaborator, Author)

Hi @mhuang74,
I was testing schema and validate against https://raw.githubusercontent.com/wiki/jqnatividad/qsv/files/NYC_311_SR_2010-2020-sample-1M.7z and round-tripping validation was not working (i.e., create the schema from the CSV, then validate the same CSV using the generated schema).

I managed to get it to validate by removing the enum property when it only contains null. Shouldn't that be the norm?

Also, all the fields are added as required. Can we also infer required as a value constraint only when a field's nullcount = 0?

@mhuang74 (Contributor) commented Feb 10, 2022

Thanks @jqnatividad for catching this bug.

#163

All fields are required because all columns are expected in the CSV. But required columns may still have no value (denoted by "null" in the type list).

I am running on an old machine, but it's pretty clear that validate could use a performance boost.

$ head -50000 NYC_311_SR_2010-2020-sample-1M.csv > NYC-short.csv
$ qsvlite index NYC-short.csv
$ time qsvlite schema NYC-short.csv --value-constraints --enum-threshold=25
Schema written to NYC-short.csv.schema.json

real	1m10.350s
user	2m59.430s
sys	0m3.241s
$ time qsvlite validate NYC-short.csv NYC-short.csv.schema.json 
[00:03:45] [==================== 100% validated 49,999 records.] (222/sec)
0 out of 49,999 records invalid.

real	3m45.273s
user	3m44.656s
sys	0m0.360s

@jqnatividad (Collaborator, Author) commented Feb 10, 2022

Thanks for the quick fix @mhuang74 ... just merged #163.

And thanks for clarifying required as I was still thinking in the tabular data context.

Speaking of which, I haven't found a good, performant implementation of JSON Table Schema - hopefully, your work here can be the foundation of a de facto jsonschema.rs based implementation, which we can later adapt as per Stranger6667/jsonschema#339.

As for performance, on my fairly recent Ryzen 7 4800H 2.9 GHz laptop with 32 GB RAM, it's much faster even though I'm running it in an Oracle VirtualBox VM (Ubuntu 20.04 LTS allocated 12 GB):

$ time qsvlite schema NYC-short.csv --value-constraints --enum-threshold=25
Schema written to NYC-short.csv.schema.json

real	0m2.275s
user	0m4.166s
sys	0m1.043s
$ time qsvlite validate NYC-short.csv NYC-short.csv.schema.json 
[00:00:05] [==================== 100% validated 49,999 records.] (9,954/sec)
0 out of 49,999 records invalid.

real	0m5.029s
user	0m4.987s
sys	0m0.040s

IMHO, schema's performance is good enough as it wouldn't be run that often (typically only when setting up a data pipeline), given that it comprehensively scans a CSV to compile descriptive statistics.

But validate will definitely benefit from some multi-threading, as it will be used much more often, at the beginning of a data pipeline.

@mhuang74 (Contributor) commented Feb 10, 2022

Not sure what is going on, but performance can be significantly different on my laptop (Intel Core i3 M 370 @ 2.40 GHz, 2 cores with Hyper-Threading to 4; 8 GB RAM). Now I am getting decent performance, still using a debug build.

Opened #164 to improve validate performance.

I will take this chance to play with the performance test suite and add validate to it.

$ time qsvlite schema NYC-short.csv --value-constraints --enum-threshold=25
Schema written to NYC-short.csv.schema.json

real	0m7.837s
user	0m13.619s
sys	0m3.647s
$ time qsvlite validate NYC-short.csv NYC-short.csv.schema.json 
[00:00:08] [==================== 100% validated 49,999 records.] (5,984/sec)
0 out of 49,999 records invalid.

real	0m8.465s
user	0m8.308s
sys	0m0.064s

@jqnatividad (Collaborator, Author)

Even though validate's baseline performance is already good, this is where Rust should really shine once you start optimizing.

And yes, my numbers are using the release build.

@mhuang74 (Contributor) commented Feb 10, 2022

@jqnatividad I ran schema without --value-constraints, and the resultant schema caused validate to abort on row 4017. Running with --value-constraints took twice as long, but validate did not flag any invalid records. Should we optimize for a schema that accepts all values (including missing values), or one that highlights missing values?

  • since the performance of running with --value-constraints seems acceptable, and schema won't be re-generated often, I could make this the default and drop the simple frequency counting approach? But it also hides the missing-values problem.
  • in this case, the Incident Zip column had mostly integers but was sometimes blank, so frequency counting ended up with an integer type without the option to be null. This limitation probably means the resultant schema requires manual editing, which could be annoying. On the other hand, it does highlight missing values right off the bat.

A release build was used for the timings below.

without --value-constraints.

$ time qsvlite schema NYC_311_SR_2010-2020-sample-1M.csv
Schema written to NYC_311_SR_2010-2020-sample-1M.csv.schema.json

real	1m4.286s
user	0m55.389s
sys	0m8.879s
$ time qsvlite validate NYC_311_SR_2010-2020-sample-1M.csv NYC_311_SR_2010-2020-sample-1M.csv.schema.json 
Unable to convert CSV to json. row: 4017, error: Can't cast into Integer. header: Incident Zip, value: N/A, json type: integer

real	0m0.437s
user	0m0.433s
sys	0m0.004s

with --value-constraints

mhuang@twisted-linen:~/Downloads/tmp $ time qsvlite schema NYC_311_SR_2010-2020-sample-1M.csv --value-constraints --enum-threshold=100
Schema written to NYC_311_SR_2010-2020-sample-1M.csv.schema.json

real	2m30.489s
user	2m9.383s
sys	0m20.876s
mhuang@twisted-linen:~/Downloads/tmp $ less NYC_311_SR_2010-2020-sample-1M.csv.schema.json
mhuang@twisted-linen:~/Downloads/tmp $ time qsvlite validate NYC_311_SR_2010-2020-sample-1M.csv NYC_311_SR_2010-2020-sample-1M.csv.schema.json 
[00:02:18] [==================== 100% validated 1,000,000 records.] (7,235/sec)
0 out of 1,000,000 records invalid.

real	2m18.356s
user	2m17.043s
sys	0m1.052s

@jqnatividad (Collaborator, Author)

I suggest making --value-constraints the default and dropping the simple_frequency approach.

As for missing values, can't we leverage stats nullcount for that?

Running qsvlite stats --everything NYC_311_SR_2010-2020-sample-1M.csv > nyc1mstats.csv, we get:

nyc1mstats.csv

For Incident Zip, we get the following stats:

| field | type | sum | min | max | min_length | max_length | mean | stddev | variance | lower_fence | q1 | q2_median | q3 | iqr | upper_fence | skew | mode | cardinality | nullcount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Incident Zip | String | | * | XXXXX | 0 | 10 | | | | 8934 | 10314 | 11205 | 11234 | 920 | 12614 | -1.98298 | | 535 | 54978 |

@mhuang74 (Contributor)

Thanks @jqnatividad. nullcount is actually already being used to determine whether to inject a "null" type and a null enum entry for columns that have missing values.
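
For readers following the thread, a rough sketch of that kind of nullcount-driven adjustment (hypothetical helper, not the actual qsv code):

use serde_json::{json, Value};

/// Sketch: widen a field's schema when `stats` reports missing values.
fn apply_nullcount(field_schema: &mut Value, nullcount: u64) {
    if nullcount == 0 {
        return;
    }
    if let Some(obj) = field_schema.as_object_mut() {
        // Inject "null" into the type list so empty cells validate.
        if let Some(Value::Array(types)) = obj.get_mut("type") {
            types.push(json!("null"));
        }
        // If an enum list is present, allow null there too; an enum that
        // would contain only null is better dropped entirely.
        if let Some(Value::Array(choices)) = obj.get_mut("enum") {
            choices.push(Value::Null);
        }
    }
}

fn main() {
    let mut schema = json!({ "type": ["integer"], "enum": [10001, 10314] });
    apply_nullcount(&mut schema, 54_978);
    println!("{}", schema);
}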

BTW, I came across interesting documentation for doing the describe-extract-validate-transform steps with the frictionless tool.

@jqnatividad (Collaborator, Author)

OK @mhuang74, so missing values are handled with --value-constraints, and the point is moot with simple_frequency going away.

And yes, I'm a big fan of the frictionless data project, as I'm heavily involved in open data and the CKAN project in particular :).

However, most of their tooling is in Python and not as performant... but if we can learn anything from the project, we should shamelessly borrow :)

@mhuang74 (Contributor)

@jqnatividad Integrated grex with PR #168. It's probably helpful in some scenarios, though it lacks the smarts to recognize common formats. Let me know what you think.

@jqnatividad (Collaborator, Author)

Great @mhuang74! As for lack of smarts, it's certainly not for lack of great code :)

I'm sure we can iterate and integrate some lightweight heuristics to discern common formats - perhaps a combination of column-name rules (columns ending with date, dt, mail, etc.), precompiled regexes, etc.

Will do some testing on my end and let you know...

jqnatividad added a commit that referenced this issue Feb 15, 2022
`schema` (#60): pattern constraint for string types.

BTW, good call on the `enum` heuristic,
@jqnatividad (Collaborator, Author) commented Feb 22, 2022

@mhuang74 we helped with a Smart City pilot last year, where there were a lot of data quality issues coming from the IoT sensors.

I ended up using jsonschema there as well to validate the input files, but it was in Python, so it was nowhere near as performant as qsv validate.

Interestingly, a lot of the validation errors were on the timestamp field, as different sensors were sending them in varying formats; sometimes the format even changed for the same vendor/sensor as they tweaked the configuration.

As you know, I used dateparser in stats to infer the Date type, and it can infer dates in various formats.

I'm thinking of extending stats so it can infer two additional date data types - date-time and time - so we can implement Date and Time support per the jsonschema spec (RFC 3339 and all date/time attributes except duration, which we can skip for now as I don't really see it being widely used).

In this way, schema (which uses stats - for the benefit of folks reading this thread) can just directly add the corresponding date/time regex pattern without having to invoke grex. We also end up with a more standards-compliant jsonschema.

WDYT?

@mhuang74 (Contributor) commented Feb 23, 2022

@jqnatividad I think this is definitely worth looking into. Maybe the adur public toilets dataset is not representative, but it has date stamps that fail the "date" constraint of JSON Schema. So in this case, dateparser may infer that the ExtractDate column is a date and add the date constraint to the schema, but the entire file would then fail validation?

Adding format="date" when validating adur public toilets results in errors:

2       ExtractDate     "07/07/2014" is not a "date"
4       ExtractDate     "07/07/2014 00:00" is not a "date"

expected date stamp format

2014-07-07

schema used

    "ExtractDate": {
      "description": "ExtractDate column from adur-public-toilets.csv",
      "type": [
        "string"
      ],
      "format": "date"
    },

@jqnatividad (Collaborator, Author) commented Feb 23, 2022

@mhuang74 Yes. It SHOULD fail per jsonschema spec as the date is not in RFC3339 format.

But in the real world, folks often use their locale's date format. Perhaps we should add a schema --laxdate option so dates that are not in RFC 3339 format are just interpreted as strings with the grex-inferred pattern.

    "ExtractDate": {
      "description": "ExtractDate column from adur-public-toilets.csv",
      "type": [
        "string"
      ],
      "pattern": "^(?:\d{2}/){2}\d{4}(?: \d{2}:\d{2})?$"
    },

so the adur public toilets dataset (which is representative of the real world) will not fail validation.

One reason I added the datefmt subcommand to apply is to allow folks to quickly convert any recognized date to RFC 3339, so dates can be normalized in a data pipeline.

Still, some users may prefer not to transform dates to RFC 3339, and this is where --laxdate will be useful. WDYT?

@mhuang74 (Contributor) commented Feb 24, 2022

@jqnatividad Yes, I agree a switch to specify strictness of date format would be quite useful. My preference is to flip it around and have a --strict-dates flag.

I prefer to have the schema command do what's hard for me to do manually: scan all the values and derive something helpful (e.g. min, max, enum list, etc.).

But if I want to enforce some rules that may or may not fit existing values, I would probably just hand-edit the schema file. But dates are so common that it may be nice to have a flag to simplify the process.

If the --strict-dates flag is set, then always add the format="date" constraint to fields of type "date".

If not, then only add format="date" when all date values are RFC3339 compliant.

And similarly for "datetime" and "time" types.

But that's just my preference. I think either way would work well.

@jqnatividad (Collaborator, Author) commented Feb 24, 2022

@mhuang74 I agree. A better sensible default is to allow "lax dates" and have --strict-dates as an option.

I'll extend stats this weekend to infer the additional date data types (#174), so we can also add format="date-time" and format="time" when --strict-dates is set.
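
A minimal sketch of the agreed behaviour, for illustration: the format keyword is only emitted when --strict-dates is set, so locale-formatted dates keep validating as plain strings by default. The function and type names are hypothetical, not the actual qsv code.

/// Only attach a `format` keyword to date-like columns when --strict-dates
/// is set; otherwise the column stays a plain string (optionally with a
/// grex-derived pattern, as discussed earlier in this thread).
fn date_format_keyword(stats_type: &str, strict_dates: bool) -> Option<&'static str> {
    if !strict_dates {
        return None;
    }
    match stats_type {
        "Date" => Some("date"),          // RFC 3339 date, e.g. 2014-07-07
        "DateTime" => Some("date-time"), // RFC 3339 date with time component
        _ => None,
    }
}

fn main() {
    assert_eq!(date_format_keyword("Date", true), Some("date"));
    assert_eq!(date_format_keyword("Date", false), None);
    assert_eq!(date_format_keyword("String", true), None);
}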

@jqnatividad (Collaborator, Author)

Hi @mhuang74, I added support for inferring DateTime (RFC 3339 with a time component), differentiating it from the Date data type (RFC 3339 without a time component).

I decided to skip inferring time as I don't really see it being used IRL, and I wanted to minimize the performance impact on stats.

Do you want to take on implementing --strict-dates? 😉

@jqnatividad jqnatividad removed the WIP work in progress label Feb 28, 2022
@mhuang74 (Contributor) commented Mar 3, 2022

Thanks @jqnatividad for merging PR #177. I noticed funkiness with WorkDir:

  • temp test directories don't get deleted
  • directories are not truly unique between runs, so you may see output from previous runs, especially if only a small number of tests is run (e.g. cargo test schema --tests)
  • I added a method to remove the contents of WorkDir. So far it's only used in test_schema. Without clearing, the test_schema tests that don't expect any error output files would sometimes fail
  • sometimes the outdated index test fails for me; I suspect it may be due to the same root cause

Notice the two sets of timestamps in this test dir:

failures:
    test_schema::generate_schema_with_defaults_and_validate_with_no_errors

test result: FAILED. 3 passed; 1 failed; 0 ignored; 0 measured; 649 filtered out; finished in 0.23s

error: test failed, to rerun pass '--test tests'
(base) ➜  qsv git:(schema_strict_dates) ✗ ll /home/mhuang/Projects/rust/qsv/target/debug/xit/schema/test-0/
total 44K
-rw-rw-r-- 1 mhuang mhuang 9.1K Mar  3 20:14 adur-public-toilets.csv
-rw-rw-r-- 1 mhuang mhuang 8.5K Mar  3 16:24 adur-public-toilets.csv.invalid
-rw-rw-r-- 1 mhuang mhuang  11K Mar  3 20:14 adur-public-toilets.csv.schema.json
-rw-rw-r-- 1 mhuang mhuang 1012 Mar  3 16:24 adur-public-toilets.csv.valid
-rw-rw-r-- 1 mhuang mhuang  785 Mar  3 16:24 adur-public-toilets.csv.validation-errors.tsv

@jqnatividad (Collaborator, Author)

Thanks @mhuang74 for quickly implementing --strict-dates and fixing WorkDir as well!

Just in time for the NYC Open Data Week presentation this Saturday 😄

Will be sure to send you a draft of the preso before then... https://nycsodata22.sched.com/

BTW, I noticed a big performance regression with stats; perhaps some of my optimization tweaks did the opposite.

Will fix it next week...

@jqnatividad (Collaborator, Author) commented Mar 3, 2022

And with --strict-dates in place, I think we can now close this issue.

Thanks heaps @mhuang74!

@mhuang74 (Contributor) commented Mar 3, 2022

The conference talk on qsv sounds exciting. Looking forward to your presentation @jqnatividad. Thanks for closing the issue.

@jqnatividad (Collaborator, Author)

Hi @mhuang74, I didn't get a chance to send it to you before the talk (was working on it till the last minute), but I linked it on the README... 😄

@mhuang74 (Contributor) commented Mar 9, 2022 via email
