Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specs compliance #45

Merged
merged 7 commits into from
Jul 13, 2017
Merged

Specs compliance #45

merged 7 commits into from
Jul 13, 2017

Conversation

simleo
Copy link
Member

@simleo simleo commented Jul 7, 2017

Addresses #3.

SUMMARY:

  • Adds a validation module for CMSO-specific tabular data packages. The validation process first checks that the data package is a valid tabular data package via the Frictionless API, then performs several additional checks to verify that it's also a valid CMSO data package (and implicitly defines our specs, since they have not been formalized yet).
  • Checks that the package name provided via the top-level configuration section respects the Frictionless constraints (lowercase alphanumeric characters plus ._-). Since 1) the top-level section is the only "external" source of metadata and 2) the "name" property is the only one with special requirements, this should ensure that data packages we write are always specs-compliant (barring mistakes in our code, ofc).
  • Adds a validation script and a command that validates all data packages in our examples to the Docker build (and thus to the Travis build).

TESTING:

Check that the build is green and that the newly added tests and checks have been executed in the Travis log.

NOTES:

  • The proper way to implement CMSO datapackage validation would be to define our own JSON schema for it (for instance, see the fiscal data package schema) and then simply use the standard Frictionless validation engine. However: 1) this is out of the scope of the current milestone and 2) Frictionless specs are currently undergoing a major change, so it's probably best to wait until they stabilize.

  • We are currently validating using the stable Python data package API (0.x), which uses pre-1.0 specs, but we should keep an eye on the upcoming 1.0 specs. In particular, naming the JSON file datapackage.json is going to be a requirement (see http://specs.frictionlessdata.io/data-package/).

@simleo simleo force-pushed the specs_compliance branch from 7dcb221 to 0d178b6 Compare July 7, 2017 17:53
@simleo simleo requested review from pcmasuzzo and gsergeant July 10, 2017 08:44
Copy link
Member

@sbesson sbesson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good and very nice to have a validation API that we can start consuming in https://github.com/CellMigStandOrg/CMSO-datasets.

A few questions about references and tests. Otherwise, this looks generally in agreement with the current upstream specification. Leaving @pcmasuzzo and @gsergeant to look at this in the context of the Java implementation.


import datapackage as dp
from jsontableschema import infer
from .names import OBJECTS_TABLE_NAME, LINKS_TABLE_NAME


NAME_PATTERN = re.compile(r"^[a-z0-9_.-]+$")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth a reference/link to explain the rationale for this pattern?

Copy link
Member Author

@simleo simleo Jul 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in f6f5f9b. The relevant bit is

name ... MUST be lower-case and contain only alphanumeric characters along with ".", "_" or "-" characters

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

OBJECTS_PATH = "objects.csv"
LINKS_PATH = "links.csv"
TRACKS_PATH = "tracks.csv"
JSON = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One test design question about the strategy of generating a JSON consuming biotracks.names. If these constants were to be modified in a future breaking change, these tests would keep passing although we would probably like to know that the implementation/specification change would affect the validation of existing JSON. Or would that be a different specification upgrade test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to have a separate test. This one tests the functioning of the validation engine itself, not that of individual packages. If we make breaking changes in the future, we should probably add spec version awareness to the validator and a spec upgrade test.

Copy link
Member

@gsergeant gsergeant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems straightforward, very extensive checks! 💯
Will look deeper into this for Java implementation.

Copy link
Member

@pcmasuzzo pcmasuzzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is all good!


import datapackage as dp
from jsontableschema import infer
from .names import OBJECTS_TABLE_NAME, LINKS_TABLE_NAME


NAME_PATTERN = re.compile(r"^[a-z0-9_.-]+$")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@simleo simleo merged commit 8bafe8f into CellMigStandOrg:master Jul 13, 2017
@simleo simleo deleted the specs_compliance branch July 13, 2017 09:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants