add package schema and documentation #89

duncandewhurst · 2019-02-25T16:45:44Z

Adds a project package schema for metadata, based on the OCDS release package schema.

jpmckinney · 2019-02-25T16:53:32Z

I think this will require more discussion. I'm not sure packages were even a good idea in OCDS (see CRM-4282). They add complexity without a clear and significant benefit.

jpmckinney · 2019-02-25T16:54:12Z

I think the only metadata that is critical to data workflows is the publisher, which can be added to the project schema (see latest discussion in open-contracting/standard#325).

duncandewhurst · 2019-02-25T17:03:45Z

My thinking was that, for the time being, using packages is consistent with OCDS. Since many implementations are likely to run in tandem with OCDS, this means publishers and users only have to understand and deal with one approach to providing metadata.

jpmckinney · 2019-02-25T17:47:03Z

The only metadata fields here are publisher, license, and publication policy. (uri and published date are metadata about the metadata, so I'm excluding them from consideration, as we first need to determine whether the metadata is useful before determining what meta-metadata to include.)

The lesson from OCDS seems to be that the publisher is "data" and belongs on the release schema (project schema here).

License and publication policy metadata are important, but it isn't critical that they be distributed as data. Most open data (CSVs, etc.) have no means of declaring their license or publication policy, but this poses no significant problem to reuse – these are instead declared on the HTML pages that serve the data. Users generally only need to refer to these once, so it's not a challenge to data workflows.

I don't think we should add packaging at this time simply to enable the declaration in data of licenses and publication policies.

duncandewhurst · 2019-02-27T11:35:05Z

Aside from metadata, the other reason to have packaging is so there is a defined approach for how data on multiple projects should be published. Without a defined approach publishers might choose their own JSON package format, use a ZIP file of individual project JSON files, or use another approach. This makes it hard for users (and tools such as CoVE) to handle published data, as they don't know what format it will be in.

I discussed the pros and cons of different types of package with @kindly, but I think for the purposes of OC4IDS, consistency with the approach used in OCDS is important, hence proposing the same package format.

Noting also that although there will only be one version of OC4IDS at launch, we should include version in the metadata so this can be explicitly declared to avoid future problems of tools needing to make assumptions about the version of data where it isn't declared.

jpmckinney · 2019-02-27T16:44:26Z

We should have a defined approach, but I'm not confident that the approach should be packages. A non-exhaustive list of alternatives is: ZIP file (OCDS anticipates zipping packages), JSON array (we've seen OCDS publishers publish JSON arrays of packages), JSON stream.

I think it might take some time to agree on an approach. If we can't agree on one in time, I think it's better to postpone the decision than to share an approach and then change it later.

The complexity of packaging has been raised by helpdesk analysts in past OCDS retreats. We've also witnessed a "double-packaging" of JSON arrays of packages (and we even recommend a double-packaging of packages in ZIP files). There's room for improvement, and this improvement might be made as part of OCDS 1.2.

If we deprecate packages in OCDS 1.2, then having packages in OC4IDS will be baggage that we then need to handle. If we sort out a better approach in the short term, then we might put that approach in OC4IDS, and then later put that approach in OCDS 1.2 (pending governance process).

Regarding versioning, this might be better handled by using the $schema property, which is part of JSON Schema, as discussed in open-contracting/standard#426. That property is standardized, and thus has a lot of existing tooling that understands it, and can use it to perform JSON Schema validation. As OC4IDS doesn't (yet) have an extensions mechanism, using $schema has fewer edge cases to consider than in OCDS.
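As a rough illustration only, here is a minimal Python sketch of how a tool might read a version from a `$schema` property. It assumes each project object carries a `$schema` URL that embeds a major.minor version; the URL pattern, the `example.org` host, and the `declared_version` helper are all hypothetical and not part of OC4IDS or OCDS.

```python
import re

# Hypothetical URL layout: .../<major.minor>/project-schema.json
# This pattern is an assumption for illustration, not a published OC4IDS URL.
SCHEMA_URL = re.compile(r"/(?P<version>\d+\.\d+)/project-schema\.json$")


def declared_version(project):
    """Return the version embedded in the project's $schema URL, or None."""
    url = project.get("$schema")
    if not isinstance(url, str):
        return None  # no declaration: the tool would have to guess
    match = SCHEMA_URL.search(url)
    return match.group("version") if match else None


example = {
    "$schema": "https://example.org/oc4ids/0.9/project-schema.json",
    "id": "oc4ids-example-1",
}
print(declared_version(example))
```

A tool could then select the matching schema for validation, and treat a missing or unrecognized `$schema` as an undeclared version.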

If we absolutely need packaging in the OC4IDS beta (I wasn't aware it was a requirement – this is the first issue to deal with it), then I propose a JSON array. If we later decide OCDS-style packages are the right way to go, then for implementers, it's a question of wrapping the JSON array. A JSON array has fewer gotchas than a JSON stream (e.g. Windows vs. Unix line endings, accidental inclusion/omission of a line ending, etc.). A JSON array is also immediately usable compared to a ZIP file.
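To make the comparison concrete, here is a hedged Python sketch of reading the same two projects from a JSON array and from a JSON stream. It assumes "JSON stream" means one JSON object per line; the function names and the sample identifiers are made up for illustration.

```python
import json


def parse_json_array(text):
    """Parse a JSON array of projects with a single json.loads call."""
    projects = json.loads(text)
    if not isinstance(projects, list):
        raise ValueError("expected a JSON array at the top level")
    return projects


def parse_json_stream(text):
    """Parse one JSON object per line, tolerating CRLF and blank lines."""
    projects = []
    # Split on "\n" only: str.splitlines() would also split on characters
    # such as U+2028, which may appear unescaped inside JSON strings.
    for line in text.split("\n"):
        line = line.strip()  # drops a trailing "\r" from Windows endings
        if not line:
            continue  # accidental blank line or trailing newline
        projects.append(json.loads(line))
    return projects


array_text = '[{"id": "oc4ids-example-1"}, {"id": "oc4ids-example-2"}]'
stream_text = '{"id": "oc4ids-example-1"}\r\n\r\n{"id": "oc4ids-example-2"}\r\n'

assert parse_json_array(array_text) == parse_json_stream(stream_text)
```

The array needs no line handling at all, whereas the stream reader has to make explicit choices about line endings and blank lines.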

kindly · 2019-02-28T09:23:28Z

I feel that we need some kind of approach before people start publishing against it.
I think having something (even if not ideal) is better than having nothing; we do not want the publishers to have more uncertainty than they need.

As long as representations have an easy way to map to each other without loss, then I think starting on any approach is fine, i.e. it is also easy to map a package to a JSON array.
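A minimal Python sketch of that mapping, under stated assumptions: the package is a JSON object whose `projects` key holds the array, and every other key is metadata. The key names and sample values are illustrative, not taken from the schema in this pull request.

```python
def package_to_array(package):
    """Split a package into (metadata, list of projects)."""
    metadata = {key: value for key, value in package.items() if key != "projects"}
    return metadata, list(package.get("projects", []))


def array_to_package(projects, metadata):
    """Wrap a JSON array of projects in a package with the given metadata."""
    return {**metadata, "projects": list(projects)}


# Illustrative package; field names and values are assumptions.
package = {
    "version": "0.9",
    "publisher": {"name": "Example publisher"},
    "license": "https://example.org/license",
    "projects": [{"id": "oc4ids-example-1"}, {"id": "oc4ids-example-2"}],
}

metadata, projects = package_to_array(package)
assert array_to_package(projects, metadata) == package
```

The round trip is only lossless if the metadata travels alongside the array; a bare array on its own drops it.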

This issue is now coming up with BODS (whose default is a JSON array) and there is reluctance to repeat the metadata in every statement due to repetition and size.

Some additional thoughts:

I agree that publisher is better to have embedded.
The $schema property is only useful for versioning if the top level is an object, and I am not convinced about its use anyway.
I think license in the data is better. Mainly as in the long run it makes it easier for data aggregators to process data across publishers.

I personally think we should keep with packages like they are in OCDS and then migrate to a different approach if we find it better across the board. I am not sure this project should be the testbed for different ideas here.
Changing it will make any tooling (e.g. a validator) or concept (e.g. extensions) more work.

I just do not understand the pain or complication of packages in OCDS apart from dealing with big files where a JSON Stream would be better.

Personally I think multiple allowed representations are fine as long as they are defined and losslessly mappable (with a tool) onto each other (we already have a spreadsheet representation), but I think an OCDS-style package should be one of them.

jpmckinney · 2019-02-28T16:16:57Z

I feel like there is unnecessary urgency to this discussion. If we knew packaging was in scope, it should have been brought up before the alpha. The fact that it was forgotten isn't reason enough to quickly push something into the beta. That doesn't respect our standard development principles. If there is insufficient time or agreement, we push the decision to the next release. Most data standards have no packaging specification, and yet they succeed; this isn't a live-or-die issue.

I would be in favor of closing the pull request and opening an issue, so that we can frame the discussion as, "what is the right packaging specification" or going further back to "what do we need to support bulk data access" instead of "should we push this in or not". Right now, the first questions are being influenced by the last.

Regarding OCDS-style packages: We have issues open to improve packaging documentation like open-contracting/standard#605, because we've seen through helpdesk support that implementers are confused about how to implement it. I already noted a couple examples of mis-implementation above. At the August 2018 retreat, helpdesk analysts also reported that record packages often contain a single record – which isn't an improvement over simply publishing individual records; deprecating the record package was noted as an option to consider.

Regarding this PR: Given that the number of projects by any publisher is not large, and given that having a URL for a project's data is desirable, we can have the initial publication pattern be to publish individual project files, and to offer some form of index to those files. That index can be standardized at a later date: whether it's a package, a list of URLs, or something else would be determined following our usual process without artificial urgency.

duncandewhurst · 2019-03-06T11:06:34Z

> I feel like there is unnecessary urgency to this discussion. If we knew packaging was in scope, it should have been brought up before the alpha. The fact that it was forgotten isn't reason enough to quickly push something into the beta. That doesn't respect our standard development principles.

Packaging wasn't explicitly in or out of scope for the project, and I agree it would have been better to resolve this earlier. However, I do feel that not including an approach to packaging in the beta would be an omission, which we have an opportunity to address.

Given the existing alignment between OC4IDS and OCDS I do not feel that simply replicating the packaging mechanism from OCDS at this stage breaks our principles, since it is something the community are familiar with and it doesn't require any extra data collection from the publishers.

> Given that the number of projects by any publisher is not large, and given that having a URL for a project's data is desirable, we can have the initial publication pattern be to publish individual project files, and to offer some form of index to those files.

I disagree that the number of projects by publisher is not large: CoST Ukraine's portal includes more than 3,500 projects for a single procuring entity. CoST's granular definition of a project (effectively a single construction contract) means that we can expect most implementations to have large numbers of projects.

Within the next couple of months we will be working on a data review tool for OC4IDS. A defined approach to packaging is unlikely to be agreed in time, given the timescales for OCDS 1.2 development, and without one more work will be required to develop that tooling.

Not having a defined approach in the documentation will lead to helpdesk analysts having to spend more time explaining possible approaches to implementers and in the absence of any other definitive guidance they are likely to recommend the same approach as used in OCDS.

I prefer to have a documented approach now, which is consistent with OCDS, and which we can change later once a broader discussion has taken place about the preferred approach(es), over creating more work and uncertainty for publishers, tool developers and helpdesk analysts.

jpmckinney

The operational considerations are convincing. I've suggested changes.

docs/reference/index.md

docs/reference/project_package.md

schema/project-level/project-package-schema.json

jpmckinney

A couple more changes, sorry :)

After the version description is updated, feel free to merge.

docs/reference/index.md

docs/reference/project_package.md

add package schema and documentation

947eef8

duncandewhurst added the schema and documentation labels Feb 25, 2019

duncandewhurst added this to the Beta milestone Feb 25, 2019

duncandewhurst requested a review from jpmckinney February 25, 2019 16:45

ScatteredInk mentioned this pull request Feb 28, 2019

Metadata openownership/data-standard#135

Closed

jpmckinney reviewed Mar 7, 2019

duncandewhurst added 2 commits March 7, 2019 15:44

update packaging documentation

381b67b

package schema fixes, add version key

3e4ef5d

duncandewhurst commented Mar 7, 2019

schema/project-level/project-package-schema.json

duncandewhurst mentioned this pull request Mar 8, 2019

Update worked example #47

Closed

jpmckinney reviewed Mar 9, 2019

refine language in project packaging documentation

9c694db

jpmckinney reviewed Mar 11, 2019

docs/reference/index.md

jpmckinney reviewed Mar 11, 2019

docs/reference/project_package.md

duncandewhurst added 3 commits March 11, 2019 14:46

update version key to use major.minor format, update schema id to reflect version number

85a38bb

rename project_package.md, update language in reference.md

c7df5e8

merge master into branch

f02b112

duncandewhurst merged commit f175f7c into master Mar 11, 2019

duncandewhurst deleted the package-metadata branch March 11, 2019 15:15

duncandewhurst mentioned this pull request Mar 12, 2019

Update blank JSON file to include project package and object identifiers #114

Closed

This was referenced Oct 6, 2020

Add describedby field for the extended release schema open-contracting/standard#426

Open

Deprecate remaining package metadata and add bulk data format open-contracting/standard#1084

Open
