add package schema and documentation #89
I think this will require more discussion. I'm not sure packages were even a good idea in OCDS (see CRM-4282). They add complexity without a clear and significant benefit. |
I think the only metadata that is critical to data workflows is the publisher, which can be added to the project schema (see latest discussion in open-contracting/standard#325). |
My thinking was that, for the time being, using packages is consistent with OCDS and since many implementations are likely to be in tandem with OCDS this means publishers and users only have to understand and deal with one approach to providing metadata. |
The only metadata fields here are publisher, license, and publication policy. (uri and published date are metadata about the metadata, so I'm excluding them from consideration, as we first need to determine whether the metadata is useful before determining what meta-metadata to include.) The lesson from OCDS seems to be that the publisher is "data" and belongs on the release schema (project schema here). License and publication policy metadata are important, but it isn't critical that they be distributed as data. Most open data (CSVs, etc.) have no means of declaring their license or publication policy, but this poses no significant problem to reuse – these are instead declared on the HTML pages that serve the data. Users generally only need to refer to these once, so it's not a challenge to data workflows. I don't think we should add packaging at this time simply to enable the declaration in data of licenses and publication policies. |
Aside from metadata, the other reason to have packaging is so there is a defined approach for how data on multiple projects should be published. Without a defined approach, publishers might choose their own JSON package format, use a ZIP file of individual project JSON files, or use another approach. This makes it hard for users (and tools such as CoVE) to handle published data, as they don't know what format it will be in. I discussed the pros and cons of different types of package with @kindly, but I think for the purposes of OC4IDS, consistency with the approach used in OCDS is important, hence proposing the same package format. Note also that although there will only be one version of OC4IDS at launch, we should include version in the metadata so that it is explicitly declared. This avoids future problems where tools need to make assumptions about the version of data that doesn't declare it. |
We should have a defined approach, but I'm not confident that the approach should be packages. A non-exhaustive list of alternatives is: ZIP file (OCDS anticipates zipping packages), JSON array (we've seen OCDS publishers publish JSON arrays of packages), JSON stream. I think it might take some time to agree on an approach. If we can't reach an approach in time, I think it's better to postpone the approach, than to share an approach and then change it later.
The complexity of packaging has been raised by helpdesk analysts in past OCDS retreats. We've also witnessed a "double-packaging" of JSON arrays of packages (and we even recommend a double-packaging of packages in ZIP files). There's room for improvement, and this improvement might be made as part of OCDS 1.2. If we deprecate packages in OCDS 1.2, then having packages in OC4IDS will be baggage that we then need to handle. If we sort out a better approach in the short term, then we might put that approach in OC4IDS, and then later put that approach in OCDS 1.2 (pending governance process).
Regarding versioning, this might be better handled by using the
If we absolutely need packaging in the OC4IDS beta (I wasn't aware it was a requirement – this is the first issue to deal with it), then I propose a JSON array. If we later decide OCDS-style packages are the right way to go, then for implementers, it's a question of wrapping the JSON array. A JSON array has fewer gotchas than a JSON stream (e.g. Windows vs. Unix line endings, accidental inclusion/omission of a line ending, etc.). A JSON array is also immediately usable compared to a ZIP file. |
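To illustrate the "wrapping the JSON array" point, here is a minimal Python sketch, not the schema proposed in this PR. It assumes a package is a JSON object carrying the metadata fields named in this thread (uri, published date, version, publisher) alongside the list of projects. The `projects` key, the camel-cased field names, the function names and all values are assumptions made for illustration.

```python
import json


def wrap_in_package(projects, uri, published_date, version, publisher_name):
    """Wrap a plain list of project objects in an OCDS-style package.

    All key names below are illustrative assumptions, not the agreed schema.
    """
    return {
        "uri": uri,
        "publishedDate": published_date,
        "version": version,
        "publisher": {"name": publisher_name},
        "projects": projects,  # assumed key for the wrapped array
    }


def unwrap_package(package):
    """Recover the plain JSON array from a package (metadata is dropped)."""
    return package["projects"]


# Hypothetical example data.
projects = [{"id": "example-project-1"}, {"id": "example-project-2"}]
package = wrap_in_package(
    projects,
    uri="https://example.com/projects.json",
    published_date="2019-01-01T00:00:00Z",
    version="example-version",
    publisher_name="Example Publisher",
)
print(json.dumps(package, indent=2))
```

Going from package to array is a single key lookup, and going the other way only requires the publisher to supply the metadata once, which is why the array can be a starting point without locking implementers out of packages later.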
I feel that we need some kind of approach before people start publishing against it. As long as representations have an easy way to map to each other without loss, I think starting on any approach is fine, i.e. it is easy to map a package to a JSON array also. This issue is now coming up with BODS (whose default is a JSON array), and there is reluctance to include the metadata in every statement because of the repetition and size. Some additional thoughts:
I personally think we should keep packages as they are in OCDS and then migrate to a different approach if we find it better across the board. I am not sure this project should be the testbed for different ideas here. I just do not understand the pain or complication of packages in OCDS, apart from dealing with big files, where a JSON stream would be better. Personally, I think allowing multiple representations is fine as long as they are defined and losslessly mappable (with a tool) onto each other (we already have a spreadsheet representation), but I think an OCDS-style package should be one of them. |
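As a rough illustration of "losslessly mappable" representations, the Python sketch below converts a JSON array of projects to a line-delimited JSON stream and back. It is an assumption-laden example rather than a proposed tool: the function names and sample data are hypothetical, and the reader tolerates the line-ending gotchas mentioned earlier in the thread (Windows line endings, stray blank lines).

```python
import json


def array_to_stream(projects):
    """Serialize each project on its own line (line-delimited JSON)."""
    # json.dumps escapes newlines inside strings, so each object stays on one line.
    return "".join(json.dumps(project) + "\n" for project in projects)


def stream_to_array(text):
    """Parse line-delimited JSON, ignoring blank lines and trailing '\\r'."""
    projects = []
    for line in text.split("\n"):
        if line.strip():  # skip empty lines and a missing/extra final newline
            projects.append(json.loads(line))  # json.loads ignores trailing '\r'
    return projects


# Hypothetical example data.
projects = [{"id": "example-project-1", "title": "Road\nrepair"}, {"id": "example-project-2"}]
stream = array_to_stream(projects)
round_tripped = stream_to_array(stream)
print(round_tripped == projects)
```

The round trip preserves the projects themselves. Package-level metadata has no natural home in a bare array or stream, which is the trade-off discussed above for BODS.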
I feel like there is unnecessary urgency to this discussion. If we knew packaging was in scope, it should have been brought up before the alpha. The fact that it was forgotten isn't reason enough to quickly push something into the beta. That doesn't respect our standard development principles. If there is insufficient time or agreement, we push the decision to the next release. Most data standards have no packaging specification, and yet they succeed; this isn't a live-or-die issue.
I would be in favor of closing the pull request and opening an issue, so that we can frame the discussion as, "what is the right packaging specification" or going further back to "what do we need to support bulk data access" instead of "should we push this in or not". Right now, the first questions are being influenced by the last.
Regarding OCDS-style packages: We have issues open to improve packaging documentation like open-contracting/standard#605, because we've seen through helpdesk support that implementers are confused about how to implement it. I already noted a couple examples of mis-implementation above. At the August 2018 retreat, helpdesk analysts also reported that record packages often contain a single record – which isn't an improvement over simply publishing individual records; deprecating the record package was noted as an option to consider.
Regarding this PR: Given that the number of projects by any publisher is not large, and given that having a URL for a project's data is desirable, we can have the initial publication pattern be to publish individual project files, and to offer some form of index to those files. That index can be standardized at a later date: whether it's a package, a list of URLs, or something else would be determined following our usual process without artificial urgency. |
Packaging wasn't explicitly in or out of scope for the project, and I agree it would have been better to resolve this earlier. However, I do feel that not including an approach to packaging in the beta would be an omission which we have an opportunity to address. Given the existing alignment between OC4IDS and OCDS, I do not feel that simply replicating the packaging mechanism from OCDS at this stage breaks our principles, since it is something the community are familiar with and it doesn't require any extra data collection from the publishers.
I disagree that the number of projects by publisher is not large: CoST Ukraine's portal includes more than 3,500 projects for a single procuring entity. CoST's granular definition of a project (effectively a single construction contract) means that we can expect most implementations to have large numbers of projects. Within the next couple of months we will be working on a data review tool for OC4IDS. A defined approach to packaging is unlikely to be agreed in time, given the timescales for OCDS 1.2 development, and without one, more work will be required to develop that tooling. Not having a defined approach in the documentation will lead to helpdesk analysts having to spend more time explaining possible approaches to implementers, and in the absence of any other definitive guidance they are likely to recommend the same approach as used in OCDS. I prefer to have a documented approach now, which is consistent with OCDS, and which we can change later once a broader discussion has taken place about the preferred approach(es), over creating more work and uncertainty for publishers, tool developers and helpdesk analysts. |
The operational considerations are convincing. I've suggested changes.
A couple more changes, sorry :)
After the version description is updated, feel free to merge.
Adds a project package schema for metadata, based on the OCDS release package schema.