Metadata #135
Comments
Some useful points to draw on from this discussion in OC4I: open-contracting/infrastructure#89. In the last conversation we had about this, we talked about a pattern of embedding metadata in the first object in the array and then assuming that everything that followed had the same metadata, until a new metadata object appeared. |
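To make the cascade pattern described above concrete, here is a minimal sketch of how a consumer might resolve it. It assumes, purely for illustration, that a metadata object is any list item with a `metadata` key; neither that key nor the function name comes from the BODS schema.

```python
def apply_cascaded_metadata(items):
    """Attach the most recently seen metadata object to each following statement.

    Assumption: a metadata object is any item carrying a "metadata" key.
    This marker is hypothetical and is not part of the BODS schema.
    """
    current = None
    resolved = []
    for item in items:
        if "metadata" in item:
            # A new metadata object replaces the one in force so far.
            current = item["metadata"]
            continue
        if current is None:
            raise ValueError("statement appears before any metadata object")
        resolved.append({**item, "metadata": current})
    return resolved
```

Note that the result depends on list order: if statements are separated from the metadata object that precedes them, the association is lost.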
After a discussion with @kindly, this is a proposal on how to implement metadata in BODS. Add to the schema:
There are three top-line publication pattern styles and we'll need to offer guidance about which to use when:
Where metadata changes within a dataset, lists conforming to any of the above publication patterns can be appended to each other.
*A URL containing licence info about the dataset, so that data doesn't have to be republished if/when the licence changes.
**We're going to need to produce a guidance page for publishers on Dates: looking at all the objects in the schema that have dates and making sure that it's absolutely clear to publishers what date is expected where. |
Some questions and reflections:
Version
Have you considered using JSON Schema's Anyhow, I think it makes sense to have a metadata property for the version.
Extensions
I see only one discussion of an extension and no documentation for defining an extension. Shouldn't the extension mechanism be developed before adding its metadata property? In any case, the
Publisher
What are the use cases for this property? If a user cares about data's provenance (not a statement's provenance, which BODS covers), they typically store the URL at which the data was retrieved, whose domain name generally serves well enough to identify the publisher. Are there real use cases that aren't served by that approach?
Source URL
What are the use cases for this property? If it's to self-identify the canonical URL at which the statement was published, isn't it possible: for these URLs to change? for different publishers to re-publish the same statement? etc. Don't data users satisfy this use case by storing the URL at which data was retrieved?
License
Do you mean the property will be defined as e.g. "A permanent URL with licensing information, but not a direct link to the license itself"? That circumvents the republication issue, but introduces a new issue, in that publishers need to maintain permanent URLs (a big ask). I suspect the URL is likely to be the dataset's web page, on which there'd be a description of the license – in which case, it's not likely to be permanent. (I have never seen a page that was solely to describe the license of a single dataset.) Taking a step back: What is the motivation for providing the license within the data? Is the use case of determining the license not well served already? Is there a real problem being solved?
Date of publication
Indeed – in OCDS, we've seen dates interpreted as: when this information was first published (in OCDS or not); when this data was first published as OCDS; when this data was most recently published, etc. Anyhow, I think it makes sense to have a metadata property for the date of publication.
Cascades
If I were a publisher, and if I had statements with different metadata, instead of trying to minimize data size by using cascades and references, I'd instead just repeat the metadata every time. It'd be much easier to implement with less likelihood of error. Publishing cascades seems very error-prone… I'd be in favor of not having references and not having cascades, and instead only offering the first publication pattern. The only downside is a small-percentage increase in dataset sizes.
Pseudo-packaging
As I understand, the order of statements in a BODS list is unimportant except that referenced statements must appear before referring statements. With this new rule, the first statement becomes special. It seems like, instead of putting the metadata in a new resource (like in OCDS, for example), you decided to put it in the first resource in the list for practical reasons, not because conceptually it belongs there. This seems wrong, and may lead to issues if statements become separated… |
@jpmckinney - Thanks so much for your input on this. It's prompted some rethinking on my part. (And I'd be grateful to have @kindly's thoughts on this too.) For my part:
Extensions
Yes, I don't think this property is for inclusion yet. But I thought I'd mention it for completeness' sake.
Publisher and Source URL
Taking your comments on these things together, I've realised that we'll need real clarity on what BODS metadata is. An obvious point, I know! We already have statement-metadata published within statements (e.g. the Source object, which already has source.description and source.url). We're now suggesting that we have pseudo-package-metadata published at the statement level. So I think that we probably need to call the new metadata field, not Source URL, but Publisher URL. The Publisher (and Publisher URL) field values may be different from the Source and Source URL where statements are being republished on another platform. As you say 'If a user cares about data's provenance (not a statement's provenance, which BODS covers), they typically store the URL at which the data was retrieved'. True, but a statement may be published at the originating source, ingested and passed on through several platforms before reaching an end user. So a user might care about the data's provenance enough to want an audit trail of the platforms it's been through. Does this theoretical user exist? Thoughts on that welcome; we're in an evolving domain after all. If that use case is compelling then the pseudo-package-metadata would need to be mutable (unlike the rest of a statement) and each publisher of a statement would append their details to a Publisher array. I'd be interested in others' thoughts on this.
License
I think the use case is 'platforms re-publishing data'. I'd like to hear the thoughts of @stevenday and @laurenceOO who work on OpenOwnership's Global Register. And the more I think about it, the more I think that this field belongs at statement-metadata level and not 'pseudo-package-metadata-level'. In which case the field value should be a direct link to the licence itself. After all, surely it is the licence at the point of original publication of the statement which would apply to its (re-)use?
Pseudo-packaging, cascade logic, & metadata publication patterns
Indeed, and that is a good reason to recommend that publishers embed metadata in each statement. I think the rationale for allowing (but not encouraging) publishers to make use of cascade logic for implying pseudo-package-metadata for statements is that the metadata 'stamp' on each statement might - at a certain point - create too much bloat. That becomes a real concern if we think that metadata might accrete to otherwise immutable statements as they are ingested and exported by multiple platforms.
Version
Will think on JSON Schema's $schema property and get some thoughts down later when I've got a moment. |
Thanks for these notes!
Publisher
To clarify:
I'm not aware of a use case that needs the full chain of re-publishers. Adding oneself to a chain of publishers is an extra step for aggregators, etc. The most likely thing a careful user would do with the publisher metadata is to find the original publisher's statement and compare (to verify that immutability was respected); I don't see what they would do with the intermediate publishers… For context, in OCDS, we have a So, I think a publisher block on each statement makes sense, but not as an array.
License
I understand the motivation, but in practice putting licenses in data files breaks down, because licensing decisions are made outside the publication process. For example, if the UK makes a policy decision to switch from its Open Government License to a Creative Commons license, it simply changes the license metadata for the datasets in its data catalog (which it can do in a bulk operation). The users of those datasets would access the dataset's metadata (whether via API or browser) to see the new license. Now, if there were a direct link to the license at the statement level, then the UK would have to re-publish every statement with a new license. This is an extra hurdle that it doesn't face for any other dataset (except OCDS, for which my concerns are the same). If it publishes all its statements via an API, it might only need to replace one constant in its code. But BODS (and OCDS) allow bulk downloads, which are generally more complicated to regenerate, especially in the case of historical data, for which the original databases may no longer be online, the original scripts no longer operational, etc. So, it becomes a major headache just to update the license with a correct URL. Generally, a data catalog with a metadata API is the appropriate solution for indicating the licenses of datasets. It's not much trouble for data users to have to check the license in such a catalog.
I have a project that pulls the ward boundaries of hundreds of municipalities in Canada, and the process is: (1) copy-paste the license URL when first adding the dataset to the project, (2) when pulling in new data, check that the URL still responds with HTTP 200. A more robust system can check for changes to the license text, but in practice this has worked for close to a decade, because most municipalities are not very good at persistent URLs. |
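The two-step process in the comment above (record the license URL once, then confirm it still answers with HTTP 200 on each pull) could be sketched roughly as follows. The function name and the injectable `opener` parameter are assumptions made for this illustration, not part of the project being described.

```python
import urllib.error
import urllib.request


def license_url_ok(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if the recorded license URL still responds with HTTP 200.

    `opener` defaults to urllib's urlopen; it is injectable only so the
    check can be exercised without network access (an illustrative choice).
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, TimeoutError):
        # Covers HTTP error statuses, DNS failures, refused connections, timeouts.
        return False
```

A stricter variant could also compare the fetched license text against a stored copy, as the comment suggests, at the cost of false alarms whenever the page layout changes.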
Thanks for the ping! I think I tend to agree with @jpmckinney that it's not too onerous on users to check elsewhere for licensing terms, and it's asynchronous with the act of actually using the data. What the OO register would probably do is defer to each independent publisher's licensing (from a centralised licensing page perhaps, since we'll have a lot to link to). Then try to highlight which data came from where so that users know what they can do with which data. To that end, I also agree that publisher details on each Statement would make sense, because that's the key thing in an aggregated bulk data sense - knowing where you can look/being able to identify it - rather than having the license there directly.
Just a technical point, but I wonder if this is a mis-placed concern, given the likelihood for data to be gzipped or similar? Duplicated data like this would compress down to a very similar filesize to the cascaded version being suggested I think. |
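One rough way to test the compression point above is to build the same synthetic dataset twice, once with metadata repeated on every statement and once with a single leading metadata object, and compare sizes before and after gzip. The field names and values below are placeholders chosen for the illustration, and the sequential IDs compress better than real data would, so treat the result as indicative only.

```python
import gzip
import json

# Placeholder metadata block; field names and values are illustrative only.
METADATA = {
    "publisher": "Example Registry",
    "publisherUrl": "https://example.org",
    "publicationDate": "2019-01-01",
    "schemaVersion": "0.2",
}


def build_datasets(count=1000):
    """Return (repeated, cascaded) versions of the same synthetic statements."""
    statements = [
        {"statementID": f"statement-{i:06d}", "statementType": "entityStatement"}
        for i in range(count)
    ]
    repeated = [{**s, "metadata": METADATA} for s in statements]
    cascaded = [{"metadata": METADATA}] + statements
    return repeated, cascaded


def size_ratios(count=1000):
    """Return (raw_ratio, gzip_ratio) of repeated size over cascaded size."""
    repeated, cascaded = build_datasets(count)
    raw_repeated = json.dumps(repeated).encode("utf-8")
    raw_cascaded = json.dumps(cascaded).encode("utf-8")
    raw_ratio = len(raw_repeated) / len(raw_cascaded)
    gzip_ratio = len(gzip.compress(raw_repeated)) / len(gzip.compress(raw_cascaded))
    return raw_ratio, gzip_ratio


if __name__ == "__main__":
    raw, zipped = size_ratios()
    print(f"raw size ratio: {raw:.2f}, gzipped size ratio: {zipped:.2f}")
```

If the gzipped ratio comes out much closer to 1 than the raw ratio, that supports the view that repeating metadata on every statement costs little once files are compressed.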
Thanks @jpmckinney and @stevenday
Publisher
Yes
Yes
It looks like we're settling on the idea that Publisher and Publisher URL relate to the original publisher of the statement. (And are therefore immutable.) There's an outstanding question about whether it's useful to have re-publisher metadata on statements. We can probably park that for the moment, until the 0.3 release. @stevenday - what do you think?
License
OK. I'm thinking that we still include a Licence_info_URL, but not make it mandatory. To quote @jpmckinney, allow folk to include "A permanent URL with licensing information, but not a direct link to the license itself". Yes, it might change, but we can direct publishers that changes to metadata need not warrant re-publishing of statements.
Overall
Your point on the non-relevance of bloat is taken, @stevenday. I'm coming to the conclusion that all of the metadata we're talking about really applies to the statement level anyway (there is no pseudo-package to speak of) and so we should not allow the cascade logic metadata publishing pattern. |
I agree. We've discussed this a little bit in the Register, but I don't think we have a clear enough idea of what we're actually going to publish (or re-publish) to do anything other than parking it at the moment. A lot of our concerns are with how we attribute the constituent parts of the data we've aggregated and transformed, rather than simply re-publishing others' data. In the re-publishing case, I think it would be fine to only have it retain the original publisher. A user would hopefully be aware they're not getting it from the original source (and could check we'd not changed it if they really cared, as @jpmckinney suggests). |
In conversation with @kindly we had the following thoughts on: Version
The main reason for not using the $schema property is that BODS-compliant data isn't solely defined by its JSON schema. Anything more you wanted to add on that @kindly? |
I don't know that There may be other reasons to not use |
Regarding license, I'd be on the conservative side of suggesting that it be left out (even if it points to a licensing page instead of the licence itself) until there is more demand from stakeholders. What has been the demand from BODS publishers, users, or other stakeholders? Otherwise, sounds good regarding Publisher and Publisher URL. |
@jpmckinney - I think that @kindly had a concern that $schema didn't play nicely with semantic versioning and patched version of schemas. I'll ask him to clarify. And on demand for 'licence' to be included in metadata: @stevenday, any thoughts? |
To be honest I don't have any particularly strong thoughts either way, but speaking as a potential user of BODS data, I don't think I'd use it. I guess that means I agree we should look for someone who has a strong desire for it before we include it? |
Just to share my most recent thinking on this....
Proposal no. 2
Superseding proposal 1
I'm tempted to say that one of publisher and publisherUrl would be required. And publicationDate and schemaVersion would also be required. This seems much simpler than the first proposal. Thoughts? |
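The requirement rule floated above (at least one of publisher and publisherUrl, plus publicationDate and schemaVersion) can be expressed as a small check. This is only a sketch of the rule as stated in that comment: the function name is made up for illustration, and the field names may not match what the schema finally adopted.

```python
def check_metadata(metadata):
    """Return a list of problems with a metadata block under the tentative rule.

    Rule as floated in the thread: at least one of publisher / publisherUrl,
    and both publicationDate and schemaVersion, must be present.
    """
    problems = []
    if not (metadata.get("publisher") or metadata.get("publisherUrl")):
        problems.append("one of publisher or publisherUrl is required")
    for field in ("publicationDate", "schemaVersion"):
        if not metadata.get(field):
            problems.append(f"{field} is required")
    return problems
```

For example, a block carrying only publisherUrl, publicationDate and schemaVersion would pass, while an empty block would yield three problems.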
Looks good to me. I suggest either (1) Update: Schema.org uses |
Looks good to me too. I think there still needs to be a field that contains the version in an accessible way. Just having I also believe any Nonetheless, I would be happy with both the |
^ That makes sense to me! |
@kindly - Does the source -> compile system we're using for BODS complicate use of $schema? |
If there's a possibility of future publisher-related fields, the second option is better, to avoid having lots of |
@jpmckinney - I take your point but I'm willing to go out on a limb and say that there aren't likely to be more publisher* fields. (Whereas I can imagine the whole publicationDetails block being re-used for a republisher down the line.) With the re-thinking on field names, we now have: A publicationDetails block, containing:
I know @kindly would like to see a (non-required) licence field in here. Have you got any thoughts on that @laurenceOO? |
In OCDS, there are two more fields for publishers (identifier and identifier scheme). I don't think these are needed in BODS at this time, but if added in future, it would be inconsistent with the rest of the schema to have a pattern of prefixing property names that describe the same real-world thing with the name of the thing (as done here), instead of the (more consistent) pattern of having an object with the name of the thing, and then sub-properties to describe the thing. What is the disadvantage of using a |
(laurence here): I think it would be quite useful to have a licence field:
|
@jpmckinney - re On the licence field question:
Yup. And that use case comes up against the immutability of statements. I think we've come to the conclusion in this discussion that this iteration of a metadata statement isn't going to handle the republishing scenario. The metadata in the proposed publicationDetails block applies to the original act of publication only (and is hence immutable).
This is a very good reason to include it, imo. |
Agree with the reasoning on license @kd-ods. Thanks for your work on this. A publisher object makes sense to me. |
If a property for license is included, I recommend the more common spelling |
@kindly - I'm working on a new 0.2-dev-metadata-135 branch to implement this. How does this look to you? Seems to work as expected when tested. |
See PR #193. Ready for review now @ScatteredInk or @kindly. |
Adds required publicationDetails component to statements - closes #135
Closing - this is done for now. Thanks all. |
Expanded from #126
0.1-rc:
id
to point to a URI.
For (1), the question is whether changing this will affect our tooling and the use of sub-schemas.
For (2), @kindly and @odscjames have suggested the option of a metadata statement as the first item in the top-level array.