Metadata #135

Closed
ScatteredInk opened this issue Nov 21, 2018 · 28 comments

@ScatteredInk
Collaborator

ScatteredInk commented Nov 21, 2018

Expanded from #126

0.1-rc:

  1. does not use id to point to a URI.
  2. does not have a way to point to the version in the data itself.

For (1), the question is whether changing this will affect our tooling and the use of sub-schemas.

For (2), @kindly and @odscjames have suggested the option of a metadata statement as the first item in the top-level array.

ScatteredInk added this to the 0.2-rc milestone Nov 21, 2018
@ScatteredInk
Collaborator Author

Some useful points to draw on from this discussion in OC4I: open-contracting/infrastructure#89

In the last conversation we had about this, we talked about a pattern of embedding metadata in the first object in the array and then assuming that everything that followed had the same metadata, until a new metadata object appeared.

@kd-ods
Collaborator

kd-ods commented May 3, 2019

After a discussion with @kindly, this is a proposal on how to implement metadata in BODS.

Add to the schema:

  • A metadata component to be included in each statement type. Not a required property.
  • A metadata reference property (name to be decided) whose value is a statementID. Not a required property.
  • A statement may have either or neither of the above but not both.
  • The first statement in a BODS list must have a metadata object.
  • Metadata properties will be: Schema version, Publisher, License*, Date of publication**, Source_URI, Extensions
  • The metadata of a given statement is determined cascade-style (if it doesn't meet the first condition, then the second is considered and so on), and is given by:
    • Metadata contained in the statement
    • An explicit metadata reference to a previous statement in the list
    • The metadata, derived or otherwise, of the previous statement in the list
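
A minimal sketch of the cascade described in this list, not a definitive implementation. The property names `metadata` and `metadataRef` are assumptions (the thread leaves the reference property's name undecided), and `resolve_metadata` is a hypothetical helper.

```python
def resolve_metadata(statements):
    """Return the effective metadata for each statement, in list order.

    Assumed property names: "metadata" (embedded object) and
    "metadataRef" (statementID of an earlier statement).
    """
    by_id = {}      # statementID -> effective metadata of that statement
    resolved = []   # effective metadata, parallel to `statements`

    for index, statement in enumerate(statements):
        own = statement.get("metadata")
        ref = statement.get("metadataRef")

        if own is not None and ref is not None:
            raise ValueError("a statement may not have both metadata and a reference")
        if index == 0 and own is None:
            raise ValueError("the first statement must have a metadata object")

        if own is not None:
            # 1. Metadata contained in the statement itself.
            effective = own
        elif ref is not None:
            # 2. Explicit reference to a previous statement in the list.
            if ref not in by_id:
                raise ValueError(f"metadataRef {ref!r} does not match an earlier statement")
            effective = by_id[ref]
        else:
            # 3. Metadata (derived or otherwise) of the previous statement.
            effective = resolved[-1]

        by_id[statement["statementID"]] = effective
        resolved.append(effective)

    return resolved


example = [
    {"statementID": "a", "metadata": {"schemaVersion": "0.1"}},
    {"statementID": "b"},
    {"statementID": "c", "metadata": {"schemaVersion": "0.2"}},
    {"statementID": "d", "metadataRef": "a"},
    {"statementID": "e"},
]
versions = [m["schemaVersion"] for m in resolve_metadata(example)]
```

Note that statement "e" inherits from "d", whose metadata was itself derived by reference, which is the "derived or otherwise" case in the third rule.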

There are three top-line publication pattern styles and we'll need to offer guidance about which to use when:

  1. Metadata in all statements in a list (@kindly's preference is that this should be the default advice to publishers)
  2. Publish metadata in the list's first statement and first statement only (all subsequent statements then share that metadata).
  3. Publish metadata in the list's first statement then, in other subsequent statements, include an explicit reference to the first statement's metadata. (This makes explicit what's happening in 2.)

Where metadata changes within a dataset, lists conforming to any of the above publication patterns can be appended to each other.

*A url containing licence info about the dataset, so that data doesn't have to be republished if/when the licence changes.

**We're going to need to produce a guidance page for publishers on Dates: looking at all the objects in the schema that have dates and making sure that it's absolutely clear to publishers what date is expected where.

kd-ods self-assigned this May 3, 2019
@jpmckinney

jpmckinney commented May 4, 2019

Some questions and reflections:

Version

Have you considered using JSON Schema's $schema property to point to the version of the BODS schema being used? It has better tool support than a custom property.

Anyhow, I think it makes sense to have a metadata property for the version.

Extensions

I see only one discussion of an extension and no documentation for defining an extension. Shouldn't the extension mechanism be developed before adding its metadata property?

In any case, the $schema property can point to an extended schema (see open-contracting/standard#426 (comment)). The dialect of JSON Schema used by BODS and OCDS uses filenames instead of URLs to refer to codelists, so an extended schema on its own would lack the codelists; it may be worth considering URLs for codelists.

Publisher

What are the use cases for this property?

If a user cares about data's provenance (not a statement's provenance, which BODS covers), they typically store the URL at which the data was retrieved, whose domain name generally serves well enough to identify the publisher.

Are there real use cases that aren't served by that approach?

Source URL

What are the use cases for this property?

If it's to self-identify the canonical URL at which the statement was published, isn't it possible for these URLs to change, or for different publishers to re-publish the same statement, etc.?

Don't data users satisfy this use case by storing the URL at which data was retrieved?

License

Do you mean the property will be defined as e.g. "A permanent URL with licensing information, but not a direct link to the license itself"? That circumvents the republication issue, but introduces a new issue, in that publishers need to maintain permanent URLs (a big ask). I suspect the URL is likely to be the dataset's web page, on which there'd be a description of the license – in which case, it's not likely to be permanent. (I have never seen a page that was solely to describe the license of a single dataset.)

Taking a step back: What is the motivation for providing the license within the data? Is the use case of determining the license not well served already? Is there a real problem being solved?

Date of publication

making sure that it's absolutely clear to publishers what date is expected where

Indeed – in OCDS, we've seen dates interpreted as: when this information was first published (in OCDS or not); when this data was first published as OCDS; when this data was most recently published, etc.

Anyhow, I think it makes sense to have a metadata property for the date of publication.

Cascades

If I were a publisher, and if I had statements with different metadata, instead of trying to minimize data size by using cascades and references, I'd instead just repeat the metadata every time. It'd be much easier to implement with less likelihood of error. Publishing cascades seems very error-prone…

I'd be in favor of not having references and not having cascades, and instead only offering the first publication pattern. The only downside is a small-percentage increase in dataset sizes.

Pseudo-packaging

The first statement in a BODS list must have a metadata object

As I understand, the order of statements in a BODS list is unimportant except that referenced statements must appear before referring statements. With this new rule, the first statement becomes special. It seems like, instead of putting the metadata in a new resource (like in OCDS, for example), you decided to put it in the first resource in the list for practical reasons, not because conceptually it belongs there. This seems wrong, and may lead to issues if statements become separated…

@kd-ods
Collaborator

kd-ods commented May 8, 2019

@jpmckinney - Thanks so much for your input on this. It's prompted some rethinking on my part. (And I'd be grateful to have @kindly's thoughts on this too.) For my part:

Extensions

Yes, I don't think this property is for inclusion yet. But I thought I'd mention it for completeness' sake.

Publisher and Source URL

Taking your comments on these things together, I've realised that we'll need real clarity on what BODS metadata is. An obvious point, I know!

We already have statement-metadata published within statements (e.g. the Source object, which already has source.description and source.url). We're now suggesting that we have pseudo-package-metadata published at the statement level. So I think that we probably need to call the new metadata field, not Source URL, but Publisher URL.

The Publisher (and Publisher URL) field values may be different from the Source and Source URL where statements are being republished on another platform. As you say 'If a user cares about data's provenance (not a statement's provenance, which BODS covers), they typically store the URL at which the data was retrieved'. True, but a statement may be published at the originating source, ingested and passed on through several platforms before reaching an end user. So a user might care about the data's provenance enough to want an audit trail of the platforms it's been through. Does this theoretical user exist? Thoughts on that welcome; we're in an evolving domain after all.

If that use case is compelling then the pseudo-package-metadata would need to be mutable (unlike the rest of a statement) and each publisher of a statement would append their details to a Publisher array. I'd be interested in others' thoughts on this.

License

Taking a step back: What is the motivation for providing the license within the data? Is the use case of determining the license not well served already? Is there a real problem being solved?

I think the use case is 'platforms re-publishing data'. I'd like to hear the thoughts of @stevenday and @laurenceOO who work on OpenOwnership's Global Register.

And the more I think about it, the more I think that this field belongs at statement-metadata level and not 'pseudo-package-metadata-level'. In which case the field value should be a direct link to the licence itself. After all, surely it is the licence at the point of original publication of the statement which would apply to its (re-)use?

Pseudo-packaging, cascade logic, & metadata publication patterns

[the first statement becoming special] seems wrong, and may lead to issues if statements become separated…

Indeed, and is a good reason to recommend that publishers embed metadata in each statement.

I think the rationale for allowing (but not encouraging) publishers to make use of cascade logic for implying pseudo-package-metadata for statements is that the metadata 'stamp' on each statement might - at a certain point - create too much bloat. That becomes a real concern if we think that metadata might accrete to otherwise immutable statements as they are ingested and exported by multiple platforms.

Version

Will think on JSON Schema's $schema property and get some thoughts down later when I've got a moment.

@jpmckinney

jpmckinney commented May 8, 2019

Thanks for these notes!

Publisher

To clarify:

I'm not aware of a use case that needs the full chain of re-publishers. Adding oneself to a chain of publishers is an extra step for aggregators, etc. The most likely thing a careful user would do with the publisher metadata is to find the original publisher's statement and compare (to verify that immutability was respected); I don't see what they would do with the intermediate publishers…

For context, in OCDS, we have a publisher block at the package-level, and we intend to add a publisher block at the release-level (statement-level in BODS); see open-contracting/standard#325 (comment). The block will be immutable and about the original publisher. Our reasons are: (1) Releases can be separated from their packages, so it's relevant to put the publisher on the release. (2) Re-publishers can re-package releases (and therefore put themselves as the publishers of those), but the releases themselves should remain immutable (and keep their original publishers).

So, I think a publisher block on each statement makes sense, but not as an array.

License

I understand the motivation, but in practice putting licenses in data files breaks down – because licensing decisions are made outside the publication process.

For example, if the UK makes a policy decision to switch from its Open Government License to a Creative Commons license, it simply changes the license metadata for the datasets in its data catalog (which it can do in a bulk operation). The users of those datasets would access the dataset's metadata (whether via API or browser) to see the new license.

Now, if there were a direct link to the license at the statement-level, then the UK would have to re-publish every statement with a new license. This is an extra hurdle that it doesn't face for any other dataset (except OCDS, for which my concerns are the same). If it publishes all its statements via an API, it might only need to replace one constant in its code. But BODS (and OCDS) allow bulk downloads, which are generally more complicated to regenerate, especially in the case of historical data, for which the original databases may no longer be online, the original scripts no longer operational, etc. So, it becomes a major headache just to update the license with a correct URL.

Generally, a data catalog with a metadata API is the appropriate solution for indicating the licenses of datasets. It's not much trouble for data users to have to check the license in such a catalog. I have a project that pulls the ward boundaries of hundreds of municipalities in Canada, and the process is: (1) copy-paste the license URL when first adding the dataset to the project, (2) when pulling in new data, check that the URL still responds with HTTP 200. A more robust system can check for changes to the license text, but in practice this has worked for close to a decade, because most municipalities are not very good at persistent URLs.
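
Step (2) of that process could look roughly like the sketch below. It is illustrative only: `fetch_status` and `license_url_still_ok` are hypothetical names, and the status lookup is passed in as a parameter so the check can be exercised without network access.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for `url`, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; report their status code.
        return err.code
    except OSError:
        # Covers URLError (DNS failure, refused connection) and timeouts.
        return None


def license_url_still_ok(url, get_status=fetch_status):
    """True if the stored license URL still responds with HTTP 200."""
    return get_status(url) == 200
```

As the comment above notes, a more robust check would also compare the license text against a stored copy; this sketch only covers the HTTP 200 test.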

@stevenday

I think the use case is 'platforms re-publishing data'. I'd like to hear the thoughts of @stevenday and @laurenceOO who work on OpenOwnership's Global Register.

Thanks for the ping! I think I tend to agree with @jpmckinney that it's not too onerous on users to check elsewhere for licensing terms, and it's asynchronous with the act of actually using the data.

What the OO register would probably do is defer to each independent publisher's licensing (from a centralised licensing page perhaps, since we'll have a lot to link to). Then try to highlight which data came from where so that users know what they can do with which data. To that end, I also agree that publisher details on each Statement would make sense, because that's the key thing in an aggregated bulk data sense - knowing where you can look/being able to identify it - rather than having the license there directly.

I think the rationale for allowing (but not encouraging) publishers to make use of cascade logic for implying pseudo-package-metadata for statements is that the metadata 'stamp' on each statement might - at a certain point - create too much bloat.

Just a technical point, but I wonder if this is a misplaced concern, given how likely it is that data will be gzipped or similar? I think duplicated data like this would compress down to a very similar filesize to the cascaded version being suggested.
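
One way to check this is to measure it. The sketch below builds two synthetic datasets, one repeating a metadata block on every statement and one carrying it on the first statement only, and compares the extra bytes before and after gzip. Everything here is an assumption made for the comparison: the field names, the values, and the statement shape. Synthetic data is far more repetitive than real statements, so the numbers only indicate the direction of the effect.

```python
import gzip
import json

# Hypothetical metadata block; field names loosely follow the list proposed
# earlier in this thread, values are made up.
METADATA = {
    "schemaVersion": "0.1",
    "publisher": "Example Register",
    "license": "https://example.org/licence",
    "publicationDate": "2019-05-13",
    "sourceUri": "https://example.org/data/statements.json",
}


def build_dataset(count, repeat_metadata):
    """Serialise `count` synthetic statements as a JSON array (UTF-8 bytes)."""
    statements = []
    for i in range(count):
        statement = {
            "statementID": f"statement-{i:06d}",
            "statementType": "entityStatement",
        }
        # Cascade style keeps metadata on the first statement only.
        if repeat_metadata or i == 0:
            statement["metadata"] = METADATA
        statements.append(statement)
    return json.dumps(statements).encode("utf-8")


repeated = build_dataset(5000, repeat_metadata=True)
cascaded = build_dataset(5000, repeat_metadata=False)

raw_overhead = len(repeated) - len(cascaded)
gzip_overhead = len(gzip.compress(repeated)) - len(gzip.compress(cascaded))

print(f"extra bytes from repeating metadata, uncompressed: {raw_overhead}")
print(f"extra bytes from repeating metadata, gzipped:      {gzip_overhead}")
```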

@kd-ods
Collaborator

kd-ods commented May 13, 2019

Thanks @jpmckinney and @stevenday

Publisher

Wouldn't the Publisher be different from the Source in all cases except where the Publisher and Source are both an official register? (from looking at the source type codelist)

Yes

As I understand, the Publisher URL would identify the publisher (https://www.gov.uk/government/organisations/companies-house), not the dataset (http://download.companieshouse.gov.uk/en_pscdata.html) or the pseudo-package.

Yes

It looks like we're settling on the idea that Publisher and Publisher URL relate to the original publisher of the statement. (And are therefore immutable.)

There's an outstanding question about whether it's useful to have re-publisher metadata on statements. We can probably park that for the moment, until the 0.3 release. @stevenday - what do you think?

License

OK. I'm thinking that we still include a Licence_info_URL, but not make it mandatory. To quote @jpmckinney, allow folk to include "A permanent URL with licensing information, but not a direct link to the license itself". Yes, it might change, but we can direct publishers that changes to metadata need not warrant re-publishing of statements.

Overall

Your point on the non-relevance of bloat is taken, @stevenday. I'm coming to the conclusion that all of the metadata we're talking about really applies to the statement level anyway (there is no pseudo-package to speak of) and so we should not allow the cascade logic metadata publishing pattern.

@stevenday

There's an outstanding question about whether it's useful to have re-publisher metadata on statements. We can probably park that for the moment, until the 0.3 release. @stevenday - what do you think?

I agree. We've discussed this a little bit in the Register, but I don't think we have a clear enough idea of what we're actually going to publish (or re-publish) to do anything other than parking it at the moment.

A lot of our concerns are with how we attribute the constituent parts of the data we've aggregated and transformed, rather than simply re-publishing others' data. In the re-publishing case, I think it would be fine to only have it retain the original publisher. A user would hopefully be aware they're not getting it from the original source (and could check we'd not changed it if they really cared, as @jpmckinney suggests).

@kd-ods
Collaborator

kd-ods commented May 13, 2019

In conversation with @kindly we had the following thoughts on:

Version

Have you considered using JSON Schema's $schema property to point to the version of the BODS schema being used? It has better tool support than a custom property.

The main reason for not using the $schema property is that BODS-compliant data isn't solely defined by its JSON schema.

Anything more you wanted to add on that @kindly?

@jpmckinney

I don't know that $schema has the semantics "this fully defines compliant data". I think it has at least the semantics, "If the JSON data doesn't validate against the $schema, then the JSON data is non-compliant". i.e. the semantics of $schema are that it sets necessary but not sufficient criteria for compliance.

There may be other reasons to not use $schema, but I think the point about not being solely defined by the JSON Schema isn't a reason against using it. It seems useful to take advantage of existing tool support for $schema, even if it isn't the only rule that BODS data needs to respect (I assume that, like OCDS, there are other rules not expressible in either schema or codelists).

@jpmckinney

Regarding license, I'd be on the conservative side of suggesting that it be left out (even if it points to a licensing page instead of the licence itself) until there is more demand from stakeholders.

What has been the demand from BODS publishers, users, or other stakeholders?

Otherwise, sounds good regarding Publisher and Publisher URL.

@kd-ods
Collaborator

kd-ods commented May 17, 2019

@jpmckinney - I think that @kindly had a concern that $schema didn't play nicely with semantic versioning and patched versions of schemas. I'll ask him to clarify. And on demand for 'licence' to be included in metadata: @stevenday, any thoughts?

@stevenday

And on demand for 'licence' to be included in metadata: @stevenday, any thoughts?

To be honest I don't have any particularly strong thoughts either way, but speaking as a potential user of BODS data, I don't think I'd use it. I guess that means I agree we should look for someone who has a strong desire for it before we include it?

@kd-ods
Collaborator

kd-ods commented May 17, 2019

Just to share my most recent thinking on this....

Proposal no. 2

Superseding proposal 1

  • Have a required publicationDetails block on each statement.
  • In the 0.2 release it would contain the following properties
    • publisher
    • publisherUrl
    • publicationDate
    • schemaVersion (use of $schema yet to be decided)
    • (possibly licence info too. Still under discussion)
  • The block would be immutable and apply to the original publisher of the statement.

I'm tempted to say that one of publisher and publisherUrl would be required. And publicationDate and schemaVersion would also be required.

This seems much simpler than the first proposal. Thoughts?

@jpmckinney

jpmckinney commented May 17, 2019

Looks good to me. I suggest either (1) publisherName instead of publisher; or (2) make publisher an object with properties name and url.

Update: Schema.org uses datePublished instead of publicationDate. The choice can depend on which name is more consistent with other date properties.

@kindly
Collaborator

kindly commented May 20, 2019

Looks good to me too.

I think there still needs to be a field that contains the version in an accessible way.

Just having $schema means users will have to parse the URL. For example, the schema URL would be something like http://standard.openownership.org/schema/v0.1.1/entity.json, so any user would need to know how to extract the v0.1.1 part if they wanted a SemVer version of the schema. We could argue that the whole URL, not just the SemVer bit, is the label that marks the version, but I am not sure that is a good option, as you would need to quote the whole URL each time you refer to a version. Also, the URL may not contain dots: for example, http://standard.openownership.org/en/v0-1/ is currently the docs URL, which makes parsing even harder.

I also believe any $schema published for a given version/patch should be immutable, so that we do not change the validation after publication. For example, if the $schema was http://standard.openownership.org/schema/v0.1/entity.json then we should never remove or change the schema at that URL. This means that if we do a patch release at http://standard.openownership.org/schema/v0.1.1/entity.json, we should not go and update http://standard.openownership.org/schema/v0.1/entity.json with the latest patch. So in the $schema we should encourage putting the exact patch release. However, if we do that then it is even harder to parse out the version, i.e. the 0.1, that we want to say the statement conforms to.

Nonetheless, I would be happy with both the $schema with major.minor.patch URL saying what this statement technically validates against AND a version field saying simply this statement conforms to everything within major.minor outside pure technical validation.
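
To illustrate the parsing burden described above, here is a rough sketch of what a data user would need if $schema were the only version marker. The regular expression and the helper names are assumptions, and the URLs are the "something like" examples from this comment, not confirmed schema locations.

```python
import re

# Matches a path segment such as /v0.1/ or /v0.1.1/ (assumed URL layout).
VERSION_SEGMENT = re.compile(r"/v(\d+)\.(\d+)(?:\.(\d+))?/")


def version_from_schema_url(url):
    """Return (major, minor, patch) parsed from a schema URL, or None.

    `patch` is None when the URL only names a major.minor release.
    """
    match = VERSION_SEGMENT.search(url)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch) if patch is not None else None)


def major_minor(url):
    """Return the "major.minor" label a separate version field would hold."""
    parsed = version_from_schema_url(url)
    if parsed is None:
        return None
    return f"{parsed[0]}.{parsed[1]}"


patch_url = "http://standard.openownership.org/schema/v0.1.1/entity.json"
docs_url = "http://standard.openownership.org/en/v0-1/"
```

With this pattern the patch URL yields the label "0.1", while the dash-separated docs URL does not parse at all. A plain version field alongside $schema avoids that fragility.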

@jpmckinney

^ That makes sense to me!

@kd-ods
Collaborator

kd-ods commented May 21, 2019

@kindly - Does the source -> compile system we're using for BODS complicate use of $schema?

@jpmckinney

I suggest either (1) publisherName instead of publisher; or (2) make publisher an object with properties name and url.

If there's a possibility of future publisher-related fields, the second option is better, to avoid having lots of publisher* fields.

@kd-ods
Collaborator

kd-ods commented May 22, 2019

@jpmckinney - I take your point but I'm willing to go out on a limb and say that there aren't likely to be more publisher* fields. (Whereas I can imagine the whole publicationDetails block being re-used for a republisher down the line.)

With the re-thinking on field names, we now have:

A publicationDetails block, containing:

  • publisherName (this or publisherUrl required)
  • publisherUrl (this or publisherName required)
  • publicationDate (required)
  • bodsVersion (major.minor release, required)
  • licence?
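
The requirement rules in this list can be sketched as a small check. This is an illustration under assumptions, not the schema that was eventually merged: the field names come from the list above, the ISO 8601 date format is assumed (the thread does not fix one), the undecided licence field is left out, and `publication_details_errors` is a hypothetical helper.

```python
import re
from datetime import date

# bodsVersion is described as a major.minor release, e.g. "0.2".
MAJOR_MINOR = re.compile(r"\d+\.\d+")


def publication_details_errors(details):
    """Return a list of problems found in a publicationDetails block."""
    errors = []

    # One of publisherName / publisherUrl is required.
    if not details.get("publisherName") and not details.get("publisherUrl"):
        errors.append("one of publisherName or publisherUrl is required")

    # publicationDate is required; an ISO 8601 date string is assumed here.
    publication_date = details.get("publicationDate")
    if not publication_date:
        errors.append("publicationDate is required")
    else:
        try:
            date.fromisoformat(publication_date)
        except (TypeError, ValueError):
            errors.append("publicationDate is not an ISO 8601 date")

    # bodsVersion is required and names a major.minor release.
    bods_version = details.get("bodsVersion")
    if not bods_version:
        errors.append("bodsVersion is required")
    elif not isinstance(bods_version, str) or not MAJOR_MINOR.fullmatch(bods_version):
        errors.append("bodsVersion must be a major.minor release such as 0.2")

    return errors


valid_block = {
    "publisherName": "Companies House",
    "publicationDate": "2019-05-22",
    "bodsVersion": "0.2",
}
```

If the publisher object discussed in this thread were adopted instead, the first check would look at `details["publisher"]["name"]` and `details["publisher"]["url"]` rather than the two prefixed fields.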

I know @kindly would like to see a (non-required) licence field in here. Have you got any thoughts on that @laurenceOO?

@jpmckinney

In OCDS, there are two more fields for publishers (identifier and identifier scheme). I don't think these are needed in BODS at this time, but if added in future, it would be inconsistent with the rest of the schema to have a pattern of prefixing property names that describe the same real-world thing with the name of the thing (as done here), instead of the (more consistent) pattern of having an object with the name of the thing, and then sub-properties to describe the thing.

What is the disadvantage of using a publisher object? This hasn't been explained.

@openownership-bot
Contributor

(laurence here): I think it would be quite useful to have a licence field:

  • to facilitate long chains of republishing, especially when the terms are restrictive (sharealike, NC, no-Derivs etc...) and have to be passed down the chain. there may be a separate consideration about being able to cater for an audit of the full chain in the schema itself.

  • to encourage the source publisher to even consider it. at the moment it is a great struggle, as a lot of national registers don't even have a legal provision about it, and i believe this could be part of our advocacy effort.

@kd-ods
Collaborator

kd-ods commented May 22, 2019

@jpmckinney - re publisher object, I'll think on it. I agree that if there's any chance there will be more than two publisher-related fields it would make sense.

On the licence field question:

to facilitate long chains of republishing, especially when the terms are restrictive (sharealike, NC, no-Derivs etc...) and have to be passed down the chain. there may be a separate consideration about being able to cater for an audit of the full chain in the schema itself.

Yup. And that use case comes up against the immutability of statements. I think we've come to the conclusion in this discussion that this iteration of a metadata statement isn't going to handle the republishing scenario. The metadata in the proposed publicationDetails block applies to the original act of publication only (and is hence immutable).

to encourage the source publisher to even consider it. at the moment it is a great struggle, as a lot of national registers don't even have a legal provision about it, and i believe this could be part of our advocacy effort.

This is a very good reason to include it, imo.

@ScatteredInk
Collaborator Author

Agree with the reasoning on license @kd-ods. Thanks for your work on this.

A publisher object makes sense to me.

@jpmckinney

If a property for license is included, I recommend the more common spelling license used by Dublin Core, Schema.org, etc.

@kd-ods
Collaborator

kd-ods commented May 24, 2019

@kindly - I'm working on a new 0.2-dev-metadata-135 branch to implement this. How does this look to you?

Seems to work as expected when tested.

@kd-ods
Collaborator

kd-ods commented Jun 7, 2019

See PR #193. Ready for review now @ScatteredInk or @kindly.

ScatteredInk added a commit that referenced this issue Jun 13, 2019
Adds required publicationDetails component to statements - closes #135
@ScatteredInk
Collaborator Author

Closing - this is done for now. Thanks all.
