This repository was archived by the owner on Jun 18, 2024. It is now read-only.

Articulate how to specify unpublished datasets #157

Closed
philipashlock opened this issue Sep 27, 2013 · 14 comments

Comments

@philipashlock
Contributor

The Policy clearly states that the public data listing "should include datasets that can be made publicly available but have not yet been released." (III > 3 > b) but the documentation does not clearly specify how to denote this.

One convention that has been discussed is to simply leave the accessURL field blank, but I think this makes it harder to do basic QA and validation. It would be better if there were a more explicit way of denoting an unpublished dataset. While perhaps equally imperfect, two other options within the schema would be to use an established convention within the 'accrualPeriodicity' or 'issued' fields, but those already have fairly strict conventions. I suppose the 'issued' field could be a date in the future, but that seems like it could easily be misused or not properly updated. It seems to me that ideally this would be another field among the “Common Core” Required Fields, but it's probably too late for that.

In any case, even if a blank accessURL is the best option, that should be documented somewhere.

@MarinaNitze
Contributor

I agree we could make this clearer in the documentation. However I'm not in favor of adding a manual indicator of whether or not a public dataset has been released yet, because I think that information will get stale. People will upload files and forget to change the indicator, etc.

@philipashlock
Contributor Author

Right, that was basically my concern about trying to reuse an existing field for that purpose. Then again, the concern about information not being updated could apply to every field and have bad consequences.

Whatever convention we decide to document, it would be great to reflect that in a report generated by a validator. If this is a dataset that you access by a URL (isn't that everything?) then "accessURL" is normally a required field.

With the blank "accessURL" convention:

If a dataset was entered and all the information was correct, including the "modified" or "issued" date, but the "accessURL" was accidentally left out, the validator would convey that it interpreted that entry as "unpublished".

However, if a dataset was entered and the "accessURL" was included and valid but "modified" and "issued" were left blank (or in the future), the validator would acknowledge that the dataset was published but throw an error for the missing required "modified" field. The entry might not even be ingested into the main data.gov catalog because it wasn't complete or well-formed enough.

Are there any situations where a dataset could be published and not have an "accessURL"? I don't think so, but wanted to check. Maybe the reason this field isn't considered part of “Common Core” Required Fields is because it should always exist except for entries with unpublished data.

Update: one instance where "accessURL" might be blank is if "webService" is used instead, so this convention would likely need to apply to both of those fields being blank.
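As a rough illustration of how a validator might apply the blank-field convention described in this comment, here is a minimal sketch. It is not the Project Open Data validator: `classify_entry` and its return shape are hypothetical, and it assumes each catalog entry has already been parsed into a Python dict.

```python
def classify_entry(entry):
    """Return (status, errors) for one dataset entry (a dict).

    Hypothetical helper: treats a missing, null, or empty field as blank.
    """
    def blank(field):
        value = entry.get(field)
        return value is None or str(value).strip() == ""

    errors = []
    # Both access fields blank: interpret the entry as an unpublished dataset.
    if blank("accessURL") and blank("webService"):
        status = "unpublished"
    else:
        status = "published"
        # A published dataset still needs the required "modified" date.
        if blank("modified"):
            errors.append("missing required field: modified")
    return status, errors
```

Note that this reproduces the weakness raised above: an entry whose "accessURL" was left out by accident is reported as "unpublished" rather than flagged as an error.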

@konklone
Contributor

Perhaps dangerous, depending on client logic, but you could potentially draw a distinction between a null value and a missing field.
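To make the distinction concrete, here is a small sketch of how a client could tell an absent "accessURL" key apart from an explicit null. The function name and sample records are made up for illustration and are not part of the schema.

```python
import json

def access_url_state(entry):
    """Distinguish an absent key from an explicit null (hypothetical helper)."""
    if "accessURL" not in entry:
        return "missing"
    if entry["accessURL"] is None:
        return "null"
    return "present"

records = json.loads('[{"title": "A"}, {"title": "B", "accessURL": null}]')
states = [access_url_state(r) for r in records]
# states == ["missing", "null"]
# A client that calls entry.get("accessURL") sees None in both cases,
# which is why relying on the distinction depends on client logic.
```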

@philipashlock
Contributor Author

That is another option, but it feels a bit more subtle and brittle. I think it's safest to equate missing fields with null.

After thinking this through a little more: if we were to follow a blank/null convention like the one discussed for accessURL, then it seems like it would be more appropriate to apply that approach to the modified field. That field is otherwise always required, so it doesn't have the either/or issues that accessURL and webService have, and it feels a little more semantically correct since this is meant to denote a known future state rather than a resource that doesn't exist at all. In any case, modified definitely shouldn't be listed as "Required: Yes, always" because there is no correct date to put there for an unpublished dataset. This is of course based on the interpretation of modified as being from the perspective of the public/published version rather than the government/internal version, though I guess that's not entirely clear either.

@gbinal
Contributor

gbinal commented Nov 1, 2013

Hey folks,

> Maybe the reason this field isn't considered part of “Common Core” Required Fields is because it should always exist except for entries with unpublished data.

Correct.

> Update: one instance where "accessURL" might be blank is if "webService" is used instead, so this convention would likely need to apply to both of those fields being blank.

Also correct. This is why both fields are Required if Applicable and not Required.

FWIW, I agree that adding a field just for this would cause as many problems as it helps. Also, the schema is locked at v1.0 for the immediate future, so our best move would be providing clarification in the documentation.

> In any case, even if a blank accessURL is the best option, that should be documented somewhere.

@philipashlock, would you be game to take a crack at a pull request that modifies the documentation to more clearly explain this?

@philipashlock philipashlock added this to the Next Version of Common Core Metadata Schema milestone May 8, 2014
@gbinal
Contributor

gbinal commented Jul 17, 2014

Some observations:

  • There seems to be strong consensus against creating a new field to address this issue.
  • Better documentation to clarify this would be useful.

@philipashlock philipashlock modified the milestone: Next Version of Common Core Metadata Schema (1.0 -> 1.1.) Jul 24, 2014
@haleyvandyck
Contributor

Agreed, @gbinal. Let's not add a new field in the next version, but update the guidance to clarify. @philipashlock, assigning this to you to take a pass at updated documentation that explains this more clearly, if that's cool!

@webmaven

You could have a convention that, to be valid with missing "accessURL" and "webService" fields, the "issued" date MUST also be in the future (and further, issue a warning if that date is too far in the future). The only other valid combination would be to have a URL and the date be in the past.

This would guard against most of the potential mistakes mentioned upthread.
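A minimal sketch of that rule, under several assumptions that the thread does not settle: "issued" is a plain `YYYY-MM-DD` string, "too far in the future" means more than a year (the `MAX_LEAD` value is arbitrary), and a date equal to today counts as the past. `check_entry` is a hypothetical name, not an existing validator function.

```python
from datetime import date, timedelta

MAX_LEAD = timedelta(days=365)  # assumption: the thread does not define "too far"

def check_entry(entry, today):
    """Return (errors, warnings) for one entry dict under the proposed rule."""
    def blank(field):
        return not (entry.get(field) or "").strip()

    errors, warnings = [], []
    issued = entry.get("issued")
    issued_date = date.fromisoformat(issued) if issued else None
    has_url = not (blank("accessURL") and blank("webService"))

    if has_url:
        # Published: a URL is present, so "issued" must not be in the future.
        if issued_date is None or issued_date > today:
            errors.append("entry with a URL needs an 'issued' date in the past")
    else:
        # Unpublished: no URL, so "issued" must be a future release date.
        if issued_date is None or issued_date <= today:
            errors.append("entry without a URL needs an 'issued' date in the future")
        elif issued_date - today > MAX_LEAD:
            warnings.append("'issued' date is far in the future")
    return errors, warnings
```

For example, with `today = date(2014, 9, 1)`, an entry with no URL and `"issued": "2014-12-01"` passes cleanly, while the same entry with `"issued": "2013-01-01"` produces an error.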

gbinal added a commit that referenced this issue Sep 8, 2014
@gbinal
Contributor

gbinal commented Sep 8, 2014

This is addressed by 26a9326

@dsmorgan77
Contributor

@gbinal: the change is good, but I'm not sure it fully addresses the issue. I think we still have a problem where the modified field shouldn't be required for an unpublished data set.

@webmaven

webmaven commented Oct 3, 2014

Is modified only intended to refer to the data set, or to the record about the data set as well?

@gbinal
Contributor

gbinal commented Oct 3, 2014

@webmaven - the dataset.

@dsmorgan77 - that's why I believe it would still be relevant for unpublished data. The field would be applicable to any dataset that exists.

@webmaven

webmaven commented Oct 3, 2014

Yeah, which would allow identifying particular high-value unpublished datasets and organizing to get them published. I like it.

@gbinal
Contributor

gbinal commented Nov 7, 2014

Thank you for driving the conversation around this issue and helping to assemble the v1.1 metadata update.

There appears to be strong consensus around this issue, which has been accepted in the v1.1 update and merged into Project Open Data. Project Open Data is a living project though. Please continue any conversations around how the schema can be improved with new issues and pull requests!

It's important for government staff as well as the public to continue to collaborate to make the Open Data Policy ever better. Though the v1.1 update is a substantial update, future iterations do not have to be, so whatever your ideas - big or small - please continue to work with this community to improve how government manages and opens its data.

@gbinal gbinal closed this as completed Nov 7, 2014

7 participants