Articulate how to specify unpublished datasets #157
Comments
I agree we could make this clearer in the documentation. However, I'm not in favor of adding a manual indicator of whether or not a public dataset has been released yet, because I think that information will get stale: people will upload files and forget to change the indicator, etc.
Right, that was basically my concern with trying to reuse an existing field for that purpose. Then again, the concern about not updating information could apply to every field and have bad consequences. Whatever convention we decide to document, it would be great to reflect it in a report generated by a validator. If this is a dataset that you access by a URL (isn't that everything?) then "accessURL" is normally a required field.
With the blank "accessURL" convention: if a dataset was entered and all the information was correct, including the "modified" or "issued" date, but the "accessURL" was accidentally left out, the validator would convey that it interpreted that entry as "unpublished". However, if a dataset was entered and the "accessURL" was included and valid but "modified" and "issued" were left blank (or in the future), the validator would acknowledge that the dataset was published but throw an error for the missing required "modified" field, and maybe the entry wouldn't even be ingested into the main data.gov catalog because it wasn't complete or well-formed enough.
Are there any situations where a dataset could be published and not have an "accessURL"? I don't think so, but wanted to check. Maybe the reason this field isn't considered part of the “Common Core” Required Fields is that it should always exist except for entries with unpublished data.
Update: one instance where "accessURL" might be blank is if "webService" is used instead, so this convention would likely need to apply to both of those fields being blank.
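To make the validator behaviour described above concrete, here is a minimal Python sketch. It assumes each catalog entry is a plain dict keyed by the schema field names mentioned in this thread ("accessURL", "webService", "modified"); the function names `is_blank` and `interpret_entry` are made up for illustration and are not part of any existing Project Open Data tool.

```python
def is_blank(value):
    """Treat a missing key (None) or a whitespace-only string as blank."""
    return value is None or str(value).strip() == ""


def interpret_entry(entry):
    """Return (status, errors) for one catalog entry given as a dict.

    Assumed convention from the discussion: an entry with both
    "accessURL" and "webService" blank is read as "unpublished".
    """
    errors = []
    has_access = not (
        is_blank(entry.get("accessURL")) and is_blank(entry.get("webService"))
    )
    status = "published" if has_access else "unpublished"
    # "modified" is described in the thread as a required field.
    if is_blank(entry.get("modified")):
        errors.append("missing required field: modified")
    return status, errors
```

A validator report could then print the status next to each entry, so an "accessURL" that was left out by accident shows up as "unpublished" rather than passing silently.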
Perhaps dangerous, depending on client logic, but you could potentially draw a distinction between a |
That is another option, but it does feel even a bit more subtle and brittle. I think it's safest to equate missing fields with After thinking this through a little more: if we were to follow a blank/null convention like was being discussed with |
Hey folks,
Correct.
Also correct. This is why both fields are Required if Applicable and not Required. FWIW, I agree that adding a field just for this would cause as many problems as it helps. Also, the schema is locked at v1.0 for the immediate future, so our best move would be providing clarification in the documentation.
@philipashlock, would you be game to take a crack at a pull request that modifies the documentation to more clearly explain this?
Some observations:
Agreed, @gbinal. Let's not add a new field in the next version, but update the guidance to clarify. @philipashlock, assigning this to you to take a pass at updated documentation that explains this more clearly, if that's cool!
You could have a convention that, to be valid with missing "accessURL" and "webService" fields, the "issued" date MUST also be in the future (and further, issue a warning if that date is too far in the future). The only other valid combination would be to have a URL and the date be in the past. This would guard against most of the potential mistakes mentioned upthread.
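The rule proposed above could be checked roughly as follows. This is only a sketch under stated assumptions: entries are dicts, "issued" is an ISO `YYYY-MM-DD` string, and "too far in the future" is taken to mean more than 365 days, a threshold picked purely for illustration. The function name `check_release_convention` is hypothetical.

```python
from datetime import date, timedelta


def check_release_convention(entry, today, max_lead=timedelta(days=365)):
    """Return (errors, warnings) for the proposed URL/issued-date rule."""
    errors, warnings = [], []
    # A URL is present if either field holds a non-blank string.
    has_url = any(
        (entry.get(key) or "").strip() for key in ("accessURL", "webService")
    )
    raw_issued = (entry.get("issued") or "").strip()
    if not raw_issued:
        errors.append("issued is missing")
        return errors, warnings
    issued = date.fromisoformat(raw_issued)  # assumes YYYY-MM-DD
    if has_url:
        # Published: the issued date must not lie in the future.
        if issued > today:
            errors.append("has a URL but issued is in the future")
    else:
        # Unpublished: the issued date must lie in the future.
        if issued <= today:
            errors.append("no accessURL or webService but issued is not in the future")
        elif issued - today > max_lead:
            warnings.append("issued is more than max_lead in the future")
    return errors, warnings
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the check reproducible when a catalog is re-validated later.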
This is addressed by 26a9326
@gbinal: the change is good, but I'm not sure it fully addresses the issue. I think we still have a problem where the |
Is |
@webmaven - the dataset. @dsmorgan77 - that's why I believe it would still be relevant for unpublished data. The field would be applicable to any dataset that exists.
Yeah, which would allow organizing around identifying particular high-value unpublished datasets and organizing to get them published. I like it.
Thank you for driving the conversation around this issue and helping to assemble the v1.1 metadata update. There appears to be strong consensus around this issue, which has been accepted in the v1.1 update and merged into Project Open Data. Project Open Data is a living project though. Please continue any conversations around how the schema can be improved with new issues and pull requests! It's important for government staff as well as the public to continue to collaborate to make the Open Data Policy ever better. Though the v1.1 update is a substantial update, future iterations do not have to be, so whatever your ideas - big or small - please continue to work with this community to improve how government manages and opens its data.
The Policy clearly states that the public data listing "should include datasets that can be made publicly available but have not yet been released." (III > 3 > b) but the documentation does not clearly specify how to denote this.
One convention that has been discussed is to simply leave the accessURL field blank, but I think this makes it harder to do basic QA and validation. It would be better if there were a more explicit way of denoting an unpublished dataset. While perhaps equally imperfect, two other options within the schema would be to use an established convention within the 'accrualPeriodicity' or 'issued' fields, but those already have fairly strict conventions. I suppose the 'issued' field could be a date in the future, but that seems like it could easily be misused or not properly updated. It seems to me that ideally this would be another field among the “Common Core” Required Fields, but it's probably too late for that.
In any case, even if a blank accessURL is the best option, that should be documented somewhere.