Guidance on defining collections to group datasets released as a series or in fragments #258
👍 Yes this is totally needed. We have a number of datasets at Energy that we would like to consolidate into a single listing rather than have 100s of entries for each year x state. It sounds like the parent/child "collection" concept used by data.gov is somewhat different than the entry/distribution of datasets currently used by the schema. Should the guidance direct people to put "children" datasets in the |
Collections are absolutely needed. I could argue either way on the ideal publication path (many of DOT's (current) data customers are states or cities who just aren't interested in downloading the entire Nation's data file and filtering out their information ... we should serve both). The problems with this are that certain properties will need to filter down to the data file itself. A collection may have a temporal coverage of 1975-present, but an individual file may cover only a single year. A collection may have a geographic coverage of "United States" but a single file may have a geographic coverage of "Alabama." Download URLs will be on a file-by-file basis. Formats might change over time. Data dictionaries may change over time as data elements might be changed. Clear examples where the collections concept is needed include:
I would highly recommend coordinating with the Federal statistical community on how collections might support them. Groups such as the Statistical Community of Practice and Engagement (SCOPE) will have helpful suggestions on how best to implement. |
@dsmorgan77 the problems you described seemed to be well served by the model of having a child parent relationship where both the child and the parent would essentially be a fully qualified entry in a data.json. Does that answer your question @cew821? So no, the children wouldn't be listed under the distribution. Instead the parent wouldn't list any distributions, but it might have a flag indicating that it's a collection, perhaps Dan raises a good point about being able to be more precise about the temporal bounds for each individual file if something is released in fragments by region, but I think it would still be acceptable to package all those together as |
Yes that makes sense. I think your proposal for additional optional data field indicating the parent makes sense (at least on the child record). I would think you would want to use Adopting this approach would also open the possibility for nested parent/child relationships, which I think should be fine (i.e. a record could both be a child of a parent record, and itself be the parent of children records). |
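A minimal sketch of how a consumer could walk such a nested relationship, assuming each record carries its own `identifier` plus an optional reference to its parent. The field name `parentIdentifier` is hypothetical (nothing in this thread has settled on one), and the sample records are made up for illustration.

```python
def ancestors(identifier, records, parent_key="parentIdentifier"):
    """Return the chain of parent identifiers for one record, nearest first.

    `parent_key` is a hypothetical field name, not part of any agreed schema.
    """
    by_id = {r["identifier"]: r for r in records}
    chain = []
    seen = {identifier}
    current = by_id.get(identifier)
    while current is not None and current.get(parent_key):
        parent_id = current[parent_key]
        if parent_id in seen:
            # A record must not end up being its own ancestor.
            raise ValueError(f"cycle detected at {parent_id}")
        seen.add(parent_id)
        chain.append(parent_id)
        # None when the parent is referenced but not listed in the file.
        current = by_id.get(parent_id)
    return chain


# Made-up records: a grandparent, a parent that is also a child, and a leaf.
records = [
    {"identifier": "tiger-all", "title": "All years"},
    {"identifier": "tiger-2010", "title": "2010",
     "parentIdentifier": "tiger-all"},
    {"identifier": "tiger-2010-al", "title": "2010 Alabama",
     "parentIdentifier": "tiger-2010"},
]
chain = ancestors("tiger-2010-al", records)
```

Keeping the reference on the child means a record in the middle of the tree needs no special handling: it is a child through its own reference and a parent through the references of others.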
This is great. I like how the SKOS core guide deals with the "collection" issue: http://www.w3.org/TR/2005/WD-swbp-skos-core-guide-20051102/ - using broader/narrower/associated relationships and meaningful collections of concepts. Works well with the standards that POD schema already employ. In case you haven't seen it, HHS is using these standards for its catalog data.jsonld implementation: |
The parent child issue also presents a different problem for my agency. In our case the parent dataset is often not releasable (PII or other reasons), so only the "derivative" or child dataset is released. We are still obligated to have the parent/internal-only dataset in the directory. The child is often not even stored on the same system (we have hundreds of custom systems for collecting and storing datasets). We are planning on using the unique identifier of the parent as an attribute of the derivative (open) dataset, but are only now considering whether that looks like a concatenation of the parent UI plus a serialization number or whether it's two UIs (either way, displaying as a collection on data.gov seems straightforward). As our internal catalogue will be in the form of a database that will transmit a file in the correct format, I am wondering how best to construct it in order to minimize the record maintenance. If it's a full record, then there will be much redundancy of input, which is a source of errors and loss of linkage, to say nothing of data steward resistance to double inputs. As we consider the parent/child issues, is this a concern of anyone? This is not a big deal with 100 datasets, but we have an estimated 8-10,000. Has anyone sorted this out?
Ranking08, could we solve that kind of problem with entity relationship diagrams (that we could somehow translate into a flat file)? ERDs are a lot like linked data. Similar solution for different problems. Also NIST is building the NIEM. Your agency might be a good candidate for them as they move into beta. |
+1 for collections. In HHS's DMS, each dataset is optionally given a Group Name, which is a string. Any datasets with the same group name are organized together in search results. We can map group names to collections in our data.json file, but we don't have any additional metadata for the group itself so we'd have to make up metadata for the parent dataset. I think it might be simpler to put |
Thanks Josh.. one point in our case is the parent (for instance say it has PII) in most cases, will not be exposed on data.gov , or any other outward facing site, only the child... so the parent record will be in the internal "all dataset" catalogue and not in the public JSON file. Still, the children should all sort together .. so in this case the relationship should be maintained at the child record and not on the parent record... thoughts? |
@raking08: The parent could be a placeholder, I guess? It would be odd to have an identifier to something but the something isn't listed in the file. (Even if it has PII, it could still be listed but with |
interesting thought.. but there is significant resistance to having any exposure of the parent ( consider the rest of the metadata that would be exposed) so I do not think listing the parent would be allowed, but I will pose that to security. But as long as the children would be exposed with their different tags for spatial and temporal and any other specifics then it should be OK.. the placeholder could be interesting if there were say 100 children, but you would still need to be able to drill into the placeholder to get to the specific dataset you wanted. How do you see that working? |
@JoshData I think the children need to have their "parentID" (i.e. the foreign key) in their record, since it's a one -> many relationship between parent and children, no? I suppose you could also include a list of all the children in a |
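To make the foreign-key idea concrete, here is a rough sketch of grouping entries by a parent reference stored on each child. It uses the `parentID` name from the comment above purely as a placeholder, not as an agreed field, and the sample entries are invented. It also reports parents that children point to but that are not listed in the file, which is the situation @raking08 describes.

```python
from collections import defaultdict


def group_by_parent(datasets, parent_key="parentID"):
    """Group child identifiers under the parent they reference.

    Returns (groups, unlisted), where `unlisted` holds parent identifiers
    that children point to but that do not appear as entries themselves.
    `parent_key` is a placeholder name, not an agreed schema field.
    """
    listed = {d["identifier"] for d in datasets}
    groups = defaultdict(list)
    for d in datasets:
        parent = d.get(parent_key)
        if parent:
            groups[parent].append(d["identifier"])
    unlisted = sorted(p for p in groups if p not in listed)
    return dict(groups), unlisted


# Invented entries: one public parent, one parent withheld from the file.
datasets = [
    {"identifier": "crashes", "title": "Crash data, all states"},
    {"identifier": "crashes-al", "title": "Crash data, Alabama",
     "parentID": "crashes"},
    {"identifier": "crashes-ak", "title": "Crash data, Alaska",
     "parentID": "crashes"},
    {"identifier": "survey-public", "title": "Survey, public extract",
     "parentID": "survey-internal"},
]
groups, unlisted = group_by_parent(datasets)
```

Because the link lives on the child, the children of a withheld parent still sort together; whether the parent must also appear as a placeholder is a separate policy question.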
@raking08 I agree with the placeholder idea - even if all the record contained was a |
Hi CEW, |
@raking08 I don't see the problem with your particular instance of mentioning a dataset that has PII in your public data listing. I know it's optional, but the public already has notice that there is a system that collects data containing PII. How? Because your agency is publishing Privacy Impact Assessments and System of Records Notices telling them just that. In short, the public already knows. And, the notion that the minimum required metadata is somehow "too much" metadata is kind of ludicrous. The title, description, and contact point (for a privacy-sensitive dataset, that'd just be the redress point of contact), and agency identifiers are all innocuous and, again, already made public when you release a PIA and a SORN. |
Dear dsmorgan77, |
@raking08 and other interested parties, it could be helpful to have an offline discussion about the enterprise inventory and metadata for agencies that are facing higher barriers to exposing their metadata for restricted datasets. It would be helpful to discuss good/best practices, lessons learned, etc. Ping me (first.last@hhs.gov) via your .gov email address, if you would like to join an informal conf call discussion. |
We've been working on this for a little while and have determined that parent-child relationships force the user to navigate in a hierarchical fashion when often data is related in a multidimensional 'hyper-cube' of attributes / parameters. In other words, I might be interested in other data based on any one of a number of dimensions present in that dataset (e.g., who: subject, where: location, when: years covered, etc.). My personal take away from all the metadata modeling work that I've been doing for the past year is that it's in order to develop / discover a common metadata standard (potentially versioned to accommodate change) that can adequately describe all relevant datasets so that a single nuanced difference (e.g., as described by Philip at the beginning of this thread) can serve to both filter out all non-relevant datasets while grouping the most closely related datasets in a manner that enables the relationship to be implied by the proximity of two or more datasets to each other in the results of a given query. |
It sounds like there might be broader interpretations of a collection than what I originally had in mind. What prompted this (and what might be a more solvable initial scope to approach here) are collections that are made up of nothing more than subsets of a master dataset. This means there are no transformations or other alterations going from the parent to the child other than excluding the rows that don't fit into the subset. In other words, the parent and children should have the same schema and column headings and you should be able to aggregate or concatenate all children to represent the parent. I don't think this has to be a strict requirement, but it would be helpful to be clear and consistent about what's implied by the relationship of collections. Some examples of this are the portions of TIGER Line data that are not available as a national file and require you to download all subsets and merge them together if you want national coverage. You could also argue that TIGER data is released as a time series with each date of collection and publication so it would also make sense to put each year as a child of a parent for all historical TIGER data. While this would extend the family tree another level, I think you could still indicate that relationship without it becoming a strict hierarchy that users are forced to navigate. If you were to aggregate all releases of TIGER data as a time series it's pretty clear that you couldn't aggregate those all into one master file because of changes in the schema and format of the data from year to year, but I think it would still be valuable to indicate they're all part of the same overall collection. 
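As an illustration of the strict "subset of a master dataset" case, the following sketch checks that every child file shares the same column headings and then concatenates the rows to rebuild the parent. The CSV contents and the function name `merge_children` are invented for the example; real files would of course be read from their download URLs.

```python
import csv
import io


def merge_children(child_csvs):
    """Concatenate child CSV texts into one table.

    Raises ValueError if any child has different column headings, since
    the children would then not be plain subsets of one master dataset.
    """
    header = None
    rows = []
    for text in child_csvs:
        reader = csv.reader(io.StringIO(text))
        child_header = next(reader)
        if header is None:
            header = child_header
        elif child_header != header:
            raise ValueError(f"header mismatch: {child_header} != {header}")
        rows.extend(reader)
    return header, rows


# Invented per-state files that share a schema.
alabama = "state,year,value\nAL,2010,1\n"
alaska = "state,year,value\nAK,2010,2\n"
header, rows = merge_children([alabama, alaska])
```

A time series whose schema changes from year to year would fail this check, which matches the distinction drawn above: it can still be marked as one collection, but it cannot be merged into a single master file.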
One way to distinguish between these different ways of defining collections would be that if the collection is comprised of nothing more than subsets of a master dataset, then the parent of the collection should also provide a merged file of everything in the collection as a listing in the Although it would probably be better to be explicit about the relationship if we're allowing a variety of different kinds.
Ah, I see. So, in this case, would we be talking about a series rather than a parent:child relationship? |
To date I think the only way we've discussed identifying something as being part of a series is here in this issue as being part of a collection, so yes, as a parent:child relationship. That said, there is some nuance about what makes a series a series. If the nature of the way the data is collected is slightly different or the structure of the data is slightly different then maybe it's not part of the same series in the most strict technical sense, but from a more intuitive perspective it might still fit that description. More often what I see is something that is collected on an ongoing basis and released as monthly files. In that case the structure of the data is exactly the same and the files are only being released by the month to make them smaller or so that people don't have to download a larger annual file or whole master file just to get the most recent update. In that case, the "series" would easily fit the definition of a collection just being comprised of strict subsets of a master file. |
Got it. As @lilybradley mentioned, Dublin Core might have some useful features in this case: http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=terms |
I'm in favor of this as an optional field. It seems like next we need someone to articulate an exact vision for what this would actually look like in practice. |
A generalized A registry of Link Relationship Types is maintained by IANA. That registry includes the |
+1 on a 'link' or 'relatedResources' array property |
I'm going to suggest the Dublin Core
👍 for Other questions about collections:
Should we have a discussion about which schema elements need to be transmitted on the parent & on the child datasets? |
I think the I think some of the explicitness of the metadata can be up to the discretion of the metadata publisher, but redundancy should be ok. My recommendation is that more attention should be given to the title and description on the parent, but since most of the resources will be listed on the children and since they could be accessed directly, they should have meaningful titles and descriptions as well. The minimum required fields would be the same for both parent and child. Here's an example of what this would look like in a full data.json
This is addressed in 8eb32ce |
realizing I'm late in the discussion, I suggest looking at the UML diagrams over at NOAA's Metadata Wiki and particularly the discussion of metadata hierarchies. |
@mhogeweg is there anything in particular that you think should be applied here? |
The diagrams on the NOAA site show various options for aggregating datasets in an abstract type 'DS_Aggregate' that has some defined subtypes that reflect different types of aggregations (production series, like the USGS quadsheets), sensors (like Landsat), and others.
Thanks @mhogeweg. I think it might be best to start with something simple for now, but we can certainly look at making this more sophisticated in the future based on use and needs. Based on the proposal with Since we've addressed the most basic requirement, I think it's fair to say that we have provided guidance on defining simple collections and can close this. However, I realize there is interest in expanding this capability either for more complex hierarchies or for other kinds of relationships than the child/parent subset/master relationship I described above. I'd suggest using a separate issue either for the more complex collection hierarchy or for link relations - which would be a much broader set of use cases and might make more sense to apply to the distributions. For some other discussion of link relations, see #380 and #332 (comment) |
that sounds fine. in reality there are few use cases for collections within collections all the way down... |
Thanks everyone for working on this issue, including on the changes which have been accepted in the v1.1 update and merged into Project Open Data. Project Open Data is a living project though. Please continue any conversations around how the schema can be improved with new issues and pull requests! |
There was also interest in providing a field to reference a source dataset that a dataset was derived from. Since that's a separate use case from the primary way we've defined collections here, I've gone ahead and opened a new issue for that. See #393 |
Data.gov has had the notion of a "collection" that can be used to group multiple datasets that would logically be considered a single dataset, but have been released in separate parts. The most common scenario for this is a series of releases over time. In some cases a dataset may be published in monthly or yearly releases, but if the only thing that distinguishes these is the date, then they should really be packaged as a single dataset. This also makes browsing simpler: it prevents many similar datasets from crowding out more unique ones. Some datasets might also be published by location, such as data relating to each state being released as a separate file. These should also be grouped together to appear as a single dataset.
Ideally agencies should package these all together as a single file/release before publishing (e.g., one file that is continuously updated is preferable to separate releases over time), but at the very least there should be a way to define this kind of packaged grouping at the metadata level, as is currently the case on data.gov.
The way data.gov handles this is that the collection is essentially treated just as a normal dataset entry but it refers to many child entries. Something similar could be done with the data.json schema, but we would need to establish a convention for defining that parent/child relationship between entries.
Here's a current example of a collection on data.gov
View of the collection "parent" metadata:
http://catalog.data.gov/dataset/tiger-line-shapefile-2010-series-information-file-for-the-2010-census-block-state-ba
View of all its "child" datasets:
http://catalog.data.gov/dataset?collection_package_id=2a8b7f0b-1ae5-453c-ba56-996547266a63