This repository has been archived by the owner on Jun 18, 2024. It is now read-only.

Guidance on defining collections to group datasets released as a series or in fragments #258

Closed
philipashlock opened this issue Jan 27, 2014 · 37 comments

Comments

@philipashlock
Contributor

Data.gov has had the notion of a "collection" that can be used to group multiple datasets that would logically be considered a single dataset, but have been released in separate parts. The most common scenario for this is a series of releases over time. In some cases a dataset may be published in monthly or yearly releases, but if the only thing that distinguishes these is date, then they should really be packaged as a single dataset. This also makes browsing simpler: it prevents many similar datasets from crowding out more unique ones. Some datasets might also be published by location, such as data relating to each state being released as a separate file. These should also be grouped together to appear as a single dataset.

Ideally agencies should package these all together as a single file/release before publishing (e.g., one file that is continuously updated is preferable to separate releases over time), but at the very least there should be a way to define this kind of packaged grouping at the metadata level, as is currently the case on data.gov.

The way data.gov handles this is that the collection is essentially treated just as a normal dataset entry but it refers to many child entries. Something similar could be done with the data.json schema, but we would need to establish a convention for defining that parent/child relationship between entries.

Here's a current example of a collection on data.gov

View of the collection "parent" metadata:
http://catalog.data.gov/dataset/tiger-line-shapefile-2010-series-information-file-for-the-2010-census-block-state-ba

View of all its "child" datasets:
http://catalog.data.gov/dataset?collection_package_id=2a8b7f0b-1ae5-453c-ba56-996547266a63

@cew821

cew821 commented Jan 28, 2014

👍 Yes this is totally needed. We have a number of datasets at Energy that we would like to consolidate into a single listing rather than have 100s of entries for each year x state.

It sounds like the parent/child "collection" concept used by data.gov is somewhat different than the entry/distribution of datasets currently used by the schema. Should the guidance direct people to put "children" datasets in the distribution array, or is something else needed?

@dsmorgan77
Contributor

Collections are absolutely needed. I could argue either way on the ideal publication path (many of DOT's (current) data customers are states or cities who just aren't interested in downloading the entire Nation's data file and filtering out their information ... we should serve both).

The problems with this are that certain properties will need to filter down to the data file itself. A collection may have a temporal coverage of 1975-present, but an individual file may cover only a single year. A collection may have a geographic coverage of "United States" but a single file may have a geographic coverage of "Alabama." Download URLs will be on a file-by-file basis. Formats might change over time. Data dictionaries may change over time as data elements might be changed.

Clear examples where the collections concept is needed include:

I would highly recommend coordinating with the Federal statistical community on how collections might support them. Groups such as the Statistical Community of Practice and Engagement (SCOPE) will have helpful suggestions on how best to implement.

@philipashlock
Contributor Author

@dsmorgan77 the problems you described seemed to be well served by the model of having a child parent relationship where both the child and the parent would essentially be a fully qualified entry in a data.json. Does that answer your question @cew821?

So no, the children wouldn't be listed under the distribution. Instead the parent wouldn't list any distributions, but it might have a flag indicating that it's a collection, perhaps "collection":true or something and then each child would point to the parent with something like "collectionID":"uniqueid-12345"
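
A minimal sketch of that idea in Python, assuming the two field names floated above (`collection` and `collectionID`) and made-up identifiers and titles. None of this is an agreed schema; it only shows how a consumer could reassemble a collection from flat entries.

```python
# Hypothetical data.json entries. "collection" and "collectionID" are the
# field names suggested in the comment above, not part of a published schema.
catalog = [
    {"identifier": "uniqueid-12345", "title": "Example series", "collection": True},
    {"identifier": "uniqueid-12346", "title": "Example series, 2012",
     "collectionID": "uniqueid-12345"},
    {"identifier": "uniqueid-12347", "title": "Example series, 2013",
     "collectionID": "uniqueid-12345"},
    {"identifier": "uniqueid-99999", "title": "Unrelated dataset"},
]


def children_of(entries, parent_id):
    """Return the entries whose collectionID points at parent_id."""
    return [e for e in entries if e.get("collectionID") == parent_id]


# Map each flagged parent to the identifiers of its children.
collections = {
    e["identifier"]: [c["identifier"] for c in children_of(catalog, e["identifier"])]
    for e in catalog
    if e.get("collection")
}
print(collections)
```

Note that the parent carries no distribution here; the files stay on the child entries.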

Dan raises a good point about being able to be more precise about the temporal bounds for each individual file if something is released in fragments by region, but I think it would still be acceptable to package all those together as distributions under one dataset. This would still allow people to download individual files. You could argue that any dataset could be sliced into smaller pieces with metadata created for each piece, but some things are already logically packaged as a dataset, so I don't know that we really always need to create extra metadata for subsets like that.

@cew821

cew821 commented Jan 28, 2014

Yes that makes sense. I think your proposal for additional optional data field indicating the parent makes sense (at least on the child record). I would think you would want to use identifier as the foreign key for the collectionID field (or maybe parentID would be a better name?). I'm not sure you would need to add a field to the "parent" record, but I suppose having some indicator that there are children records to go look for could be useful.

Adopting this approach would also open the possibility for nested parent/child relationships, which I think should be fine (i.e. a record could both be a child of a parent record, and itself be the parent of children records).

@lilybradley

This is great. I like how the SKOS core guide deals with the "collection" issue: http://www.w3.org/TR/2005/WD-swbp-skos-core-guide-20051102/ - using broader/narrower/associated relationships and meaningful collections of concepts.

Works well with the standards that the POD schema already employs.

In case you haven't seen it, HHS is using these standards for its catalog data.jsonld implementation:
    "@context": {
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "dcterms": "http://purl.org/dc/terms/",
        "dcat": "http://www.w3.org/ns/dcat#",
        "foaf": "http://xmlns.com/foaf/0.1/",
        "pod": "http://project-open-data.github.io/schema/2013-09-20_1.0#"
    },

@raking08

The parent/child issue also presents a different problem for my agency. In our case the parent dataset is often not releasable (PII or other reasons), so only the "derivative" or child dataset is released. We are still obligated to have the parent/internal-only dataset in the directory. The child is often not even stored on the same system (we have hundreds of custom systems for collecting and storing datasets). We are planning on using the unique identifier of the parent as an attribute of the derivative (open) dataset, but are only now considering whether that looks like a concatenation of the parent UI plus a serialization number or two separate UIs (either way, displaying as a collection on data.gov seems straightforward). As our internal catalogue will be in the form of a database that will transmit a file in the correct format, I am wondering how best to construct it in order to minimize the record maintenance. If it's a full record, then there will be much redundancy of input, which is a source of errors and loss of linkage, to say nothing of data steward resistance to double inputs. As we consider the parent/child issues, is this a concern of anyone? This is not a big deal with 100 datasets, but we have an estimated 8,000-10,000. Has anyone sorted this out?

@lilybradley

@raking08, could we solve that kind of problem with entity relationship diagrams (that we could somehow translate into a flat file)? ERDs are a lot like linked data. Similar solution for different problems.

Also NIST is building the NIEM. Your agency might be a good candidate for them as they move into beta.

@JoshData
Contributor

+1 for collections.

In HHS's DMS, each dataset is optionally given a Group Name, which is a string. Any datasets with the same group name are organized together in search results. We can map group names to collections in our data.json file, but we don't have any additional metadata for the group itself so we'd have to make up metadata for the parent dataset.

I think it might be simpler to put collection: ['childID', 'childID'] on the parent, and then there's no need for is_collection: true on the parent.

@raking08

Thanks Josh. One point in our case is that the parent (say it has PII) will, in most cases, not be exposed on data.gov or any other outward-facing site, only the child. So the parent record will be in the internal "all dataset" catalogue and not in the public JSON file. Still, the children should all sort together, so in this case the relationship should be maintained at the child record and not on the parent record. Thoughts?

@JoshData
Contributor

@raking08: The parent could be a placeholder, I guess? It would be odd to have an identifier to something but the something isn't listed in the file. (Even if it has PII, it could still be listed but with accessLevel=restricted or nonpublic I think?)

@raking08

Interesting thought, but there is significant resistance to having any exposure of the parent (consider the rest of the metadata that would be exposed), so I do not think listing the parent would be allowed, but I will pose that to security. As long as the children would be exposed with their different tags for spatial and temporal and any other specifics, then it should be OK. The placeholder could be interesting if there were, say, 100 children, but you would still need to be able to drill into the placeholder to get to the specific dataset you wanted. How do you see that working?

@cew821

cew821 commented Jan 30, 2014

@JoshData I think the children need to have their "parentID" (i.e. the foreign key) in their record, since it's a one -> many relationship between parent and children, no? I suppose you could also include a list of all the children in a collection: [child1_id, child2_id] array in the parent record, as a convenience, but I would definitely expect the children records to have a parent: parent_id field.
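
If both sides were recorded, a publisher would probably want to check that they agree. The sketch below assumes the field names used in this comment (`parent` on the child, `collection` on the parent) plus made-up identifiers; it treats the child-side key as authoritative and reports where the parent's convenience list drifts from it.

```python
def check_collection_links(entries):
    """Compare each parent's `collection` list with the children's `parent` keys.

    Returns tuples of (parent_id, children_missing_from_parent_list,
    listed_ids_with_no_matching_child).
    """
    children_by_parent = {}
    for entry in entries:
        if "parent" in entry:
            children_by_parent.setdefault(entry["parent"], set()).add(entry["identifier"])

    problems = []
    for entry in entries:
        if "collection" in entry:
            declared = set(entry["collection"])
            actual = children_by_parent.get(entry["identifier"], set())
            if declared != actual:
                problems.append(
                    (entry["identifier"], sorted(actual - declared), sorted(declared - actual))
                )
    return problems


# Hypothetical records: the parent's list has fallen behind by one child.
entries = [
    {"identifier": "p1", "collection": ["c1", "c2"]},
    {"identifier": "c1", "parent": "p1"},
    {"identifier": "c2", "parent": "p1"},
    {"identifier": "c3", "parent": "p1"},
]
problems = check_collection_links(entries)
print(problems)
```

The drift shown here is the usual argument for keeping only the child-side key: adding a child then never requires touching the parent record.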

@cew821

cew821 commented Jan 30, 2014

@raking08 I agree with the placeholder idea - even if all the record contained was a title, a unique_id that matched with the children's parent_id, and an accessLevel: non-public field, that would probably be enough for the public to understand that the children datasets were a public subset of this larger, non-public dataset.
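
A rough sketch of how such placeholders could be generated when exporting the public file. The `parent_id` key, the identifiers, and the placeholder title are all assumptions made for illustration; only the three fields mentioned above are emitted.

```python
# Hypothetical public entries whose parent lives only in an internal catalogue.
public_entries = [
    {"identifier": "child-1", "title": "Public extract A", "parent_id": "internal-42"},
    {"identifier": "child-2", "title": "Public extract B", "parent_id": "internal-42"},
]


def missing_parents(entries):
    """Return parent identifiers that are referenced but not listed."""
    listed = {e["identifier"] for e in entries}
    referenced = {e["parent_id"] for e in entries if "parent_id" in e}
    return sorted(referenced - listed)


def make_placeholder(parent_id, title):
    """Build the minimal record: identifier, title, and a non-public access level."""
    return {"identifier": parent_id, "title": title, "accessLevel": "non-public"}


for parent_id in missing_parents(public_entries):
    public_entries.append(
        make_placeholder(parent_id, "Internal source dataset (not released)")
    )
```

Whether even this much can be exposed is a policy question for the publishing agency, not something the code can settle.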

@raking08

Hi CEW,
Those are my thoughts too. In fact, I am modeling this in separate tables so that the child can inherit characteristics of the parent without needing double data maintenance, but still output the flat file to the public exposure site. Due to additional business requirements for data lifecycle management, we may have an additional layer below the child as well in the internal repository, but it won't be relevant to the public site. I would be very curious to know if anyone has an internal system that combines managing the core metadata for all datasets (internal and external) together with data lifecycle management functions.

@dsmorgan77
Contributor

@raking08 I don't see the problem with your particular instance of mentioning a dataset that has PII in your public data listing. I know it's optional, but the public already has notice that there is a system that collects data containing PII. How? Because your agency is publishing Privacy Impact Assessments and System of Records Notices telling them just that. In short, the public already knows.

And, the notion that the minimum required metadata is somehow "too much" metadata is kind of ludicrous. The title, description, and contact point (for a privacy-sensitive dataset, that'd just be the redress point of contact), and agency identifiers are all innocuous and, again, already made public when you release a PIA and a SORN.

@raking08

Dear dsmorgan77,
Thank you for your input; however, I would caution you to not judge others so quickly when you do not have the specifics.
While I don't disagree with your initial assertion that the public knows, it is not my decision (nor yours) if this metadata is to be exposed. It has been made clear to me that certain internal parent datasets will not be exposed even for the minimal metadata, but their children will be. In fact, they do have a very reasonable rationale for this position, but that is not for discussion in this forum.
Others are pointing to a workable solution which I appreciate.

@lilybradley

@raking08 and other interested parties, it could be helpful to have an offline discussion about the enterprise inventory and metadata for agencies that are facing higher barriers to exposing their metadata for restricted datasets. It would be helpful to discuss good/best practices, lessons learned, etc. Ping me (first.last@hhs.gov) via your .gov email address, if you would like to join an informal conf call discussion.

@ghost

ghost commented Feb 3, 2014

We've been working on this for a little while and have determined that parent-child relationships force the user to navigate in a hierarchical fashion when often data is related in a multidimensional 'hyper-cube' of attributes / parameters. In other words, I might be interested in other data based on any one of a number of dimensions present in that dataset (e.g., who: subject, where: location, when: years covered, etc.).

My personal takeaway from all the metadata modeling work that I've been doing for the past year is that what's in order is to develop / discover a common metadata standard (potentially versioned to accommodate change) that can adequately describe all relevant datasets. A single nuanced difference (e.g., as described by Philip at the beginning of this thread) can then serve both to filter out all non-relevant datasets and to group the most closely related datasets, so that the relationship is implied by the proximity of two or more datasets to each other in the results of a given query.

@philipashlock
Contributor Author

It sounds like there might be broader interpretations of a collection than what I originally had in mind.

What prompted this (and what might be a more solvable initial scope to approach here) are collections that are made up of nothing more than subsets of a master dataset. This means there are no transformations or other alterations going from the parent to the child other than excluding the rows that don't fit into the subset. In other words, the parent and children should have the same schema and column headings and you should be able to aggregate or concatenate all children to represent the parent.
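
As a small illustration of that "strict subset" property, the sketch below concatenates child files into the parent and refuses to do so if the column headings differ. The CSV contents and the `merge_children` helper are invented for the example.

```python
import csv
import io


def merge_children(child_csv_texts):
    """Concatenate CSV texts that share one header row into a single CSV text."""
    header = None
    rows = []
    for text in child_csv_texts:
        reader = csv.reader(io.StringIO(text))
        this_header = next(reader)
        if header is None:
            header = this_header
        elif this_header != header:
            # Different columns mean this is not a plain subset of one master file.
            raise ValueError("children do not share a schema")
        rows.extend(reader)
    if header is None:
        raise ValueError("no child files given")

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Two made-up state-level children with identical column headings.
alabama = "state,year,value\nAL,2010,1\n"
alaska = "state,year,value\nAK,2010,2\n"
merged = merge_children([alabama, alaska])
print(merged)
```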

I don't think this has to be a strict requirement, but it would be helpful to be clear and consistent about what's implied by the relationship of collections.

Some examples of this are the portions of TIGER Line data that are not available as a national file and require you to download all subsets and merge them together if you want national coverage.

You could also argue that TIGER data is released as a time series with each date of collection and publication so it would also make sense to put each year as a child of a parent for all historical TIGER data. While this would extend the family tree another level, I think you could still indicate that relationship without it becoming a strict hierarchy that users are forced to navigate.

If you were to aggregate all releases of TIGER data as a time series it's pretty clear that you couldn't aggregate those all into one master file because of changes in the schema and format of the data from year to year, but I think it would still be valuable to indicate they're all part of the same overall collection.

One way to distinguish between these different ways of defining collections would be that if the collection is composed of nothing more than subsets of a master dataset, then the parent of the collection should also provide a merged file of everything in the collection as a listing in the distribution.

Although it would probably be better to be explicit about the relationship if we're allowing a variety of different kinds.

@ghost

ghost commented Feb 3, 2014

Ah, I see. So, in this case, would we be talking about a series rather than a parent:child relationship?

@philipashlock
Contributor Author

To date I think the only way we've discussed identifying something as being part of a series is here in this issue as being part of a collection, so yes, as a parent:child relationship.

That said, there is some nuance about what makes a series a series. If the nature of the way the data is collected is slightly different or the structure of the data is slightly different then maybe it's not part of the same series in the most strict technical sense, but from a more intuitive perspective it might still fit that description.

More often what I see is something that is collected on an ongoing basis and released as monthly files. In that case the structure of the data is exactly the same and the files are only being released by the month to make them smaller or so that people don't have to download a larger annual file or whole master file just to get the most recent update. In that case, the "series" would easily fit the definition of a collection just being comprised of strict subsets of a master file.

@ghost

ghost commented Feb 4, 2014

Got it. As @lilybradley mentioned, Dublin Core might have some useful features in this case:

http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=terms

@lilybradley

More specifically, I find the diagrams below helpful. While I understand it is not a currently practical priority, these diagrams also convey how data.gov will/could interact with the semantic web via DBpedia.

[Images: ex-lab-col-rel, ex-ord-col]

@philipashlock philipashlock added this to the Next Version of Common Core Metadata Schema milestone Apr 14, 2014
@philipashlock philipashlock modified the milestone: Next Version of Common Core Metadata Schema (1.0 -> 1.1.) Jul 24, 2014
@gbinal
Contributor

gbinal commented Jul 24, 2014

I'm in favor of this as an optional field. It seems like next we need someone to articulate an exact vision for what this would actually look like in practice.

@katucker

A generalized link array property could handle the simple parent-child relationship but retain the flexibility to support other relationships between datasets. RFC 5988 describes the link property from an HTML, HTTP and Atom perspective. A similar approach could be used in the Common Core Metadata Schema.

A registry of Link Relationship Types is maintained by IANA. That registry includes the item relation type for a parent dataset to reference all the children, and the up relation type to reference the parent from the child. It also contains next, prev or previous, start, first, and last relation types to navigate datasets at the child level.
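
To make the suggestion concrete, here is a sketch of what a generalized link array could look like on two entries. The `links` property name and its `rel`/`href` keys are assumptions (the schema has no such property); the relation names `item`, `up`, and `prev` are the registered ones mentioned above.

```python
# Hypothetical entries: the parent points down with "item", the child points
# up with "up" and sideways with "prev".
parent = {
    "identifier": "series",
    "links": [
        {"rel": "item", "href": "series-2012"},
        {"rel": "item", "href": "series-2013"},
    ],
}
child = {
    "identifier": "series-2013",
    "links": [
        {"rel": "up", "href": "series"},
        {"rel": "prev", "href": "series-2012"},
    ],
}


def linked(entry, rel):
    """Return the targets of every link on `entry` with the given relation type."""
    return [link["href"] for link in entry.get("links", []) if link["rel"] == rel]


print(linked(parent, "item"))
print(linked(child, "up"))
```

The same array could later carry other relation types without another schema change, which is the flexibility argued for here.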

@smrgeoinfo
Contributor

+1 on a 'link' or 'relatedResources' array property

@philipashlock
Contributor Author

I'm going to suggest the Dublin Core isPartOf property on each child dataset referencing the identifier of the parent. This property is also used by schema.org. Dublin Core defines it as:

A related resource in which the described resource is physically or logically included.

isPartOf would only be used for datasets that are subsets of the larger collection. For anything that is derived or transformed from another source dataset, I would recommend the Dublin Core source property referencing the identifier of the source. Dublin Core defines it as:

The described resource may be derived from the related resource in whole or in part. Recommended best practice is to identify the related resource by means of a string conforming to a formal identification system.
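
A minimal sketch of how a consumer might group entries by `isPartOf`. The identifiers and titles are invented, and the top-level `dataset` array is assumed as the catalog layout.

```python
import json
from collections import defaultdict

# Made-up catalog: one parent, two children pointing at it, one standalone entry.
DATA_JSON = """
{
  "dataset": [
    {"identifier": "tiger-2010", "title": "Example parent collection"},
    {"identifier": "tiger-2010-al", "title": "Example child, Alabama",
     "isPartOf": "tiger-2010"},
    {"identifier": "tiger-2010-ak", "title": "Example child, Alaska",
     "isPartOf": "tiger-2010"},
    {"identifier": "standalone", "title": "Dataset outside any collection"}
  ]
}
"""


def group_by_parent(catalog):
    """Map each parent identifier to the identifiers of entries that are part of it."""
    groups = defaultdict(list)
    for entry in catalog["dataset"]:
        parent_id = entry.get("isPartOf")
        if parent_id is not None:
            groups[parent_id].append(entry["identifier"])
    return dict(groups)


groups = group_by_parent(json.loads(DATA_JSON))
print(groups)
```

Nothing is added to the parent entry; it is recognized as a collection only because other entries reference its identifier.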

@dsmorgan77
Contributor

👍 for isPartOf ... do we need to have the hasPart property on the parent dataset?

Other questions about collections:

  • In a collection, certain properties will be a little repetitive (title, description, keywords, publisher, contact information, bureau code, program code, license). Should these properties live at the parent and be inherited by the child datasets?
  • Likewise, some child dataset properties could roll up the parent (earliest temporal reference & latest temporal reference in child datasets could be rolled up to parent dataset temporal property).
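
Both directions can be sketched in a few lines. This assumes `temporal` is written as `start/end` with dates in the same `YYYY-MM-DD` form (so that string order matches date order) and picks an arbitrary set of inherited fields; neither rule is something the schema prescribes.

```python
# Fields a child would take from its parent when it does not set them itself.
# This particular list is an assumption made for the example.
INHERITED = ("publisher", "license", "keyword")


def with_inherited(parent, child):
    """Return the child with any missing inherited fields copied from the parent."""
    merged = {key: parent[key] for key in INHERITED if key in parent}
    merged.update(child)  # values set on the child always win
    return merged


def roll_up_temporal(children):
    """Return the earliest start and latest end across the children's intervals."""
    starts, ends = [], []
    for child in children:
        start, end = child["temporal"].split("/")
        starts.append(start)
        ends.append(end)
    return f"{min(starts)}/{max(ends)}"


parent = {"identifier": "series", "publisher": "Example Agency", "license": "example-license"}
children = [
    {"identifier": "series-2011", "isPartOf": "series", "temporal": "2011-01-01/2011-12-31"},
    {"identifier": "series-2010", "isPartOf": "series", "temporal": "2010-01-01/2010-12-31"},
]

parent["temporal"] = roll_up_temporal(children)
expanded = [with_inherited(parent, child) for child in children]
```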

Should we have a discussion about which schema elements need to be transmitted on the parent & on the child datasets?

@philipashlock
Contributor Author

I think the isPartOf on the child will be sufficient as an additional property, but the description of the parent should make it clear that it's a collection.

I think some of the explicitness of the metadata can be up to the discretion of the metadata publisher, but redundancy should be ok. My recommendation is that more attention should be given to the title and description on the parent, but since most of the resources will be listed on the children and since they could be accessed directly, they should have meaningful titles and descriptions as well.

The minimum required fields would be the same for both parent and child.

Here's an example of what this would look like in a full data.json

@gbinal
Contributor

gbinal commented Sep 4, 2014

This is addressed in 8eb32ce

@mhogeweg
Contributor

realizing I'm late in the discussion, I suggest looking at the UML diagrams over at NOAA's Metadata Wiki and particularly the discussion of metadata hierarchies.

@philipashlock
Contributor Author

@mhogeweg is there anything in particular that you think should be applied here?

@mhogeweg
Contributor

mhogeweg commented Nov 4, 2014

The diagrams on the NOAA site show various options for aggregating datasets in an abstract type 'DS_Aggregate' that has some defined subtypes that reflect different types of aggregations (production series, like the USGS quadsheets), sensors (like Landsat), and others.

@philipashlock
Contributor Author

Thanks @mhogeweg. I think it might be best to start with something simple for now, but we can certainly look at making this more sophisticated in the future based on use and needs. Based on the proposal with isPartOf, the model does allow a lot of flexibility in terms of different groupings within a single collection, even collections within collections, but currently data.gov only displays one level of collections - it can't show a collection within a collection. I think we can definitely look at expanding that functionality in the future, but we wanted to have something simple to start off with that will meet most people's needs and I think this does that.
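
For a consumer that can only show one level, one possible approach is to walk each entry's `isPartOf` chain up to its outermost collection and group there. The sketch below is only an illustration of that idea with invented identifiers; it is not a description of how data.gov handles nesting.

```python
def top_level_collection(entries_by_id, identifier):
    """Follow isPartOf references upward and return the outermost identifier."""
    seen = set()
    current = identifier
    while True:
        if current in seen:
            raise ValueError(f"isPartOf cycle detected at {current!r}")
        seen.add(current)
        parent = entries_by_id.get(current, {}).get("isPartOf")
        if parent is None:
            return current
        current = parent


# Hypothetical three-level hierarchy: all years > one year > one state.
entries_by_id = {
    "tiger-all": {},
    "tiger-2010": {"isPartOf": "tiger-all"},
    "tiger-2010-al": {"isPartOf": "tiger-2010"},
}
print(top_level_collection(entries_by_id, "tiger-2010-al"))
```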

Since we've addressed the most basic requirement, I think it's fair to say that we have provided guidance on defining simple collections and can close this. However, I realize there is interest in expanding this capability either for more complex hierarchies or for other kinds of relationships than the child/parent subset/master relationship I described above.

I'd suggest using a separate issue either for the more complex collection hierarchy or for link relations - which would be a much broader set of use cases and might make more sense to apply to the distributions. For some other discussion of link relations, see #380 and #332 (comment)

@mhogeweg
Contributor

mhogeweg commented Nov 7, 2014

that sounds fine. in reality there are few use cases for collections within collections all the way down...

@gbinal
Contributor

gbinal commented Nov 10, 2014

Thanks everyone for working on this issue, including on the changes which have been accepted in the v1.1 update and merged into Project Open Data. Project Open Data is a living project though. Please continue any conversations around how the schema can be improved with new issues and pull requests!

@philipashlock
Contributor Author

There was also interest in providing a field to reference a source dataset that a dataset was derived from. Since that's a separate use case from the primary way we've defined collections here, I've gone ahead and opened a new issue for that. See #393


11 participants