This repository has been archived by the owner on Jun 18, 2024. It is now read-only.

Ability to specify primary or alternate metadata source #308

Open
philipashlock opened this issue May 8, 2014 · 9 comments

Comments

@philipashlock
Contributor

In most cases the data.json file is the only available public representation of the metadata for a record, but there are some instances where there's already a well established version of this metadata that would make sense to designate as the primary source. This would frequently be applicable for geospatial data that already has a well established and robust representation of its metadata.

There's a great benefit to the data.json file providing a comprehensive and unified view of all records for an agency, but it's also worth recognizing that there will be situations where it's not capable of providing the best or most complete metadata for a particular kind of dataset. In these cases it would be great if the data.json file were able to reference those other metadata sources.

perhaps something like:

"metadata_sources": [
    {
        "primary": true, 
        "source": "http://gis.agency.gov/metadata/", 
        "source_type": "waf",
        "schema": "ISO-19139"
    }
]
@philipashlock philipashlock added this to the Next Version of Common Core Metadata Schema milestone May 8, 2014
@mhogeweg
Contributor

mhogeweg commented May 9, 2014

Being a geonerd, I'd posit that more than 80% of data is geospatial and that thus the vast majority will have verbose metadata in FGDC or ISO format. just looking at data.gov and the data that's there confirms that.

we have been using the accessURL element with format: text/xml to reference this XML-based metadata just because it's there and there was so far no other DCAT element for it

...
"distribution" : [
  {
    "accessURL" : "http:\/\/gptogc.esri.com\/geoportal\/rest\/document?id=%7BB359142D-11A3-4E20-BBCC-DE17E413C699%7D",
    "format" : "text\/xml"
  },
...
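
As a rough illustration of the convention described above, a consumer could pick out the metadata links by filtering `distribution` entries on `format`. This is only a sketch: the helper name `xml_metadata_links`, the dataset title, and the second (CSV) distribution are made up for the example, and nothing in the schema says `text/xml` always means a metadata record.

```python
import json


def xml_metadata_links(dataset):
    """Return the accessURL of every distribution whose format is text/xml."""
    return [
        dist["accessURL"]
        for dist in dataset.get("distribution", [])
        if dist.get("format") == "text/xml" and "accessURL" in dist
    ]


# Hypothetical record: the first distribution is the one from the comment,
# the title and the CSV entry are invented for illustration.
record = json.loads(r'''
{
  "title": "Example dataset",
  "distribution": [
    {
      "accessURL": "http:\/\/gptogc.esri.com\/geoportal\/rest\/document?id=%7BB359142D-11A3-4E20-BBCC-DE17E413C699%7D",
      "format": "text\/xml"
    },
    {
      "accessURL": "http://example.gov/data.csv",
      "format": "text/csv"
    }
  ]
}
''')

print(xml_metadata_links(record))
```

The weakness is visible in the filter itself: an XML data file would match just as well as an XML metadata record, which is one argument for a dedicated element.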

Your example source can actually hold 1000s of records, and they don't all need to be in the ISO 19139 schema, even coming from the same agency. Was that your intent with a single JSON element?

This also brings back up the discussion about CSW. CKAN can now apparently harvest CSW. Why not allow agencies to register their CSW with Data.gov instead of first exporting to DCAT and then getting the metadata XML through this suggested metadata_sources element?

@smrgeoinfo
Contributor

In a federated system in which metadata is being harvested from different catalogs into aggregators (e.g. data.gov), it seems obvious that the aggregator should maintain a link to the harvest source. Implementations like ESRI Geoportal and GeoNetwork do this 'under the covers'. Unfortunately the major metadata interchange formats (19139, FGDC XML) don't have an obvious element that provides a link to the original source of a record. ISO 19115/19139 does have the concept of an identifier for the metadata record (gmd:fileIdentifier), which if it were a resolvable URI could provide such a link, and a metadata contact element that should identify the steward of the original metadata.
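
To make the `gmd:fileIdentifier` point concrete, here is a minimal sketch that reads the identifier from an ISO 19139 record and checks whether it at least has the shape of an HTTP(S) URI. The sample record and its identifier are invented for illustration, and "looks like a URL" is only a proxy: the check does not dereference anything.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespaces used by the ISO 19139 XML encoding.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# Hypothetical, heavily trimmed record; the identifier value is made up.
SAMPLE = """<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:fileIdentifier>
    <gco:CharacterString>http://example.gov/metadata/abc-123</gco:CharacterString>
  </gmd:fileIdentifier>
</gmd:MD_Metadata>"""


def file_identifier(xml_text):
    """Return the gmd:fileIdentifier value, or None if absent or empty."""
    root = ET.fromstring(xml_text)
    node = root.find("gmd:fileIdentifier/gco:CharacterString", NS)
    if node is None or not (node.text or "").strip():
        return None
    return node.text.strip()


def looks_resolvable(identifier):
    """True if the identifier is shaped like an http(s) URI (no network call)."""
    if not identifier:
        return False
    parts = urlparse(identifier)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


identifier = file_identifier(SAMPLE)
print(identifier, looks_resolvable(identifier))
```

A record whose identifier is a bare UUID would return `False` here, which is the common case that leaves an aggregator without a link back to the source.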

CKAN (with pyCSW) can harvest from a CSW, but I think it still needs an adapter to map harvested content into the DCAT model; this shouldn't be too hard since DCAT is mostly the DC elements that are already CSW core queryable/deliverable elements. I guess the big question is who does the mapping from their existing metadata to DCAT: the metadata originator or the DCAT harvester.

@philipashlock
Contributor Author

To respond to that last question: Data.gov can help in this effort, and once harvested we do maintain a link to all the harvest sources, but ultimately it's each agency's responsibility under Project Open Data (M-13-13) to ensure that all their public datasets (which should include any existing metadata) are represented in the DCAT-based data.json file. The idea is to build the capability to manage a unified, comprehensive data inventory within each agency rather than after the fact by a third-party harvester.

I think denoting all the harvest sources and schemas at a granular level across the whole agency would also provide a nice inventory to review legacy harvest sources and deprecated metadata schemas.

@mhogeweg
Contributor

mhogeweg commented May 9, 2014

ISO left a huge opportunity open by not making the file identifier mandatory and unique. Systems like Geoportal and GeoNetwork work around this.

My understanding is that pycsw takes the full XML and transforms it to ISO 19139 encoding. It would be good if the transformations were published on the data.gov site. Is 19115 used? 19119? 19110? Etc. This mapping is not trivial and will require certain choices, given the differences between FGDC, ISO, and other metadata schemas. How are unique identifiers assigned in pycsw vs. CKAN vs. the source metadata?

There are mappings already on project open data between metadata schemas, but the above questions on identifiers aren't addressed there.

If you're looking to build these references between things managed in loosely coupled systems (for provenance or other semantic goals), agreement on those identifiers becomes really important.

M

@philipashlock
Contributor Author

> Why not allow agencies to register their CSW with Data.gov instead of first exporting to DCAT and then getting the metadata XML through this suggested metadata_sources element?

@mhogeweg: In situations where this makes sense (which is probably most of the time for geospatial data with existing metadata) this would be the approach. But since agencies are already required to provide all their records in the DCAT-based data.json file, this would provide a way to flag that there is better metadata, so the data.json harvester could skip these records and defer to the other harvest sources.
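
A minimal sketch of that skip-and-defer behavior, assuming the `metadata_sources` structure proposed at the top of this issue. That field is only a proposal and is not part of the schema; the function name `split_for_harvest` and the sample records are hypothetical.

```python
def split_for_harvest(datasets):
    """Separate records to harvest from data.json from records to defer.

    A record is deferred when it lists a metadata source flagged
    "primary": true (the field proposed in this issue, not in the schema).
    Returns (harvest, deferred), where deferred holds
    (identifier, source URL, source type) tuples.
    """
    harvest, deferred = [], []
    for dataset in datasets:
        primary = [
            src for src in dataset.get("metadata_sources", [])
            if src.get("primary") is True
        ]
        if primary:
            src = primary[0]
            deferred.append(
                (dataset.get("identifier"), src.get("source"), src.get("source_type"))
            )
        else:
            harvest.append(dataset)
    return harvest, deferred


# Hypothetical records for illustration only.
datasets = [
    {"identifier": "budget-2014", "title": "Budget tables"},
    {
        "identifier": "parcels",
        "title": "Parcel boundaries",
        "metadata_sources": [
            {
                "primary": True,
                "source": "http://gis.agency.gov/metadata/",
                "source_type": "waf",
                "schema": "ISO-19139",
            }
        ],
    },
]

harvest, deferred = split_for_harvest(datasets)
print([d["identifier"] for d in harvest])
print(deferred)
```

A real harvester would still need to match each deferred record to the corresponding record in the other source, which brings back the identifier question raised earlier in the thread.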

@philipashlock philipashlock modified the milestone: Next Version of Common Core Metadata Schema (1.0 -> 1.1.) Jul 24, 2014
@gbinal gbinal removed this from the Next Version of Common Core Metadata Schema (1.0 -> 1.1.) milestone Jul 24, 2014
@rebeccawilliams
Contributor

Adding the Geospatial tag for forthcoming mapping updates. Does the new conformsTo field (via #309 and the v 1.1 updates) address these other pain points or should this remain open @philipashlock?

@philipashlock
Contributor Author

@philarcher1 Do you have any suggestions on how we might link to source metadata like ISO 19115 when the same metadata is being represented as DCAT?

@andrea-perego

Sorry to jump in your conversation. I am not a member of this group, and I hope you would forgive my intrusion.

@philarcher1 kindly made me aware of the issue you're discussing, which is definitely relevant to some work we have been doing in the EU on the RDF representation of INSPIRE metadata (an ISO 19115 profile).

Actually, in our specs we have not explicitly included a link to the original metadata record, but this is an additional feature we are considering.

The option we are currently in favour of is to use dcat:CatalogRecord + dct:source. Adapting one of the DCAT examples:

:catalog  dcat:record :record-001  .
:record-001
       a dcat:CatalogRecord ;
       foaf:primaryTopic :dataset-001 ;
       dct:source <http://gptogc.esri.com/geoportal/rest/document?id=%7BB359142D-11A3-4E20-BBCC-DE17E413C699%7D> . 

Does this make sense to you?

PS: I would be happy to provide more details on the work concerning the RDF representation of INSPIRE metadata, if this is of interest. And in any case your feedback would be more than welcome!

@rebeccawilliams
Contributor

See also source as suggested in #393
