metadata about the metadata record-- subjectOf, about.... #13

Closed
smrgeoinfo opened this issue Jul 14, 2024 · 9 comments

@smrgeoinfo
Collaborator

The CDIF discovery document asserts that a metadata record has two parts: one part is about the metadata record itself, and the other is the content about the resource that the metadata documents. The part about the record specifies the identifier for the metadata record, agents with responsibility for the record, when it was last updated, what specification or profiles the metadata serialization conforms to, and other optional properties of the metadata that are deemed useful.

Schema.org includes several properties that can be used to embed information about the metadata record in the resource metadata (sdDatePublished, sdLicense, sdPublisher), but it lacks a way to provide an identifier for the metadata record distinct from the resource it describes, to specify agents responsible for the metadata other than the publisher, or to assert specification or profile conformance for the metadata record itself.

@smrgeoinfo
Collaborator Author

smrgeoinfo commented Jul 14, 2024

There are two patterns that could be used to structure the two parts of the metadata record:

Option 1. The root object is the described resource:

{   "@context": "https://schema.org",
    "@id": "ex:URIforDescribedResource",
    "@type": "ImageObject",
    "name": "Picture of analytical setup",
    "description": "Description of the resource",
    "subjectOf": {
        "@id": "ex:URIforTheMetadata",
        "@type": "DigitalDocument",
        "dateModified": "2017-05-23",
        "encoding": {
            "@type": "MediaObject",
            "dcterms:conformsTo": {"@id": "ex:cdif-metadataSpec"}
        },
        "about": {"@id": "ex:URIforDescribedResource"}
    }  }

Option 2: root object is the metadata record

{   "@context": "https://schema.org",
    "@id": "ex:URIforTheMetadata",
    "@type": "DigitalDocument",
    "dateModified": "2017-05-23",
    "encoding": {
        "@type": "MediaObject",
        "dcterms:conformsTo": {"@id": "ex:cdif-metadataSpec"}
    },
    "about": {
        "@id": "ex:URIforDescribedResource",
        "@type": "ImageObject",
        "identifier": "identifier for thing in the world (e.g. doi)",
        "name": "Picture of analytical setup",
        "description": "Description of the resource",
        "subjectOf": {"@id": "ex:URIforTheMetadata"}
    }  }

The RDF triples generated by these two approaches are identical, apart from the extra "identifier" statement included in Option 2.
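The comparison can be checked with a rough sketch like the one below. It assumes a simplified walker rather than a real JSON-LD processor: `to_triples` and its blank-node labelling are hypothetical helpers written for this illustration, and property names are kept as written instead of being expanded through the @context.

```python
def to_triples(node, triples=None, blank_label=None):
    """Collect (subject, predicate, object) tuples from a nested node.

    Simplification (not real JSON-LD expansion): keys are kept as written,
    and a nested object without an @id gets a label built from its parent
    subject and the property that points to it.
    """
    if triples is None:
        triples = set()
    subject = node.get("@id", blank_label)
    for key, value in node.items():
        if key in ("@context", "@id"):
            continue
        predicate = "rdf:type" if key == "@type" else key
        for item in value if isinstance(value, list) else [value]:
            if isinstance(item, dict):
                child_label = f"_:{subject}/{key}"
                triples.add((subject, predicate, item.get("@id", child_label)))
                to_triples(item, triples, child_label)
            else:
                triples.add((subject, predicate, item))
    return triples


encoding = {
    "@type": "MediaObject",
    "dcterms:conformsTo": {"@id": "ex:cdif-metadataSpec"},
}

# Option 1: the root object is the described resource.
option1 = {
    "@context": "https://schema.org",
    "@id": "ex:URIforDescribedResource",
    "@type": "ImageObject",
    "name": "Picture of analytical setup",
    "description": "Description of the resource",
    "subjectOf": {
        "@id": "ex:URIforTheMetadata",
        "@type": "DigitalDocument",
        "dateModified": "2017-05-23",
        "encoding": encoding,
        "about": {"@id": "ex:URIforDescribedResource"},
    },
}

# Option 2: the root object is the metadata record.
option2 = {
    "@context": "https://schema.org",
    "@id": "ex:URIforTheMetadata",
    "@type": "DigitalDocument",
    "dateModified": "2017-05-23",
    "encoding": encoding,
    "about": {
        "@id": "ex:URIforDescribedResource",
        "@type": "ImageObject",
        "identifier": "identifier for thing in the world (e.g. doi)",
        "name": "Picture of analytical setup",
        "description": "Description of the resource",
        "subjectOf": {"@id": "ex:URIforTheMetadata"},
    },
}

t1, t2 = to_triples(option1), to_triples(option2)
print("only in option 1:", t1 - t2)
print("only in option 2:", t2 - t1)
```

Under these assumptions every statement from Option 1 also appears in Option 2; the only statement unique to Option 2 is the "identifier" on the described resource.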

@smrgeoinfo
Collaborator Author

The metadata about the metadata is important in harvesting/federated catalog systems to keep track of where metadata came from, what format/profile it uses (harvesters need this to process it), and update dates. Many people are using the approach in which the root of the schema.org record has "@id": "ex:URIforDescribedResource" (first approach above); this is the information that goes into search indexes in general.

From that point of view the first approach makes more sense, as it follows common practice. The pitfall is that the 'subjectOf' property is widely used for all kinds of things, so care is necessary to find the 'subjectOf' that provides the 'self' information about the metadata record. I suggest modifying the above encoding to include a description string that clearly identifies the subjectOf link to the metadata digital document.

{   "@context": "https://schema.org",
    "@id": "ex:URIforDescribedResource",
    "@type": "ImageObject",
    "name": "Picture of analytical setup",
    "description": "Description of the resource",
    "subjectOf": {
        "@id": "ex:URIforTheMetadata",
        "@type": "DigitalDocument",
        "dateModified": "2017-05-23",
        "description":"this metadata document",
        "encoding": {
            "@type": "MediaObject",
            "dcterms:conformsTo": {"@id": "ex:cdif-metadataSpec"}
        },
        "about": {"@id": "ex:URIforDescribedResource"}
    }  }

Including the 'about' property with the back link to ex:URIforDescribedResource is useful, but it could also be derived as the inverse of the subjectOf property.
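As an illustration of how a consumer might pick out the 'self' entry when subjectOf holds several values, here is a minimal sketch. It matches on the DigitalDocument type plus the 'about' back link rather than on the description string, which is only one possible heuristic. `find_self_metadata` is a hypothetical helper, and the WebPage entry is invented for the example.

```python
def find_self_metadata(resource):
    """Return the subjectOf entry that describes this metadata record, or None.

    Heuristic (an assumption, not a CDIF rule): the entry must be typed
    DigitalDocument and its 'about' must point back to the resource @id.
    """
    entries = resource.get("subjectOf", [])
    if isinstance(entries, dict):
        entries = [entries]
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        about = entry.get("about", {})
        about_id = about.get("@id") if isinstance(about, dict) else about
        if entry.get("@type") == "DigitalDocument" and about_id == resource.get("@id"):
            return entry
    return None


record = {
    "@id": "ex:URIforDescribedResource",
    "@type": "ImageObject",
    "name": "Picture of analytical setup",
    "subjectOf": [
        # Invented entry: some other document that happens to mention the image.
        {"@type": "WebPage", "name": "A page that mentions the image"},
        {
            "@id": "ex:URIforTheMetadata",
            "@type": "DigitalDocument",
            "dateModified": "2017-05-23",
            "description": "this metadata document",
            "about": {"@id": "ex:URIforDescribedResource"},
        },
    ],
}

meta = find_self_metadata(record)
print(meta["@id"], meta["dateModified"])
```

If the 'about' back link is omitted and only inferred from subjectOf, this particular check no longer works, which is one argument for keeping the explicit link or the description string.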

@pbuttigieg

pbuttigieg commented Jul 14, 2024

xref WorldFAIR D11.3, Section 1.2

In the JSON-LD specification, the @id is understood as a reference node in a graph, which has a value that leads to a serialisation of a metadata graph (i.e. the JSON-LD document itself). The example given is an @id value of http://me.markus-lanthaler.com/

That URL resolves to a JSON-LD document that (as of the timestamp on this comment), is:

{
  "@context": {
    "@vocab": "http://schema.org/",
    "image": {
      "@type": "@id"
    }
  },
  "@id": "http://me.markus-lanthaler.com/",
  "@type": "Person",
  "name": "Markus Lanthaler",
  "honorificPrefix": "Dr.",
  "image": "http://www.markus-lanthaler.com/images/markus-lanthaler.jpg",
  "url": "http://www.markus-lanthaler.com/",
  "nationality": {
    "@type": "Country",
    "name": "Italy"
  },
  "jobTitle": "Software Engineer",
  "affiliation": {
    "@type": "Organization",
    "name": "Google",
    "url": "http://www.google.com/"
  },
  "worksFor": {
    "@type": "Organization",
    "name": "Google",
    "url": "http://www.google.com/"
  },
  "sameAs": [
    "https://twitter.com/MarkusLanthaler",
    "https://www.google.com/+MarkusLanthaler",
    "https://www.linkedin.com/in/markuslanthaler"
  ]
}

This example uses schema.org semantics and is typed as Person. The @type and @id are JSON-LD keywords that give way to the semantics of the @context once you're in the graph defined by the JSON-LD file. This is the approach that ODIS uses, which does not conform to either option above, and should be offered as another option.

To identify the resource described by the metadata, the schema:identifier property should be used. This is a lean way to discover and combine metadata graphs without either a) having to disentangle the identifier of the metadata from the entity described or b) having to (persistently, potentially hundreds of thousands of times) strip away content that is not about the entities of interest (the DigitalDocument encapsulation in Option 2).

In the schema.org world, metadata about the metadata record itself can be provided by properties like schema:sdPublisher or by using additionalProperty properties with PropertyValue types as values. However, in a recent CDIF WG call, @smrgeoinfo correctly raised that schema.org's sd properties step outside the logic above: they are not about the typed entity described by the subgraph, but about the metadata record (the JSON-LD file, in this case). One has to "just know" that, as all other properties are about the typed entity described by the JSON-LD file.

This is resolvable by having another file (F2) containing metadata about the JSON-LD record above (F1). If one wants to serialise F2 in JSON-LD with schema.org semantics, one has to be a little careful.

  • If F2 is also understood as a metadata graph, typed as a Dataset, and uses a property like about to point to the @id of F1, that means that F1 is the subjectOf F2. This has the same issue as the "sd" properties, as subjectOf and about are referencing the entity described by F1, rather than F1 as a file in itself.
  • If, however, F2 is typed as a Dataset with its own @id, but an identifier with a value identical to the @id of F1, then I think we're consistent again.

F2 would look something like this (with as many properties from Dataset as one needs; note that the values are fictional):

{
  "@context": {
    "@vocab": "http://schema.org/"
  },
  "@id": "http://metadata.about.me.markus-lanthaler.com/",
  "@type": "Dataset",
  "name": "Metadata about the Person, Markus Lanthaler, in JSON-LD",
  "identifier": "http://me.markus-lanthaler.com/",
  "encodingFormat": "application/ld+json",
  "dateCreated": "2024-02-13",
  "dateModified": "2024-05-23",
  "datePublished": "2024-05-23",
  "creator": {
    "@type": "Organization",
    "name": "Some Org",
    "url": "http://www.someorg.com/"
  },
  "maintainer": {
    "@type": "Organization",
    "name": "Some Org",
    "url": "http://www.someorg.com/"
  }
}

Naturally, this could end up being metadata inception, so one has to draw the line on explicit representation of metadata about (meta)data somewhere and then rely on, e.g., a file system's native metadata functions.

@smrgeoinfo
Collaborator Author

smrgeoinfo commented Jul 15, 2024

@pbuttigieg thanks for the response. I think the solution I suggested is pretty much equivalent to your F2, except I am proposing to include it inline. I don't see a big problem for processors who don't care about it to ignore it.

There are a couple of issues in F2. For instance, it generates this triple:
<http://metadata.about.me.markus-lanthaler.com/> <http://schema.org/identifier> "http://me.markus-lanthaler.com/" .
which says that http://me.markus-lanthaler.com/ identifies a metadata record, when I think the intention is that it identify a person.

what is missing is the 'about' link from F2 to the node it describes which is the
"about":{"@id":"ex:URIforDescribedResource"}
statement in my suggestion. I'd expect
"about":{"@id":"http://me.markus-lanthaler.com/"} in the JSON-LD

I was proposing a more specific statement of the encoding format, pointing to a specific profile. In the wild, it would probably be useful to make this an array including generic and specific formats along the lines of [json, json-ld, specificprofile]

I also propose that the metadata record is better represented as a "DigitalDocument" than a "Dataset", since it's a single digital object.

perhaps using
"name": "Metadata about ..." instead of "description": "Description of the resource" is a better approach. Thoughts?

aside:
what is a 'node'-- when I parse https://www.w3.org/TR/rdf11-concepts/#dfn-node literally, I end up with:
...a node is a subject or object... a subject is an IRI or a blank node; an object is an IRI, a literal or a blank node. Alternate interpretations -- is the node the IRI (string), blank node (digital object), literal (string), or the thing those symbols represent; does the IRI, blank node, or literal represent a thing in the world or a digital representation (one of many possible) for that thing? If we're not clear which interpretation is under discussion, misunderstanding results.

@pbuttigieg

> @pbuttigieg thanks for the response. I think the solution I suggested is pretty much equivalent to your F2, except I am proposing to include it inline.

I'm not sure it is. The way the aboutness is handled is quite different.

The R sub-principles in the FAIR principles state that the metadata should persist after any data they describe are deleted. Thus ODIS will recommend keeping (meta)metadata separate.

> I don't see a big problem for processors who don't care about it to ignore it.

I do. It seems like an entirely avoidable issue, and at scale it's many operations, thus avoidable energy use. At any rate, CDIF guidance shouldn't prescribe one or the other.

> There are a couple of issues in F2. For instance, it generates this triple:
> <http://metadata.about.me.markus-lanthaler.com/> <http://schema.org/identifier> "http://me.markus-lanthaler.com/" .
> which says that http://me.markus-lanthaler.com/ identifies a metadata record, when I think the intention is that it identify a person.

I think that's consistent: the identifier value space is different from the @id value space. This is consistent with the RDF guidance you pasted below.

> what is missing is the 'about' link from F2 to the node it describes which is the
> "about":{"@id":"ex:URIforDescribedResource"}

As mentioned and explained above, I think this isn't correct. F2 isn't about the described resource.

> statement in my suggestion. I'd expect
> "about":{"@id":"http://me.markus-lanthaler.com/"} in the JSON-LD

As above, I don't think that's right. F2 is about F1, not the thing F1 describes.

> I was proposing a more specific statement of the encoding format, pointing to a specific profile. In the wild, it would probably be useful to make this an array including generic and specific formats along the lines of [json, json-ld, specificprofile]

I'm not sure what is meant here.

> I also propose that the metadata record is better represented as a "DigitalDocument" than a "Dataset", since it's a single digital object.

Both work; Dataset is more useful and accurate IMO.

> perhaps using
> "name": "Metadata about ..." instead of "description": "Description of the resource" is a better approach. Thoughts?

Risky: names are fickle.

> aside:
> what is a 'node'-- when I parse https://www.w3.org/TR/rdf11-concepts/#dfn-node literally, I end up with:
> ...a node is a subject or object... a subject is an IRI or a blank node; an object is an IRI, a literal or a blank node. Alternate interpretations -- is the node the IRI (string), blank node (digital object), literal (string), or the thing those symbols represent; does the IRI, blank node, or literal represent a thing in the world or a digital representation (one of many possible) for that thing? If we're not clear which interpretation is under discussion, misunderstanding results.

I'll think a bit more, but my first approximation is that it's a lot about the predicate.

@smrgeoinfo
Collaborator Author

smrgeoinfo commented Jul 23, 2024

Next iteration. @id can identify a JSON-LD object or the thing that that object is about. It's ambiguous as defined in the various specs. Proposed CDIF solution:

Convention - include schema:identifier property that identifies a thing in the world that is the subject of a JSON-LD graph node. First guess default is then that @id identifies the 'representation' -- the JSON object that contains the @id element.

Longer explanation from draft CDIF handbook:

In a harvesting/federated catalog system some metadata about the metadata is important to keep track of where metadata came from, what format/profile it uses (harvesters need this to process it), and update dates (see Metadata Content Requirements). Unambiguous expression of this information requires making statements about a metadata record distinct from the thing in the world that the metadata describes (see GitHub issues 1, 2). In an RDF framework, this requires a distinct identifier for the metadata record object that will serve as the subject for these triples.

Schema.org includes several properties that can be used to embed information about the metadata record in the resource metadata (sdDatePublished, sdLicense, sdPublisher), but it lacks a way to provide an identifier for the metadata record distinct from the resource it describes, to specify agents responsible for the metadata other than the publisher, or to assert specification or profile conformance for the metadata record itself.

In the RDF serialization, Schema.org metadata records are JSON-LD node objects, and include an "@id" keyword with a value that identifies the node. This identifier can be interpreted to represent a thing in the world that the metadata record (the 'node') is about, or to represent the metadata record (a JSON object) itself. Here is a short example record (other '@' properties are explained below):

{   "@context": "https://schema.org",
    "@id": "ex:URIforResource",
    "name": "unique title for the resource",
    "description": "Description of the resource",
    "dateModified": "2017-05-23"
}

When this JSON-LD is converted to RDF triples (e.g. using the JSON-LD playground), the result is:

<ex:URIforResource> <http://schema.org/description> "Description of the resource" .
<ex:URIforResource> <http://schema.org/name> "unique title for the resource" .
<ex:URIforResource> <http://schema.org/dateModified> "2017-05-23"^^<http://schema.org/Date> .

The first two triples would be interpreted as statements about the thing in the world that the metadata record is about. The third triple is ambiguous: was the metadata content modified, or the described resource in the world? There does not seem to be any recognized best practice or consensus for dealing with this issue, so CDIF defines these conventions.

Use the schema.org identifier property to identify a thing in the world that is the subject of the JSON-LD node. The identified thing might be physical, imaginary, abstract, or a digital object. The JSON-LD @id property identifies a node in a graph, and can be interpreted in different ways; as a URI it is expected to dereference to produce the same JSON-LD object in which it is defined. Given this convention, when the metadata record is processed, the processor should use the schema:identifier as subject of triples about the subject of the metadata record to avoid ambiguity. In addition, this convention would suggest that if a schema:identifier property is present, the @id property should be interpreted to identify the JSON object that is the representation of the node in the knowledge graph.
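The convention can be read as a small rule for a processor, sketched below under stated assumptions: nodes are plain Python dicts, "identifier" is a single string (in practice it could also be a list or a PropertyValue), and both helper names are hypothetical.

```python
def statement_subject(node):
    """Subject to use for statements about the described thing.

    Follows the convention above: prefer schema:identifier when present,
    otherwise fall back to the node's @id.
    """
    return node.get("identifier") or node.get("@id")


def representation_id(node):
    """Identifier of the JSON object itself, under the same convention.

    Only when schema:identifier is present is @id read as naming the
    representation; otherwise its meaning stays ambiguous and None is returned.
    """
    return node.get("@id") if "identifier" in node else None


with_identifier = {
    "@id": "ex:URIforNode1",
    "identifier": "ex:URIforDescribedResource",
    "name": "unique title for the resource",
}
without_identifier = {
    "@id": "ex:URIforResource",
    "name": "unique title for the resource",
}

print(statement_subject(with_identifier), representation_id(with_identifier))
print(statement_subject(without_identifier), representation_id(without_identifier))
```

A standard JSON-LD to RDF conversion would still use @id as the subject, so a rule like this would have to be applied as a post-processing step by consumers that adopt the convention.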

Statements about the metadata record as a distinct entity should be made using a separate identified node object. This node object can be embedded in the metadata record about the resource in the world (Example 1 below), or published as a separate node (Example 2 below).

{   "@context": [
        "https://schema.org",
        {"dcterms": "http://purl.org/dc/terms/",
         "ex": "https://example.com/99152/"
        }
    ],
    "@id": "ex:URIforNode1",
    "@type": "appropriate schema.org type",
    "identifier": "ex:URIforDescribedResource",
    "name": "unique title for the resource",
    "description": "Description of the resource",
    "subjectOf": {
        "@id": "ex:URIforNode2",
        "@type": "DigitalDocument",
        "dateModified": "2017-05-23",
        "identifier": "ex:URIforNode1",
        "description": "metadata about documentation for ex:URIforDescribedResource",
        "dcterms:conformsTo": {"@id": "ex:cdif-metadataSpec"}
    }
}

Example 1. Metadata about the metadata embedded.

{
    "@context": [
        "https://schema.org",
        {"dcterms": "http://purl.org/dc/terms/",
         "ex": "https://example.com/99152/"}
    ],
    "@graph": [
        {
            "@id": "ex:URIforNode1",
            "@type": "Dataset",
            "identifier": "ex:URIforDescribedResource",
            "name": "unique title for the resource",
            "description": "Description of the resource"
        },
        {
            "@id": "ex:URIforNode2",
            "@type": "DigitalDocument",
            "dateModified": "2017-05-23",
            "identifier": "ex:URIforNode1",
            "description": "metadata about documentation for ex:URIforDescribedResource",
            "dcterms:conformsTo": {"@id": "ex:cdif-metadataSpec"}
        }
    ]
}

Example 2. Metadata about metadata as a separate graph node.

Including the schema:description with the string "metadata about documentation for ex:URIforDescribedResource" will allow disambiguating different usages of the subjectOf property. The ex namespace in the example above is only included so the example is valid; actual metadata would likely have its own namespace for resource and metadata URIs. The distinct identifier for the metadata record (ex:URIforNode1) allows statements to be made about the metadata separately from statements about the resource it describes.
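To show how the embedded form relates to the separate-node form, here is a minimal sketch that lifts the embedded subjectOf node into a two-node @graph. `split_embedded` is a hypothetical helper working on plain dicts, not a JSON-LD processor. It only moves a subjectOf object whose identifier equals the enclosing node's @id, and, like Example 2, it drops the explicit subjectOf link so that the two nodes are tied together by the identifier value only.

```python
import copy


def split_embedded(record):
    """Return an @graph document with the metadata-about-metadata node split out."""
    record = copy.deepcopy(record)  # leave the caller's record untouched
    context = record.pop("@context", None)
    meta = record.get("subjectOf")
    if isinstance(meta, dict) and meta.get("identifier") == record.get("@id"):
        del record["subjectOf"]
        graph = [record, meta]
    else:
        graph = [record]  # nothing recognisable to split out
    result = {} if context is None else {"@context": context}
    result["@graph"] = graph
    return result


embedded = {
    "@context": [
        "https://schema.org",
        {"dcterms": "http://purl.org/dc/terms/", "ex": "https://example.com/99152/"},
    ],
    "@id": "ex:URIforNode1",
    "@type": "Dataset",
    "identifier": "ex:URIforDescribedResource",
    "name": "unique title for the resource",
    "subjectOf": {
        "@id": "ex:URIforNode2",
        "@type": "DigitalDocument",
        "dateModified": "2017-05-23",
        "identifier": "ex:URIforNode1",
        "dcterms:conformsTo": {"@id": "ex:cdif-metadataSpec"},
    },
}

separate = split_embedded(embedded)
print([node["@id"] for node in separate["@graph"]])
```

Note that the two forms are not strictly the same graph: the embedded form carries a subjectOf statement that the separate form lacks.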

@smrgeoinfo
Collaborator Author

smrgeoinfo commented Jul 25, 2024

another possible solution:

{
    "@context": [
        "https://schema.org",
        {"dcterms": "http://purl.org/dc/terms/",
         "ex": "https://example.com/99152/"}
    ],
    "@graph": [
        {
            "@id": "ex:URIforDescribedResource",
            "@type": "Dataset",
            "name": "unique title for the resource",
            "description": "Description of the resource"
        },
        {
            "@id": "ex:URIforNode2",
            "@type": "DigitalDocument",
            "dateModified": "2017-05-23",
            "url": "ex:URIforDescribedResource",
            "description": "metadata about documentation for ex:URIforDescribedResource",
            "dcterms:conformsTo": {"@id": "ex:cdif-metadataSpec"}
        }
    ]
}

The metadata record can use @id with the identifier for the described resource, so the generated triples with @id make sense. The node with information about the metadata record links to its target metadata using the sdo:url property, under the interpretation that dereferencing a node identifier should return the JSON object that has that @id. This seems less divergent from common usage than the sdo:identifier approach suggested above.

better yet instead of "url": ...

"about":{
     "@type":"DigitalDocument",
     "url":"ex:URIforDescribedResource"} 

seems clearer to me.

Including an "additionalType":"ex:metadataDocumentation" or something like that in the metadata-about-metadata node would also help clarify things.

@smrgeoinfo
Collaborator Author

smrgeoinfo commented Sep 16, 2024

suggested solution

{
    "@context": [
        "https://schema.org",
        {"dcterms": "http://purl.org/dc/terms/",
         "ex": "https://example.com/99152/"}
    ],
    "@graph": [
        {
            "@id": "ex:URIforDescribedResource",
            "@type": "Dataset",
            "name": "unique title for the resource",
            "description": "Description of the resource"
             .... a bunch of other stuff describing the resource...
        },
        {
            "@id": "ex:URIforNode2",
            "@type": "DigitalDocument",
            "dateModified": "2017-05-23",
            "about": {
                "@type": "DigitalDocument",
                "url": "ex:URIforDescribedResource"
            },
            "description": "metadata about documentation for ex:URIforDescribedResource",
            "dcterms:conformsTo": {"@id": "ex:cdif-metadataSpec"}
        }
    ]
}

The node with "@id": "ex:URIforNode2" is analogous to dcat:CatalogRecord.

This approach allows constructing RDF triples from the JSON-LD according to standard practice (using the @id as the subject) instead of having to use the sdo:identifier as the subject.

@smrgeoinfo
Collaborator Author

smrgeoinfo commented Oct 18, 2024

Decision from discussions at Dagstuhl 2024-10:
keep metadata about metadata in separate graph (has own @id)
@id identifies graph node-- an abstract data object, analogous to the primary key in a relational data table. The schema:identifier identifies the actual resource that is the subject of the information in the graph node.

Decision is to recommend serialization like this:

{   "@context": [
        "https://schema.org",
        {"dcterms": "http://purl.org/dc/terms/",
         "ex": "https://example.com/99152/"
        }
    ],
    "@id": "ex:URIforNode2246",
    "@type": ["Dataset"],
    "identifier": "ex:URIforTheDataset",
    "name": "unique title for the resource",
    "description": "Description of the resource",
    .... other properties of the resource...
    "subjectOf": {
        "@id": "ex:URIforTheDigitalObjectMetadata",
        "@type": "DigitalDocument",
        "identifier": "ex:URIforNode2246",
        "dcterms:conformsTo": "ex:cdif_SDO_profile_uri",
        "dateModified": "2014-02-23"
    }
}

or the equivalent serialization as a graph with two distinct nodes:

{
    "@context": [
        "https://schema.org",
        {"dcterms": "http://purl.org/dc/terms/",
         "ex": "https://example.com/99152/"}
    ],
    "@graph": [
        {
            "@id": "ex:URIforNode2246",
            "@type": "Dataset",
            "identifier": "ex:URIforDescribedResource",
            "name": "unique title for the resource",
            "description": "Description of the resource"
        },
        {
            "@id": "ex:URIforTheDigitalObjectMetadata",
            "@type": "DigitalDocument",
            "dateModified": "2017-05-23",
            "identifier": "ex:URIforNode2246",
            "description": "metadata about documentation for ex:URIforDescribedResource",
            "dcterms:conformsTo": "ex:cdif-metadataSpec"
        }
    ]
}
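A harvester that has to accept either serialization could locate the node describing the metadata record along the following lines. This is a minimal sketch, not part of the CDIF recommendation: `find_record_metadata` is a hypothetical helper, documents are treated as plain dicts with no JSON-LD processing, and "identifier" is assumed to be a single string.

```python
def find_record_metadata(doc):
    """Return (record_node, metadata_node) or None.

    Looks for a DigitalDocument node, either embedded via subjectOf or listed
    separately in @graph, whose identifier equals the @id of a top-level node.
    """
    nodes = doc.get("@graph", [doc])
    by_id = {node["@id"]: node for node in nodes if "@id" in node}
    candidates = list(nodes)
    for node in nodes:
        subject_of = node.get("subjectOf", [])
        if isinstance(subject_of, dict):
            subject_of = [subject_of]
        candidates.extend(s for s in subject_of if isinstance(s, dict))
    for candidate in candidates:
        types = candidate.get("@type", [])
        if isinstance(types, str):
            types = [types]
        target = candidate.get("identifier")
        if "DigitalDocument" in types and isinstance(target, str) and target in by_id:
            return by_id[target], candidate
    return None


embedded = {
    "@id": "ex:URIforNode2246",
    "@type": ["Dataset"],
    "identifier": "ex:URIforTheDataset",
    "subjectOf": {
        "@id": "ex:URIforTheDigitalObjectMetadata",
        "@type": "DigitalDocument",
        "identifier": "ex:URIforNode2246",
        "dateModified": "2014-02-23",
    },
}
as_graph = {
    "@graph": [
        {
            "@id": "ex:URIforNode2246",
            "@type": "Dataset",
            "identifier": "ex:URIforDescribedResource",
        },
        {
            "@id": "ex:URIforTheDigitalObjectMetadata",
            "@type": "DigitalDocument",
            "identifier": "ex:URIforNode2246",
            "dateModified": "2017-05-23",
        },
    ]
}

for doc in (embedded, as_graph):
    record, meta = find_record_metadata(doc)
    print(record["@id"], "is described by", meta["@id"], "modified", meta["dateModified"])
```

The match relies on the identifier of the DigitalDocument node being exactly the @id string of the record node, so prefixed and fully expanded forms of the same URI would need to be normalized first.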
