Demonstration of a SKOS-backed semi-open vocabulary for MIME types #363

Open
6 of 11 tasks
plbt5 opened this issue Apr 15, 2022 · 9 comments · May be fixed by #377

@plbt5
Contributor

plbt5 commented Apr 15, 2022

Background

In UCO, many types are identified by a string as opposed to a thing, i.e., an IRI-backed node in a graph. The advantage of the former is that a new type-by-string is easy to create when the particular type is missing from the ontology. The disadvantages are:

  1. strings are not first-class citizens in the ontology world, as opposed to IRI-backed things;
  2. consequently, additional code is needed to check whether the applied string equals the predefined type-by-string;
  3. the approach is error-prone with respect to typos in the string.

The advantages of the latter, i.e., type-by-IRI, are the converse of the disadvantages of type-by-string. At the same time, type-by-IRI has a disadvantage of its own: extending the available types with a new one requires knowledge of RDF(S), OWL, and/or the data model or ontology that defines the individuals being supplemented.

The purpose of this issue is to lay the foundation that is necessary to gain data/experience with users adding a type-by-IRI in UCO.

Objective / Purpose

The purpose of this issue is to lay the foundation that is necessary to gain data/experience with issues that users might run into when adding a type-by-IRI in UCO.

Requirements

Requirement 1

UCO shall have access to a SKOS-vocabulary that specifies individuals to represent each and every mime-type as defined by the IANA Media Types registry, in order to use these individuals to specify the type of a medium registered in UCO.

Requirement 2

The resulting taxonomy shall align with the standard two-tier scheme as defined by the IANA Media Type Registry:

  • Main tier: Media (Concept) Types, such as application, image, etc.
  • Secondary tier: Media Subtype, such as application/zip, image/gif, etc.
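
As a rough illustration of the two-tier scheme in Requirement 2, the sketch below splits a media type string into its main tier and secondary tier and groups a few entries accordingly. The names `split_media_type` and `group_by_top_level` and the sample list are assumptions made for illustration; they are not part of UCO or of the proposed taxonomy.

```python
from collections import defaultdict


def split_media_type(notation):
    """Split 'type/subtype' into (top-level type, full subtype notation)."""
    top_level, sep, subtype = notation.partition("/")
    if not sep or not top_level or not subtype:
        raise ValueError(f"not a type/subtype pair: {notation!r}")
    return top_level, notation


def group_by_top_level(notations):
    """Group full notations (secondary tier) under their top-level type (main tier)."""
    tiers = defaultdict(list)
    for notation in notations:
        top_level, full = split_media_type(notation)
        tiers[top_level].append(full)
    return dict(tiers)


# Sample entries taken from media types named in this issue.
SAMPLE = ["application/zip", "image/gif", "application/gzip"]
TIERS = group_by_top_level(SAMPLE)
```

In a SKOS rendering, each key of `TIERS` would presumably correspond to a top-level concept and each listed value to a narrower concept beneath it.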

Requirement 3

The SKOS-vocabulary shall be serialised in Turtle.

Requirement 4

Loosely-coupled: Any modification to the SKOS-vocabulary shall not imply a change to the UCO-ontology.

Requirement 5

Manageability: Any modification to the IANA Media Type Registry shall effect an update to the SKOS-vocabulary, preferably mechanically.

Requirement 6

Continuity & maintainability: Any modification to the SKOS-vocabulary shall result in a new version.

Requirement 7

Provenance: Any Media (Content) Type or Subtype added to the SKOS-vocabulary that originates from the UCO or CASE community, shall be categorised as such. This implies that for any Media (Content) Type / Media Subtype pair that exists, its provenance is maintained.

Risk / Benefit analysis

Benefits

The benefit of the stated objective is that data about, and experience from, users adding a type-by-IRI to the vocabulary become available. It is then possible to investigate how to improve user acceptance and minimise the technical knowledge required for adding a new type in this way.

The benefits of having a vocabulary of IANA Media Types available are:

  1. Using these concepts (individuals) as unequivocal types in UCO;
  2. Becoming interoperable with Dublin Core users, particularly those that employ http://purl.org/dc/terms/MediaType in their graph design.

Risks

Except in relation to the semi-openness of the vocabulary, the submitter is unaware of risks associated with this change.

Consequences

The intention of this CR is that the type-by-string design will be replaced by a type-by-IRI design. The foreseen consequences (not necessarily a comprehensive list) are as follows:

  1. A potential impact on the design of UCO: observable:mimeType would become an owl:ObjectProperty.
  2. A potential breaking change (i.e., not backwards compatible) between the current version and the version implementing this CR.

Competencies demonstrated

Competency 1

  • CQ: What Media Types are specified in the ontology?
  • Result: All Media Types that are specified by IANA Media Type Registry.

Competency 2

  • CQ: What Media Types carry <substring> (a string as an ordered sequence of characters) somewhere in their name, abbreviation or description? No constraints apply to the number of characters in the substring; the search is agnostic to diacritical marks, i.e., an a in the substring finds ā, ă, ä and similar characters.
  • Result: Including their meta-data, only Media Types and Subtypes from the IANA Media Type Registry are returned, the name, abbreviation or description of which contain the substring.
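
One possible way to obtain the diacritic-agnostic matching described in Competency 2 is Unicode decomposition followed by removal of combining marks. The sketch below is only an illustration under that assumption; the record layout, the field names, and the `application/x-uco-example` entry are hypothetical and not taken from the taxonomy.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks so that 'ā', 'ă' and 'ä' all compare equal to 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_media_types(records, substring):
    """Return records whose name, abbreviation or description contains substring."""
    needle = strip_diacritics(substring)
    fields = ("name", "abbreviation", "description")
    return [
        record
        for record in records
        if any(needle in strip_diacritics(record.get(field, "")) for field in fields)
    ]


# Hypothetical records; descriptions are invented for the example.
RECORDS = [
    {"name": "application/zip", "description": "ZIP archive"},
    {"name": "application/x-uco-example", "description": "données archivées"},
]
```

The same folding is applied to both the search term and the stored text, so a search for `donnees` would match the second record. The sketch stays case-sensitive, since the competency only mentions diacritics.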

Competency 3

Competency 4

As a security service provider, I want to reference application/tar, and I don't care whether it is an IANA media type or not. I've always said application/tar, it's been coded like that in my product for a decade, and my customers know I mean 'tape archive' when I say that.

  • CQ: Similar to Competencies 1, 2 and 3, however, now also allowing for Media (Content) Types and Subtypes that originate from the UCO and CASE community / tool providers.
  • Result: Similar to Competencies 1, 2 and 3, however, now limited to Media (Content) Types and Subtypes that originate from the UCO and CASE community / tool providers.

Competency 5

  • CQ: Which [Media (Content) Types | Media Subtypes] belong to [uco-something:IANAMediaType | uco-something:NonIANAMediaType]?
  • Result: A list of individuals of type [Media (Content) Types | Media Subtypes] that, according to specification, belong to [uco-something:IANAMediaType | uco-something:NonIANAMediaType]

Competency 6

  • CQ: Does <specific media (sub)type> belong to uco-something:IANAMediaType or uco-something:NonIANAMediaType?
  • Result: Either uco-something:IANAMediaType or uco-something:NonIANAMediaType, according to specification.

Solution suggestion

The taxonomy converts the IANA Media Types registry into SKOS under a UCO namespace, following a mostly two-tier skos:ConceptScheme:

  • The top-level concepts are the so-called Media (Content) Types, e.g., application, image, etc.
  • The second tier of concepts are the so-called Media Subtypes in each registry, such as application/zip, image/gif, etc.

Note that some extension media types not part of IANA are defined for various reasons, and may or may not be submitted in the future for standardization to IANA. These extensions follow the non-registration practice of [RFC 6838, Section 3.4], and all include the string [/x-uco-].

This repository's primary product is a monolithic ontology and taxonomy file, serialized in Turtle, mime.ttl.
(This repository is undergoing NIST review for release. If you are interested in providing early feedback, please contact @ajnelson-nist.)

UCO could subclass dcterms:MediaType with a new class uco-types:IANAMediaType, and a sibling uco-types:NonIANAMediaType in order to support Requirement 7 and Competency 4.

Coordination

  • Tracking in Jira ticket OC-204
  • Administrative review completed
  • Requirements to be discussed in Ontology Committee (OC) meeting, 2022-05-05T10:00-04:00
  • Requirements Review vote occurred, passing, on 2022-05-05
  • Requirements development phase completed.
  • Solution announced to OCS, 2022-08-17.
  • Solutions Approval to be discussed in OC meeting, 2022-08-25.
  • Solutions Approval vote has not occurred
  • Milestone linked
  • Documentation logged in pending release page
@ajnelson-nist
Contributor

ajnelson-nist commented Apr 15, 2022

In response to Competency Question 3, "Which content types carry zip?"

Answers to that question should return the various x formats of Microsoft Office files, e.g. .docx. That is, they should return any file with what would be recorded in today's UCO as:

...
{
  "@type": "observable:ContentDataFacet",
  ...
  "observable:mimeType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
  ...
}
...

This would be discoverable via the taxonomic knowledge that this MIME type is a narrower concept than application/zip. Many of these relationships with zip can be found by reviewing the IANA documentation pages (here is the .docx page). Unfortunately, searching those pages for just the string "zip" won't be a perfect zip-relationship finder, as application/gzip is not related.
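
The difference between a substring search and a taxonomic search can be sketched in a few lines. The `BROADER` table below is a hypothetical stand-in for skos:broader links and contains only the relationships named in this comment; the function name `is_kind_of` is likewise an assumption, and the recursion assumes the links contain no cycles.

```python
# Hypothetical excerpt of skos:broader links, keyed by media type notation.
DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
BROADER = {
    DOCX: {"application/zip"},
    "application/zip": set(),
    "application/gzip": set(),
}


def is_kind_of(notation, ancestor, broader=BROADER):
    """True if notation equals ancestor or reaches it by following broader links."""
    if notation == ancestor:
        return True
    return any(
        is_kind_of(parent, ancestor, broader)
        for parent in broader.get(notation, ())
    )


# Substring search: picks up application/gzip and misses the .docx type.
by_substring = sorted(n for n in BROADER if "zip" in n)
# Hierarchy search: picks up the .docx type and leaves application/gzip out.
by_hierarchy = sorted(n for n in BROADER if is_kind_of(n, "application/zip"))
```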

ajnelson-nist added a commit that referenced this issue May 4, 2022
This patch adds initial support for SKOS taxonomies as a
hierarchy-enabled alternative to UCO's current string-vocabularies
practice.

While the typical SKOS namespace is prefixed, the import of SKOS is
done by referencing an OWL-DL compatible subset of SKOS as a version IRI
import.  (See SKOS Reference C.3.)

Paul Brandt identified the SKOS strategy used in this patch, though his
initial draft was in another repository.  I ported the property to UCO's
ontology repository due to needing to satisfy references for SHACL.

This patch is a reduction from the first definition of
`core:TaxonomicConcept` drafted for UCO CP-99.

References:
* [OC-140] (CP-99) UCO should provide a SKOS taxonomy of device types
  for observable:deviceType
* #363
* https://www.w3.org/TR/skos-reference/#namespace-documents
  Section C.3, "SKOS RDF Schema - OWL 1 DL Sub-set (informative)"

Co-authored-by: Paul Brandt <paul@brandt.name>
Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
@ajnelson-nist ajnelson-nist linked a pull request May 4, 2022 that will close this issue
13 tasks
ajnelson-nist added a commit that referenced this issue May 4, 2022
Dublin Core's `dcterms:FileFormat` class is an existing class that
represents file formats, including but not limited to IANA Media Types.
(The reference to IANA Media Types is found on the `dcterms:format`
property, where the IANA list is recommended, but not required.)

`dcterms:FileFormat` is defined as an `rdfs:Class`.  In order for UCO,
as an OWL ontology, to use this class as a range of a property
(especially `observable:mimeType`), it needs to be designated as an
`owl:Class`.

Note there are some things this patch does not do:

* This patch does not import Dublin Core.  It appears Dublin Core's
  model is incompatible with OWL 2 DL.  For instance, the property
  `dcterms:format` appears to violate the OWL 2 DL separation of
  datatype properties (range of literals) from object properties (range
  of objects), on review of its sub-property `dcterms:extent` having a
  range of literals.
* This patch does not extend `FileFormat`'s superclasses to also be
  OWL Classes.
* This patch does not adapt the `dcterms:format` property noted above,
  because in addition to its OWL 2 DL compatibility issues, its semantic
  scope extends beyond data formats.

Compatibility notes:

Extending `dcterms:FileFormat` to be both an OWL Class and RDFS Class is
compatible with OWL 2 DL.  Reviewing this record:
https://www.w3.org/TR/2012/REC-owl2-mapping-to-rdf-20121211/#Parsing_of_the_Ontology_Header_and_Declarations
Table 5
Row 2

An ontology that also designates `dcterms:FileFormat` as an `rdfs:Class`
would have the pair of `owl:Class` and `rdfs:Class` statements reduced
to `owl:Class` only as part of its OWL parse.

If an ontology has chosen to import Dublin Core as a whole, it has
already implicitly made a commitment to use OWL 2 FULL instead of
OWL 2 DL.

References:
* #363
* https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#FileFormat
* https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#format

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit that referenced this issue May 4, 2022
For compatibility with other graph data models that use IANA Media
Types, the required range is set to `dcterms:FileFormat`.

References:
* #363

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit that referenced this issue May 4, 2022
This patch adds a class hierarchy to distinguish between known IANA
Media Types and Media Types known to not be registered with IANA.

A unit test is added to demonstrate how mimeType being objects can also
enable hierarchical searches, even between IANA and non-IANA types.

A follow-on patch will generate validation result files.

References:
* #363

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit that referenced this issue May 4, 2022
References:
* #363

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit that referenced this issue May 4, 2022
…feature

A follow-on patch will generate a Make-managed file.

References:
* #363

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit that referenced this issue May 4, 2022
References:
* #363

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
@ajnelson-nist
Contributor

One part of the Solution of this proposal has been posted, in PR 377. The earlier commits document the rationale for some class selection and extensions.

The unit tests in this PR demonstrate how MIME taxons can be used while still looking close to string-literal properties (see kb:file-8 in this file), and how taxons can be defined as IANA, non-IANA, or only the more general dcterms:FileFormat that has been found in use in other RDF data models (see the nodes in this file, or its corresponding tests in test_mime.py).

The MIME demonstrations in that PR also include demonstration of a hierarchy between Media Types, including some "aliasing" relationships. test_mime.py includes demonstrations of finding all application/gzip content, even if that content is encoded as narrower taxons such as application/tar+gzip.

I cannot yet share the repository that generates the IANA Media Registry, but here are a few examples of taxons included in its generated monolithic resource:

<http://purl.org/NET/mediatypes/application/gzip>
        a
                dcterms:FileFormat ,
                skos:Concept
                ;
        dcam:memberOf dcterms:IMT ;
        skos:exactMatch <https://taxonomy.unifiedcyberontology.org/uco/mime/0.0.1/application/gzip> ;
        .

<https://taxonomy.unifiedcyberontology.org/uco/mime/0.0.1/application/gzip>
        a
                prov:Entity ,
                uco-types:IANAMediaType
                ;
        rdfs:isDefinedBy <https://www.iana.org/assignments/media-types/application/gzip> ;
        skos:exactMatch <http://purl.org/NET/mediatypes/application/gzip> ;
        skos:inScheme uco-mime:MIMEScheme ;
        skos:notation "application/gzip" ;
        .

<https://taxonomy.unifiedcyberontology.org/uco/mime/0.0.1/application/tar+gzip>
        a uco-types:NonIANAMediaType ;
        rdfs:comment "The media type suffix +gzip is registered in RFC 8460, Section 6.3."@en ;
        rdfs:seeAlso <https://www.rfc-editor.org/rfc/rfc8460.html#section-6.3> ;
        skos:broader
                <https://taxonomy.unifiedcyberontology.org/uco/mime/0.0.1/application/gzip> ,
                <https://taxonomy.unifiedcyberontology.org/uco/mime/0.0.1/application/tar>
                ;
        skos:notation "application/tar+gzip" ;
        .

I selected those excerpts because:

  • .tar.gz is a file extension encountered in supply chain security review, yet currently lacks a taxon in the IANA registry
  • Dublin Core provides a class and prefix for Media Types in their metadata publishing guide, including the demonstration in Turtle ex:myPicture dcterms:format mime:jpeg ..
  • The second UCO taxon defined in the above excerpt covers some uniqueness constraints found in the SKOS specification. In short, because of these constraints, applying SKOS annotations to concepts outside one's own namespaces can lead to an inconsistent SKOS knowledge base. Hence, there is a set of Media Types UCO defines that is mostly parallel with the IANA+Dublin Core set.

There is also a prov:Entity declaration, which handles another "Aliasing" issue that happens to be relevant to UCO users. text/xml is defined in RFC 7303 as an alias of application/xml. There turns out to not be a SKOS mechanism to support one concept being an alias of another when both are in the same concept scheme, so the taxonomy resorts to prov:alternateOf to handle these aliasing cases.

At the end of the day, the (pending-release) taxonomy will permit discovery of all XML files in a knowledge base by including this in a SPARQL query:

?nMimeType
    skos:exactMatch* /
      skos:broaderTransitive* /
        prov:alternateOf* /
          skos:exactMatch* /
            skos:notation "text/xml" ;
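
For readers less familiar with SPARQL property paths, the walk that this path performs can be approximated in plain Python. This is a simplified sketch: it ignores the skos:exactMatch steps, it uses notations in place of IRIs, and the edge tables are hypothetical. Only the text/xml alias of application/xml comes from this thread; the direction of the alternateOf edge and the application/atom+xml entry are assumptions.

```python
def closure(seeds, edges):
    """Nodes reachable from seeds by following edges zero or more times."""
    seen = set(seeds)
    stack = list(seeds)
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def matches_notation(start, target, broader, alternate_of):
    """Mimic broader* followed by alternateOf* from start, then compare notations."""
    after_broader = closure({start}, broader)
    after_alias = closure(after_broader, alternate_of)
    return target in after_alias


# Hypothetical edges, keyed by notation for readability.
BROADER = {"application/atom+xml": {"application/xml"}}
ALTERNATE_OF = {"application/xml": {"text/xml"}}
```

With these tables, `matches_notation("application/atom+xml", "text/xml", BROADER, ALTERNATE_OF)` follows one broader step and one alias step before the notation comparison succeeds.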

@ajnelson-nist
Contributor

As an aside, it was surprisingly complex to determine how to work with forward slashes in prefixed concepts between Turtle, JSON-LD, and SPARQL. Each syntax handles forward-slash differently. Further details for those curious are in this bug report and its attached PR. Fortunately, usage of skos:notation within the present proposal saves UCO from having to deal with that nuance, because what one would search for in SPARQL would be a string value, and what one would encode in JSON-LD can look like the Media Type name, with some kind of mime: prefix.

ajnelson-nist added a commit to casework/CASE-Examples that referenced this issue May 10, 2022
A follow-on commit will regenerate Make-managed files.

References:
* ucoProject/UCO#357
* ucoProject/UCO#363

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to casework/CASE-Examples that referenced this issue May 10, 2022
References:
* ucoProject/UCO#363

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to casework/casework.github.io that referenced this issue May 10, 2022
A follow-on patch will regenerate Make-managed files.

References:
* ucoProject/UCO#357
* ucoProject/UCO#363

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to casework/CASE-Examples that referenced this issue May 10, 2022
The objects will temporarily use the drafting namespace while the
taxonomy is being implemented.

This review also caught a typo of MIME types - JPEGs should be
`image/jpeg`, not `image/jpg`.

A follow-on patch will regenerate Make-managed files.

References:
* ucoProject/UCO#363

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to casework/CASE-Examples that referenced this issue May 10, 2022
References:
* ucoProject/UCO#363

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to casework/casework.github.io that referenced this issue May 10, 2022
The objects will temporarily use the drafting namespace while the
taxonomy is being implemented.

A follow-on patch will regenerate Make-managed files.

References:
* ucoProject/UCO#363

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to casework/casework.github.io that referenced this issue May 10, 2022
References:
* ucoProject/UCO#363

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
@plbt5
Contributor Author

plbt5 commented Jun 3, 2022

@ajnelson-nist What needs to be done before submitting this to solutions review?

@ajnelson-nist ajnelson-nist linked a pull request Jun 22, 2022 that will close this issue
13 tasks
@ajnelson-nist
Contributor

The taxonomy repository has been posted here, with some notations that it is currently experimental and needs committee review:

https://github.com/ucoProject/UCO-Taxonomy-MIME

We should now have sufficient demonstration available to show whether these changes to observable:mimeType and related are suitable for adoption.

@sbarnum
Contributor

sbarnum commented Aug 25, 2022

This continues to be an invalid change proposal under the CDO charter foundational principles as discussed at length for the issue proposing changing controlled vocabularies to use semantic individuals rather than strings.
This is simply a proposal to make such a change for a single vocabulary. It is invalid for the same reasons the broader issue is invalid.
This (mime types) is even one of the two example areas (the other being kindOfRelationship) explicitly discussed where the proposed changes will quickly lead to problems, because they are areas where UCO can never assure completeness. Both of these areas will always include values where UCO individuals are not current or where specific parties may need to use values outside of any appropriate standardization within UCO. Ironically, to demonstrate this concern, the official IANA media type registry was updated on Aug 22, 2022, after the MIME taxonomy proposed here was last updated, meaning it is already out of date.
The proposal is invalid due to a requirement that a user ontologically define an individual for a mime type (for which there is a clear string format standard defined) they wish to use that is outside the current UCO taxonomy version which is an impractical burden for the large majority of likely adopters. In this case, if a user needs to reference a MIME type not currently in the UCO taxonomy, such as if they submitted it and it was approved for IANA registration but a new UCO taxonomy updated version has not been released, they will be forced to define their own ontological entry (an already impractical burden) but now when the UCO taxonomy is updated with the new IANA registry they will now have divergent content (the interim individual they created and the new "official" individual).

When the issue of the invalidity of the proposal for changing vocabs to semantic individuals was discussed, the ball was placed in the proposer's court to suggest an approach that allowed the rigor of semantic individuals while allowing the required flexibility of strings. I have seen no such suggestion, and this specific change proposal does not offer one but rather is simply a single subcase of the broader proposal that was rejected.

@ajnelson-nist
Contributor

Re: Rejection of broader proposal

I do not believe the proposal was rejected, but rather withheld due to lack of technical knowledge. We have significantly more technical knowledge now.

Re:

... under the CDO charter foundational principles as discussed at length ...

Only one person has declared those principles foundational, along with other similarly strong language, and we have already seen those principles (I suspect the exact one you're referencing) used in an attempt to move UCO away from a technology that is essential for any interoperability with other RDF-based technologies, which would significantly isolate UCO if enacted.

The entire UCO design document is still under committee review.

Re: incompleteness

I agree on lag being a possibility. The taxonomy does create a need for an operational tempo. In support of a regular deployment tempo, maintenance of the taxonomy for additions is almost entirely scriptable. make already handles refreshing the taxonomy entirely (observe its CI). It doesn't yet handle stamping a new versionIRI, but that scripted step can be added as well. I'm not sure if GitHub steps for generating releases could or should be automated, but "Refreshes" could be a calisthenic for GitHub mechanics, which our community does need in discussions with the "Product Manager" role that has been left unfilled for a while.

Re:

The proposal is invalid due to a requirement that a user ontologically define an individual for a mime type (for which there is a clear string format standard defined) they wish to use that is outside the current UCO taxonomy version which is an impractical burden for the large majority of likely adopters.

You missed some details where the semi-open vocabulary enforcement pattern of gently (sh:Info-severity) suggesting using individuals stronger than dcterms:FileFormat is used.

The range of mimeType is changed to the class dcterms:FileFormat. Those individuals exist outside of UCO, and don't need to exist as defined by Dublin Core. (Indeed, today they aren't deployed, which is a graph-provisioning matter to discuss with their team. I've had some pleasant maintenance interaction with them recently, though, so I think this is approachable.)
observable:mimeType-class-types-MIMEFormat implements that gentle suggestion.

A firmer sh:Warning-severity shape requests that whatever individuals are used have a skos:notation, so the text form can be available. I don't think that should be made into a sh:Violation-severity shape, because if the individuals do get provided by Dublin Core, they aren't guaranteed to use SKOS concepts.
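
The two graded checks can be pictured roughly as follows. This is a plain-Python sketch of the intent, not the SHACL shapes in the PR: the function name, the representation of an individual as a set of class names plus a list of notations, and the choice to accept either of the two UCO classes as "stronger than dcterms:FileFormat" are all assumptions made for illustration.

```python
# Class names as they appear in this thread; treating them as the "stronger"
# classes is an assumption of this sketch.
STRONG_TYPES = {"uco-types:IANAMediaType", "uco-types:NonIANAMediaType"}


def review_mime_individual(types, notations):
    """Return (severity, message) pairs for one individual used as a mimeType value."""
    findings = []
    if not STRONG_TYPES & set(types):
        # Gentle suggestion only: the value is still accepted.
        findings.append(
            ("sh:Info", "consider a more specific class than dcterms:FileFormat")
        )
    if not notations:
        # Firmer request, but deliberately not a violation.
        findings.append(
            ("sh:Warning", "no skos:notation, so no text form is available")
        )
    return findings
```

Neither finding blocks the data; an individual typed only as dcterms:FileFormat with no notation would simply collect both messages.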

Re: Keeping mimeType a string

I have not seen arguments against the benefits of this proposal in cases that strings make more difficult:

  • Does the MIME type for JPEG pictures have the e in jpeg or no? How do we enable interoperability when two significant knowledge bases disagree?
  • What is the MIME type string for tarballs? I found a few candidate spellings. How would UCO encode that this set of non-IANA-recognized values mean "tarball"?
    • application/tar+gzip
    • application/x-tar+gzip
  • Many formats are specialized zips. How could we encode that application/vnd.openxmlformats-officedocument.wordprocessingml.document entails an additional MIME type of application/zip?

I still believe UCO's current pattern of semi-open vocabularies will hit a technical insurmountability with "Extension" vocabularies that are sets of strings that somehow use a non-public namespace. I admit the implementation in this MIME proposal does not yet get all the way to using the pattern I think the other properties will need. Other properties, especially kindOfRelationship as a string, need to be anchored off of unique string labels of concepts tied into classes in UCO. For the MIME proposal, SHACL-SPARQL seems to be the only functional mechanism.

If this proposal is not accepted today, I will adjust it. The MIME taxonomy does by itself serve a need in CASE-Corpora (DCAT uses dcterms:FileFormat), so we may need to look to demonstration in CASE-Corpora to understand when the implementation is satisfactory.

@sbarnum
Contributor

sbarnum commented Aug 25, 2022

Only one person has declared those principles foundational

This is absolutely NOT one person declaring those principles foundational.
They are explicitly defined as such in the CDO charter and the charter specifically states that changes can not be made by the TSC or OCs that violate them.
It is simply not an option and not valid to hold a requirements review or solutions approval vote on.

As discussed before, I fully recognize the benefits of having a formally defined taxonomy (some of which you identify above) and would love to have one. However, we cannot require users of UCO to refer to values directly as members of such a formal taxonomy for all the reasons discussed. It is very impractical for the large majority of UCO target users to have ontological knowledge and more specifically knowledge of how UCO uses ontology to define taxonomy individuals simply to express values outside what UCO has defined, especially for values such as MIMEType and kindOfRelationship where any UCO taxonomy or any other is guaranteed to be incomplete.
I have no objection to including the taxonomy as a definition reference for string values used but it must not be the direct mechanism for expressing those values.
At a minimum, we need to be able to express values as strings.
It would be better to have that with a reference taxonomy of known (vocab) string values.
It would be best to find a way to more formally bridge between flexible string values and the reference taxonomy when used values align to reference definitions for those values.
I have not yet seen a proposal for how to do that last part.

@ajnelson-nist
Contributor

This was not voted on today for a few reasons. Not necessarily all of them are listed here:

  • There is an operations issue where the MIME taxonomy would need to have a designated maintenance team.
  • There is a solution possible using a new object-property, something like observable:mimeTaxon, that would let workflows that have stronger needs for graph individuals use those types rather than the string-typed mimeType.
  • There is likely a more general design solution to semi-open vocabularies, that could let this proposal turn into a backwards-compatible change, and not require implementing a string-enumerant rdf:List in the uco-vocabulary namespace. The solution would likely involve a SHACL-SPARQL constraint.

A further note on the current uncontrolled nature of mimeType strings: This property currently has no restrictions on its values. A proposal was raised in the committee call this morning to implement a vocabulary, and I opposed it for an unfortunate technical reason. There is a performance issue with rdf-toolkit.jar where it suffers a significant slowdown when working with rdf:Lists of any appreciable length. I'm not sure what the exact number for "Appreciable" is, but the observable relationship vocabulary currently in the uco-vocabulary namespace is past that number, and it currently takes over a minute to normalize that file's contents. If this is a super-linear slowdown, a mimeType controlled vocabulary set could make normalizing the uco-vocabulary namespace untenable. This is an unfortunate issue that needs to be worked through with the maintainers of rdf-toolkit.

Looking forward, I still believe IRI-identified graph individuals will be a necessity as a full replacement for the current UCO semi-open vocabulary design. I think the resolution, using classes developed for this proposal, would involve something like this query as part of a sh:PropertyShape with target sh:targetObjectsOf observable:mimeType:

SELECT $this
WHERE {
    FILTER NOT EXISTS {
        ?nTaxon
            a/rdfs:subClassOf* types:MIMEFormat ;
            skos:notation $this ;
            .
    }
}

I think that is a correct way, but using a NOT EXISTS makes me wonder if it's a sufficiently fast way.
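
Outside of SPARQL, the same check amounts to asking whether each string used as a mimeType value appears among the notations carried by some taxon. The sketch below shows that logic only; it says nothing about how a SPARQL engine would evaluate the NOT EXISTS. The function name and the sample data are hypothetical.

```python
def unknown_mime_strings(used_values, known_notations):
    """Return used mimeType strings that no taxon carries as a skos:notation."""
    known = set(known_notations)  # set membership keeps each lookup cheap
    return [value for value in used_values if value not in known]


# Hypothetical data: two notations mentioned in this thread and one misspelling.
KNOWN = ["application/gzip", "image/jpeg"]
USED = ["image/jpeg", "image/jpg", "application/gzip"]
FLAGGED = unknown_mime_strings(USED, KNOWN)
```

Here `FLAGGED` would contain only the `image/jpg` spelling, which is the kind of value the shape is meant to surface.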

I think this will need to be exercised further before returning for committee review. Likely, it will be tried in CASE-Corpora as that graph grows, particularly with an eye towards "govdocs1".

@ajnelson-nist ajnelson-nist removed this from the UCO 1.0.0 milestone Aug 25, 2022
@ajnelson-nist ajnelson-nist added this to the UCO 1.x.0 milestone Aug 25, 2022
ajnelson-nist added a commit to ucoProject/UCO-Archive that referenced this issue Aug 25, 2022
This reverts commit f4a0111, reversing
changes made to 296497e.

Reverting from `unstable` branch due to scheduling for post-1.0.0.

References:
* ucoProject/UCO#363

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to casework/CASE-Archive that referenced this issue Aug 25, 2022
References:
* ucoProject/UCO#363

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to casework/CASE-Examples that referenced this issue Aug 25, 2022
A follow-on patch will regenerate Make-managed files.

References:
* ucoProject/UCO#363

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to casework/CASE-Examples that referenced this issue Aug 25, 2022
References:
* ucoProject/UCO#363

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to casework/casework.github.io that referenced this issue Aug 25, 2022
A follow-on patch will regenerate Make-managed files.

References:
* ucoProject/UCO#363

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to casework/casework.github.io that referenced this issue Aug 25, 2022
References:
* ucoProject/UCO#363

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>