Resolve current bug in UCO that does not require globally unique IDs for all class objects #430
Comments
I believe this proposal is strategically wrong and will file two proposals correcting underlying issues. The short is |
Looking again, I now think only the parts of this proposal pertaining to However, there is another piece that I think is missing from your solution suggestion. We allow Last, I remember we had discussed this before in Jira, and I had asked you for an example and you might not have gotten a notice of the Jira comment. How would you represent a file that has a hash? I think that is going to be an essential sanity-check. |
@sbarnum : Also, if the top-most class in UCO would now be []
a owl:AllDisjointClasses ;
owl:members (
array:ArrayOfAction
tool:BuildConfigurationType
# ... there are actually quite a lot ...
core:Facet
core:UcoObject
# ...
) ;
. It's actually a bit of a surprise when looking at what Protege displays as subclasses of |
@ajnelson-nist Good catch on changing sh:nodeKind sh:BlankNodeOrIRI to sh:nodeKind sh:IRI on ObjectProperty SHACL shapes. Here is an example of a file with a hash: {
"@id": "kb:file-a0a69ece-da9c-4256-a9a8-5dec82a4ad1f",
"@type": "uco-observable:File",
"uco-core:hasFacet": [
{
"@id": "kb:ContentDataFacet-1e54fa5e-1399-476c-8aa7-00781b8c12db",
"@type": "uco-observable:ContentDataFacet",
"uco-observable:hash": [
{
"@id": "kb:hash-87c24a7f-a0d2-41a3-a726-0521a5c7bc8c",
"@type": "uco-types:Hash",
"uco-types:hashMethod": {
"@type": "uco-vocabulary:HashNameVocab",
"@value": "SHA256"
},
"uco-types:hashValue": {
"@type": "xsd:hexBinary",
"@value": "e5ca3be56f66200a1bb2262e948ac08dbc672bc8033c1ada743787b0c667dea6"
}
}
]
}
]
} |
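A minimal sketch of how the "every node needs a non-blank identifier" requirement could be checked on a JSON-LD document like the one above, using only the Python standard library. The function name `nodes_missing_id` and the shortened `kb:` identifiers are hypothetical, and the walk assumes compacted JSON-LD in which value objects carry `@value`; it is not a substitute for SHACL validation.

```python
import json


def nodes_missing_id(obj, path="$"):
    """Yield JSON paths of node objects that lack a non-blank "@id"."""
    if isinstance(obj, list):
        for index, item in enumerate(obj):
            yield from nodes_missing_id(item, f"{path}[{index}]")
    elif isinstance(obj, dict):
        if "@value" in obj:
            return  # value object (typed literal), not a node
        node_id = obj.get("@id")
        if not isinstance(node_id, str) or node_id.startswith("_:"):
            yield path
        for key, value in obj.items():
            if not key.startswith("@"):
                yield from nodes_missing_id(value, f"{path}.{key}")


# Hypothetical document: the Facet deliberately has no "@id".
doc = json.loads("""
{
  "@id": "kb:file-1",
  "@type": "uco-observable:File",
  "uco-core:hasFacet": [
    {
      "@type": "uco-observable:ContentDataFacet",
      "uco-observable:hash": [
        {
          "@id": "kb:hash-1",
          "@type": "uco-types:Hash",
          "uco-types:hashMethod": {
            "@type": "uco-vocabulary:HashNameVocab",
            "@value": "SHA256"
          }
        }
      ]
    }
  ]
}
""")

missing = list(nodes_missing_id(doc))
print(missing)
```

In this sketch only the Facet is reported, because the typed literal under `uco-types:hashMethod` is a value object and is skipped.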
I have no objections to expanding the disjoint statement to include all classes that only have owl:Thing as a superclass (i.e. add in all of the classes that are neither subclasses of UcoObject or Facet). |
FYI, the |
I very fundamentally disagree with the assertion to remove core:id and core:type properties. |
Oops. I got id happy. LOL> I will fix it. |
I fixed the example to remove my extraneously added ids. |
I updated the CP to include the changes to the ObjectProperty SHACL shapes' `sh:nodeKind` and the class disjoint statement. |
I realized that our JSON-LD context should contain the following: "core:id": "@id",
"core:type": "@type", Rather than "id": "@id",
"type": "@type", In this way the plain json cleanly aligns to the ontology as expected and the context does the work of mapping those properties to @id and @type. We can also add any documentation we want to the json-ld context file outside of the "context" definition object that documents details of our json-ld serialization. The processor will simply ignore the extra content. I am going to make the above change to the json-ld context proposal. |
"core:id": "@id",
"core:type": "@type", That breaks JSON-LD if |
In terms of what UCO has committed to developing technologically for 1.0.0, JSON-LD is in scope, and we are trying very hard for JSON that is not JSON-LD. Other non-RDF syntaxes have not been presented as specific use cases. |
Further, |
Re: {
"@id": "kb:hash-87c24a7f-a0d2-41a3-a726-0521a5c7bc8c",
"@type": "uco-types:Hash",
"uco-types:hashMethod": {
"@type": "uco-vocabulary:HashNameVocab",
"@value": "SHA256"
},
"uco-types:hashValue": {
"@type": "xsd:hexBinary",
"@value": "e5ca3be56f66200a1bb2262e948ac08dbc672bc8033c1ada743787b0c667dea6"
}
} This On the brighter side, if As a summary effect: I would like
To:
May we expand the scope of this proposal to include this revision to |
I think I may have discovered the root of our disconnect. I just noticed that types:Identifier is currently only defined as a generic rdfs:Datatype with no further detail. I think the issue is we need to complete the definition of types:Identifier as described above. Once that is done, I believe the rest of this CP should work unless I am completely missing something. At that point types:Identifier is a string with particular value constraints. in the json-ld context simply changes "uco-core:id": "kb:hash-87c24a7f-a0d2-41a3-a726-0521a5c7bc8c",
"uco-core:type": "uco-types:Hash", to "@id": "kb:hash-87c24a7f-a0d2-41a3-a726-0521a5c7bc8c",
"@type": "uco-types:Hash", The value strings do not change at all. They are valid values of the core:id and core:type properties, including core:id having a range of types:Identifier. They are also valid in JSON-LD, as the value of @id is a valid node identifier and the value of @type is a valid rdf:type string identifier. Am I missing some other dimension to this, or was the root of our disconnect the fact that types:Identifier is currently incompletely defined? |
You are still not understanding that trying to use this will break JSON-LD: "core:id": "@id",
"core:type": "@type", Please test that. |
You are correct that
are invalid. I still have not seen any convincing argumentation/evidence that the presence of core:id and core:type "break" anything. What I have seen is an assertion that they may be confusing in regards to the bindings for these concepts to RDF which I would agree with. The remaining challenge I see is how we express the requirements for these concepts/properties if we remove them. For other serializations these requirements and linkages are not implicit and we need a way to convey them. While RDF/JSON-LD are the specific minimally targeted and fully supported serializations for 1.0.0 there is a significant difference between the intention to fully support other serializations and simply to not make decisions that block them. It has always been a fundamental principle of UCO that our serialization support is inclusive not exclusive. For 1.0.0 we are not going to fully flesh out serialization support beyond RDF/JSON-LD but we need to make sure we do not presume that these will be the only serializations for UCO and make design/implementation decisions that prevent other serializations from being practical. If we can identify how we can do the following without the core:id and core:type properties then I am okay with removing them for 1.0.0:
|
Re:
I feel this is an impossible requirement to satisfy a priori. I know of no enumeration of serialization formats broken out by whether they have an elementary structure of a node identifier or not. XML outside of RDF doesn't. YAML...I don't know. For the targeted support serializations, which are based on RDF, Re:
You can say "Desired," but it would be a complete information siloing act to say "Required." If you require a format for node identifiers, UCO is incompatible with every application that predates UCO, where <http://www.wikidata.org/entity/Q2464882>
rdfs:label "Netherlands Forensic Institute"@en . (Edit: I'd initially copied the URL instead of the concept IRI. Now fixed here and below.) If this next block of Turtle is invalid UCO because of that yet-unspecified <http://www.wikidata.org/entity/Q2464882>
a uco-identity:Organization ;
rdfs:label "Netherlands Forensic Institute"@en . I do not think it would be helpful for UCO to attempt prescribing any type of format for concept IRIs. I'd omitted removing the Last, re: For RDF-based applications, I think this proposal's requirements on nodes bearing non-blank identifiers can be satisfied with |
@sbarnum , something else you should be aware of: Some JSON-LD serializers are likely to make every node that has an I'm not actually 100% sure whether there is a technical solution to this yet, or if the problem has non-standard workarounds, but there is a specification that tries to say when some objects, even with |
Also, there is a slight error in some of the motivation for this proposal:
This is incorrect if remaining in the context of RDF processors sending data between one another. If a blank node is loaded, the RDF processor must generate a process-local identifier on reading. These two files would not cause a conflict if loaded into the same graph instance: _:x rdfs:comment "I am node x." . _:x rdfs:comment "I am node x." . Yes, they are the same content to the eye, but the engine will assign a new (typically skolemized) random-ish identifier in place of I believe there is next to no risk of ID conflicts when merging content. That said, there are other detractors to using blank nodes, because even when you see their name serialized like |
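The scoping behavior described in this comment can be illustrated with a toy model. This is only a sketch, assuming RDF 1.1 semantics in which blank node labels are local to the document they appear in; `load` is a hypothetical helper, triples are plain string tuples, and no real RDF library is used.

```python
import itertools

# Process-wide counter used to mint fresh blank node identifiers.
_fresh_ids = itertools.count()


def load(triples, graph):
    """Add one document's triples to graph, relabelling its blank nodes.

    Labels such as "_:x" are scoped to a single document, so each call
    maps every label it sees to a new process-local identifier.
    """
    relabelled = {}

    def relabel(term):
        if term.startswith("_:"):
            if term not in relabelled:
                relabelled[term] = f"_:b{next(_fresh_ids)}"
            return relabelled[term]
        return term

    for subject, predicate, obj in triples:
        graph.add((relabel(subject), predicate, relabel(obj)))


# Two separate files that happen to use the same blank node label.
file_a = [("_:x", "rdfs:comment", '"I am node x."')]
file_b = [("_:x", "rdfs:comment", '"I am node x."')]

graph = set()
load(file_a, graph)
load(file_b, graph)

subjects = {subject for subject, _, _ in graph}
print(len(subjects))
```

Under these assumptions the merged graph holds two distinct subjects, one per file, rather than a single node `x` shared between them.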
@sbarnum - while reviewing UCO's Jira backlog, I came across OC-200 that runs through a whole list of things (many in the observable namespace) that have no parent class. Rather than enumerate those classes here, I believe the solution of this proposal needs to incorporate the following SPARQL query into CI, failing CI if there are any finds other than your proposed top-level class. SELECT ?nClass
WHERE {
?nClass a owl:Class .
FILTER NOT EXISTS {
?nClass rdfs:subClassOf ?nOtherClass .
}
} That query should be run against the monolithic build of UCO (a temporary artifact of the CI workflow under |
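As a rough illustration of what such a CI check would compute, the sketch below applies the same "class with no rdfs:subClassOf" test to an in-memory list of triples. The function name, the `allowed_roots` set, and the sample triples (including the orphaned `types:Hash`) are hypothetical; like the SPARQL query, it does no inference, and a real check would run the query against the monolithic build.

```python
def classes_without_superclass(triples):
    """Return declared owl:Class subjects that have no rdfs:subClassOf triple."""
    classes = {s for s, p, o in triples if p == "rdf:type" and o == "owl:Class"}
    has_parent = {s for s, p, _ in triples if p == "rdfs:subClassOf"}
    return sorted(classes - has_parent)


# Hypothetical excerpt of an ontology, as (subject, predicate, object) tuples.
triples = [
    ("core:UcoThing", "rdf:type", "owl:Class"),
    ("core:UcoObject", "rdf:type", "owl:Class"),
    ("core:UcoObject", "rdfs:subClassOf", "core:UcoThing"),
    ("types:Hash", "rdf:type", "owl:Class"),  # orphan, for the demo only
]

# The single top-level class the proposal would permit (an assumption here).
allowed_roots = {"core:UcoThing"}

orphans = [c for c in classes_without_superclass(triples) if c not in allowed_roots]
print(orphans)
```

A CI job built on this idea would fail whenever `orphans` is non-empty.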
Also, a style matter, more artistic opinion than technical issue:
|
As a further argument for "Here in my graph, I have X, a UCO types hash, which is also a UCO core class base, which is also an OWL thing." Versus: "Here in my graph, I have X, a UCO types hash, which is also a UCO core UCO thing, which is also an OWL thing." |
I agree on having a CI SPARQL check to ensure all classes have defined superclasses. I also do not object to core:UcoThing. |
I state with the certainty of experience that blank nodes WILL cause integrity issues when merged into a graph store. Unique IRIs are required for all objects. |
@sbarnum : you made a few claims in yesterday's meeting, about blank node behaviors, that did not agree with my understanding of some specification---I assume RDF's---and how blank nodes behave when consumed by multiple tools. That is one of the key motivators for this proposal, and your citation chain currently stops at "[your] experience." Part of the solution for this proposal will be implementing this query as part of a SHACL-SPARQL constraint: SELECT ?nThing
WHERE {
?nThing a/rdfs:subClassOf* uco-core:UcoObject .
FILTER (
! REGEX (
STR(?nThing),
"[0-9a-f]{8}-[0-9a-f]{4}-[0-5][0-9a-f]{3}-[0-9a-f]{4}-[0-9a-f]{12}$",
"i"
)
)
} (That will be adapted to use I believe this is a pretty significantly CPU-expensive query to compute, and person-expensive query to review when a use case justifies using an IRI form that does not end with UUIDs. I would strongly prefer its usage be justified by more than "Your experience." Can you please provide, for the understanding of users downstream who come to UCO complaining about the runtime or log-volume of this review rule:
_:x <http://www.w3.org/2000/01/rdf-schema#comment> "I am anonymous-node x." . _:x <http://www.w3.org/2000/01/rdf-schema#comment> "I am ANOTHER anonymous-node x." . I had expected any RDF 1.1-conformant tool that loads those two files would have two independent subjects with one comment each, not one subject with two comments. I haven't seen |
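For readers who want to see what the UUID-suffix pattern in the SHACL-SPARQL query above accepts and rejects, here is a small sketch that applies the same regular expression in Python. The `http://example.org/kb/` prefix is an assumption standing in for `kb:`, and `lacks_uuid_suffix` is a hypothetical helper name.

```python
import re

# Same pattern as the SPARQL REGEX call, with the "i" flag as IGNORECASE.
UUID_SUFFIX = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-5][0-9a-f]{3}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def lacks_uuid_suffix(iri):
    """Return True when the IRI does not end with a UUID-shaped string."""
    return UUID_SUFFIX.search(iri) is None


iris = [
    # Assumed expansion of kb:hash-87c24a7f-... from the example earlier.
    "http://example.org/kb/hash-87c24a7f-a0d2-41a3-a726-0521a5c7bc8c",
    # The Wikidata concept IRI discussed in this thread.
    "http://www.wikidata.org/entity/Q2464882",
]

flagged = [iri for iri in iris if lacks_uuid_suffix(iri)]
print(flagged)
```

Only the Wikidata IRI is flagged here, which is the kind of result a reviewer would then have to accept or justify case by case.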
References: * #430 Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
A follow-on patch will regenerate Make-managed files. References: * #430 Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
A follow-on patch will regenerate Make-managed files. References: * ucoProject/UCO#430 Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
References: * ucoProject/UCO#430 Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
References: * ucoProject/UCO#430 * ucoProject/UCO#467 Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
A follow-on patch will regenerate Make-managed files. References: * ucoProject/UCO#430
A follow-on patch will regenerate Make-managed files. References: * ucoProject/UCO#430 * [ONT-295] Release CASE 1.0.0 Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
References: * ucoProject/UCO#430 * [ONT-295] Release CASE 1.0.0 Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
One potential bug has been flagged with this shape, implemented in UCO Issue 406: `uco-owl:ObjectProperty-shacl-constraints-shape` The `sh:PropertyShape` raising the bug has been given an IRI in order to link a deactivation rationale. A new shapes file `debug.ttl` has been added to disable that shape until a test is written to confirm the CASE-Corpora shape is correct. `Facet`s that were blank nodes have been given IRIs, per the implementation of UCO Issue 430. New `sh:Info`-severity violations are reported for some URLs treated in the "URL as an `rdfs:Resource`" manner, which will not be given UUID endings. `case_validate` is called with `--allow-warnings`, but is intended to be called with `--allow-infos`; that will have to wait for `case-utils` Issue 70 to resolve. Imports of CASE and UCO ontologies now use their `owl:versionIRI`s, implemented in UCO Issue 437. A follow-on patch will regenerate Make-managed files. References: * casework/CASE-Utilities-Python#70 * ucoProject/UCO#406 * ucoProject/UCO#430 * ucoProject/UCO#437 Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
This is expected to trigger a CI failure from at least usage of blank nodes for UCO concepts, disallowed with the release of UCO 1.0.0. References: * ucoProject/UCO#430 Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
Background
The following excerpted portion of the UCO Design Document (https://unifiedcyberontology.org/resources/uco_design_document.html) provides a summary overview of the various types of classes in UCO and how they work together.
The last line of the above excerpt is very important and highlights an overlooked bug in the current and past implementations of UCO.
Currently, only UcoObject specifically codifies the core:id and core:type properties providing/requiring a globally unique identifier for each instance of the class.
Without such a codification and requirement, subclasses of core:Facet or any other structured classes (core:ExternalReference, marking:GranularMarking, observable:MimePartType, etc.) in UCO are simply treated as blank nodes with a locally (NOT globally) defined ID.
From the W3C wiki page (https://www.w3.org/wiki/BlankNodes) on blank nodes:
This means that UCO content within a single file, or produced within a single, uniform store of information, can hang together coherently. However, as soon as you attempt to merge or blend graphs from different files or information stores (a critical, fundamental purpose of UCO), the graph falls apart: without globally unique IDs, non-UcoObject class objects lose coherence with the UcoObject they are part of. RDF processors typically assign local node IDs using similar or identical algorithms for each set of content, leading to a certainty of ID conflicts in merged content.
This is a critical bug that needs to be addressed.
Requirements
Requirement 1
Every individual instance of a UCO class must have a globally unique ID.
Requirement 2
Merged graphs of UCO content from different files, information stores, or producers must maintain relational graph integrity, such that non-UcoObject class objects keep a unique and coherent relation to the UcoObjects they are an inherent part of.
Risk / Benefit analysis
Benefits
Content blended from multiple UCO graphs (a fundamental purpose of UCO) will be possible.
Risks
Increases each non-UcoObject class object by one property.
Existing examples will need to be updated.
Competencies demonstrated
Competency 1
Maintain integrity of UCO content in merged graphs from multiple origins
Competency Question 1.1
Query a UcoObject containing inherent embedded class content (e.g. a File observable object containing a FileFacet with property content)
Result 1.1
Return the full UcoObject with all of the embedded (FileFacet) content with accuracy and integrity
Competency Question 1.2
Query a merged graph for multiple UcoObjects (from different origin graphs) containing inherent embedded class content.
Result 1.2
Return the full UcoObjects with all of the embedded (FileFacet) content with accuracy and integrity
Solution suggestion
sh:nodeKind sh:IRI rather than sh:nodeKind sh:BlankNodeOrIRI
This proposed solution uses a defined common base class for all UCO classes to specify the required globally unique ID. That is cleaner than simply adding core:id and core:type to each of the non-UcoObject classes in UCO. It is also easier to maintain, gives the UCO class tree better coherence, and cleans up much of the current messiness in the class hierarchy.
Examples
This simple example is from the same Section 3 of the UCO Design Document as the excerpt quoted in the Background section above:
Coordination
develop
develop