Replace long list of namespaces with list of prefixes used when using serialize. #1679

Closed
wmelder opened this issue Jan 18, 2022 · 18 comments · Fixed by #1686
Labels
bug Something isn't working serialization Related to serialization.

Comments

@wmelder

wmelder commented Jan 18, 2022

I am wondering why the latest version of rdflib gives me a long list of namespaces when serializing a Graph to JSON-LD. It didn't use to be like that.

            return self.graph.serialize(
                format=return_format,
                context=dict(self.graph.namespaces()),
                auto_compact=True
            )

The return_format is 'json-ld'. The context in the result is:

  "@context": {
    "brick": "https://brickschema.org/schema/Brick#",
    "csvw": "http://www.w3.org/ns/csvw#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcam": "http://purl.org/dc/dcam/",
    "dcat": "http://www.w3.org/ns/dcat#",
    "dcmitype": "http://purl.org/dc/dcmitype/",
    "dcterms": "http://purl.org/dc/terms/",
    "doap": "http://usefulinc.com/ns/doap#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "gtaa": "http://data.beeldengeluid.nl/gtaa/",
    "non-gtaa": "http://data.beeldengeluid.nl/nongtaa/",
    "odrl": "http://www.w3.org/ns/odrl/2/",
    "org": "http://www.w3.org/ns/org#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "prof": "http://www.w3.org/ns/dx/prof/",
    "prov": "http://www.w3.org/ns/prov#",
    "qb": "http://purl.org/linked-data/cube#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "schema": "https://schema.org/",
    "sdo": "https://schema.org/",
    "sh": "http://www.w3.org/ns/shacl#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "sosa": "http://www.w3.org/ns/sosa/",
    "ssn": "http://www.w3.org/ns/ssn/",
    "time": "http://www.w3.org/2006/time#",
    "vann": "http://purl.org/vocab/vann/",
    "void": "http://rdfs.org/ns/void#",
    "xml": "http://www.w3.org/XML/1998/namespace",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },

This is where the graph and namespace bindings (including some custom ones) were created:

        self.graph = Graph()
        self.graph.namespace_manager.bind("skos", SKOS)
        self.graph.namespace_manager.bind("gtaa", Namespace(self._model.GTAA_NAMESPACE))
        self.graph.namespace_manager.bind("non-gtaa", Namespace(self._model.NON_GTAA_NAMESPACE))

Further, in the custom class I add another namespace and triples. Here's a fragment:

        self.graph.namespace_manager.bind('sdo', Namespace(self._model.SCHEMA_DOT_ORG_NAMESPACE))
        # create a node for the record
        self.itemNode = URIRef(self.get_uri(concept_type, metadata["id"]))

        # get the RDF class URI for this type
        self.classUri = self._model.CLASS_URIS_FOR_DAAN_LEVELS[concept_type]

        # add the type
        self.graph.add((self.itemNode, RDF.type, URIRef(self.classUri)))

The custom class reads some JSON from a backend system, interprets it and generates RDF for the item. You could see this as a wrapper pattern.

As I wrote in comments to this issue, I had to write a custom function to remove unused prefixes from the context, but that code is not very dynamic because the list of prefixes is hard-coded:

    def remove_unused_prefixes(self):
        """ Clean up the long list of namespaces.
        """
        context = dict(self.graph.namespaces())
        used_prefixes = ['gtaa', 'non-gtaa', 'rdf', 'rdfs', 'sdo', 'skos', 'xml', 'xsd']
        return {p: context[p] for p in used_prefixes}

and this is used here:

            context_used = self.remove_unused_prefixes()
            return self.graph.serialize(
                format=return_format,
                context=context_used,
                auto_compact=True
            )

Now I just discovered that the context argument can be left out. This is probably because of recent improvements and the integration of JSON-LD. Well done. When omitting the context argument, auto_compact=True finally gives me the short representation that I wanted, for example: "sdo:datePublished": "2006-02-19". This is not the case when using context=dict(self.graph.namespaces()). But after all that, I still get the long list of namespaces that aren't used.

Another discovery: when further reducing the number of arguments, I still get JSON-LD, but no context at all in the results.

           return self.graph.serialize(
                format=return_format
            )

To summarize this issue: I would like to get JSON-LD serialization including context, but with a minimal list of (used) prefixes/namespaces in response to this request:

            return self.graph.serialize(
                format='json-ld',
                auto_compact=True
            )

I hope the provided examples help.
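
The hard-coded list in remove_unused_prefixes could, in principle, be computed from the IRIs that actually occur in the graph. The following is a minimal sketch of that idea in plain Python, not rdflib's own behaviour: the function name `used_prefixes`, the sample bindings and the `http://example.org/` IRI are all made up for illustration. It assumes the bindings are available as a prefix-to-namespace mapping (the shape of `dict(graph.namespaces())`) and that the caller supplies the IRIs in use.

```python
from typing import Dict, Iterable


def used_prefixes(namespaces: Dict[str, str], iris: Iterable[str]) -> Dict[str, str]:
    """Keep only the bindings whose namespace matches at least one IRI.

    Hypothetical helper: each IRI is credited to the longest namespace it
    starts with, and every prefix bound to that namespace is kept (so
    aliases such as "schema" and "sdo" both survive).
    """
    # Distinct namespace IRIs, longest first, so the most specific one wins.
    candidates = sorted({str(ns) for ns in namespaces.values()}, key=len, reverse=True)
    seen = set()
    for iri in iris:
        for namespace in candidates:
            if str(iri).startswith(namespace):
                seen.add(namespace)
                break
    return {prefix: ns for prefix, ns in namespaces.items() if str(ns) in seen}


# Illustrative bindings and terms (not taken from a real graph).
bindings = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "schema": "https://schema.org/",
    "sdo": "https://schema.org/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}
terms = [
    "http://example.org/item/1",  # matches no bound namespace
    "https://schema.org/datePublished",
    "http://www.w3.org/2004/02/skos/core#prefLabel",
]
context = used_prefixes(bindings, terms)
print(context)
```

With rdflib, the IRIs would presumably come from iterating over the graph's triples and keeping the URIRef terms; datatype IRIs on literals (for example xsd) would have to be added separately. This scans every term once, so it may be costly for large graphs.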

@nicholascar
Member

OK, confirming I can reproduce this problem like this:

# establish a minimal graph
from rdflib import Graph

g = Graph()
g.parse(
    data='''
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        <a:> a <b:> ;
             foaf:name "Nick" .
        ''',
    format="turtle"
)

Normal JSON-LD serialization is fine:

print(g.serialize(format="json-ld"))

yields

[
  {
    "@id": "a:",
    "@type": [
      "b:"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@value": "Nick"
      }
    ]
  }
]

BUT, if I set auto_compact=True, i.e. print(g.serialize(format="json-ld", auto_compact=True)) then I get:

{
  "@context": {
    "brick": "https://brickschema.org/schema/Brick#",
    "csvw": "http://www.w3.org/ns/csvw#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcam": "http://purl.org/dc/dcam/",
    "dcat": "http://www.w3.org/ns/dcat#",
    "dcmitype": "http://purl.org/dc/dcmitype/",
    "dcterms": "http://purl.org/dc/terms/",
    "doap": "http://usefulinc.com/ns/doap#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "geo": "http://www.opengis.net/ont/geosparql#",
    "odrl": "http://www.w3.org/ns/odrl/2/",
    "org": "http://www.w3.org/ns/org#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "prof": "http://www.w3.org/ns/dx/prof/",
    "prov": "http://www.w3.org/ns/prov#",
    "qb": "http://purl.org/linked-data/cube#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "schema": "https://schema.org/",
    "sh": "http://www.w3.org/ns/shacl#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "sosa": "http://www.w3.org/ns/sosa/",
    "ssn": "http://www.w3.org/ns/ssn/",
    "time": "http://www.w3.org/2006/time#",
    "vann": "http://purl.org/vocab/vann/",
    "void": "http://rdfs.org/ns/void#",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@id": "a:",
  "@type": "b:",
  "foaf:name": "Nick"
}

Process finished with exit code 0

Based on #1676 which was just lodged, it looks like the extra prefixes come through for other formats too, just not for Turtle or certain forms of JSON-LD.
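
Until the serializer itself drops unused prefixes, the compacted output can be post-processed. The sketch below is one possible workaround using only the standard `json` module; `collect_prefixes` and `prune_context` are hypothetical names, not rdflib functions. It assumes the `@context` is a flat prefix-to-IRI map like the one in the reproduction above, and it errs on the side of keeping a prefix whenever any key or string value starts with `prefix:`.

```python
import json
from typing import Any, Set


def collect_prefixes(node: Any, known: Set[str], found: Set[str]) -> None:
    """Record each known prefix that appears as `prefix:` in a key or string value."""
    if isinstance(node, dict):
        for key, value in node.items():
            collect_prefixes(key, known, found)
            collect_prefixes(value, known, found)
    elif isinstance(node, list):
        for item in node:
            collect_prefixes(item, known, found)
    elif isinstance(node, str) and ":" in node:
        prefix = node.split(":", 1)[0]
        if prefix in known:
            found.add(prefix)


def prune_context(jsonld_text: str) -> str:
    """Return the JSON-LD text with unused entries removed from a flat @context."""
    doc = json.loads(jsonld_text)
    if not isinstance(doc, dict) or not isinstance(doc.get("@context"), dict):
        return jsonld_text  # nothing to prune, e.g. expanded form without a context
    context = doc["@context"]
    body = {key: value for key, value in doc.items() if key != "@context"}
    found: Set[str] = set()
    collect_prefixes(body, set(context), found)
    doc["@context"] = {p: iri for p, iri in context.items() if p in found}
    return json.dumps(doc, indent=2, sort_keys=True)


# Shortened version of the compacted output shown above.
raw = json.dumps({
    "@context": {
        "foaf": "http://xmlns.com/foaf/0.1/",
        "owl": "http://www.w3.org/2002/07/owl#",
        "xsd": "http://www.w3.org/2001/XMLSchema#",
    },
    "@id": "a:",
    "@type": "b:",
    "foaf:name": "Nick",
})
pruned = json.loads(prune_context(raw))
print(pruned["@context"])
```

A literal whose text happens to begin with a bound prefix and a colon would keep that prefix unnecessarily; that is harmless but means the result is not guaranteed to be minimal.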

@nicholascar nicholascar added bug Something isn't working serialization Related to serialization. labels Jan 18, 2022
@nicholascar
Member

So indeed RDFlib recently added all namespaces defined in the Namespaces module to graphs by default (see https://github.com/RDFLib/rdflib/blob/master/rdflib/namespace/__init__.py#L341) however some formats seem to auto-clean away unused prefixes but not all do.

@hsolbrig
Contributor

Perhaps we should find the code that does the auto-clean and make it generally available? We could make use of it in some of our own serializers...

@nicholascar
Member

> Perhaps we should find the code that does the auto-clean

Yes, I think that's what's needed here. I worry, though, that it may not be one "piece" of code but rather an ongoing record that the Turtle/N3 serializer keeps of all namespaces seen as it operates, line by line, which it then uses to drop unseen ones from the namespaces list. If that's the case, it won't be transferable, and re-examining the graph for unused namespaces could be expensive.

We could always reconsider the policy to include the 20 or so namespaces in the namespaces module by default...

@niklasl
Member

niklasl commented Jan 20, 2022

I distinctly recall making sure the JSON-LD serializer (especially the auto-compact) didn't add spurious prefixes a long time ago. Either some change deliberately altered that, or something underneath changed (and if so, obviously didn't affect all serializations). Not looking for "blame", just thinking there might be something complex going on (e.g. routines on different levels working "against" each other).

@nicholascar
Member

nicholascar commented Jan 20, 2022

> just thinking there might be something complex going on

Perhaps the original no-unused-prefixes code worked fine before we added the specific binding at namespace/__init__.py#L341, which then overrides it in some way. So this is the different "levels" idea you mention.

All my fault in trying to include more prefixes by default to make prettier namespaces rather than ns, ns2 etc...

@wmelder
Author

wmelder commented Jan 20, 2022

Maybe it is possible to have the 'normalizeUri' function add prefixes whenever they are seen, instead of binding all by default?

@nicholascar
Member

Something like that, but it will be a bit tricky to have the namespace object keep track of all used IRIs efficiently (a callback to a dict of IRI namespaces, I suppose, using normalizeUri as you suggest, but for potentially every subject, predicate and object).

@wmelder
Author

wmelder commented Jan 20, 2022

@nicholascar I noticed 'compute_qname' already binds a namespace when it is not in the graph namespaces yet. I don't know exactly when this is used, but I guess whenever a triple is added? More generally, whenever a triple is added or an RDF file is parsed, wouldn't that be the place to check whether the graph 'knows' the namespace (for s, p and o) and add new ones?
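
The "bind on first use" idea can be sketched independently of rdflib. The class below is purely illustrative: `LazyPrefixes`, `note` and `note_triple` are hypothetical names, terms are plain strings, and all three positions of a triple are assumed to be IRIs (a real implementation would need to skip literals and blank nodes). It keeps a registry of prefixes the library is willing to use and only marks one as bound once an IRI in its namespace has been seen.

```python
from typing import Dict, Optional


class LazyPrefixes:
    """Binds a prefix only when an IRI in its namespace is first seen (sketch)."""

    def __init__(self, registry: Dict[str, str]) -> None:
        # registry: every prefix we are *willing* to use, longest namespace first
        # so that the most specific namespace matches.
        self._registry = sorted(registry.items(), key=lambda kv: len(kv[1]), reverse=True)
        self.bound: Dict[str, str] = {}

    def note(self, iri: str) -> Optional[str]:
        """Bind and return the prefix for `iri`, or None if nothing matches."""
        for prefix, namespace in self._registry:
            if iri.startswith(namespace):
                self.bound.setdefault(prefix, namespace)
                return prefix
        return None

    def note_triple(self, subject: str, predicate: str, obj: str) -> None:
        """Check all three terms of a triple; assumes each one is an IRI string."""
        for term in (subject, predicate, obj):
            self.note(term)


prefixes = LazyPrefixes({
    "foaf": "http://xmlns.com/foaf/0.1/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
})
prefixes.note_triple(
    "http://example.org/a",  # made-up subject; matches no registered namespace
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "http://xmlns.com/foaf/0.1/Person",
)
print(prefixes.bound)
```

Each lookup here is linear in the number of registered namespaces and runs for every term added, which is the efficiency concern raised earlier in the thread; a trie or a cache keyed on the namespace part of the IRI might reduce that cost.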

@cmungall
Contributor

> We could always reconsider the policy to include the 20 or so namespaces in the namespaces module by default...

As an rdflib user this would be my preference, rather than post-processing to remove. I would like to be in complete control over namespaces (perhaps with the exceptions of the true core ones like rdf/rdfs/owl/xsd).

> default to make prettier namespaces rather than ns, ns2 etc

It's true these are not so pretty, but maybe keeping things simple is best? Perhaps there could be some convenience methods to pre-populate namespaces with either the 20 or so prefixes, or even pre-populate from an existing prefix registry?

@wmelder
Author

wmelder commented Jan 21, 2022

> So indeed RDFlib recently added all namespaces defined in the Namespaces module to graphs by default (see https://github.com/RDFLib/rdflib/blob/master/rdflib/namespace/__init__.py#L341) however some formats seem to auto-clean away unused prefixes but not all do.

Indeed, this behaviour is triggered by Graph initialization. The Graph takes a NamespaceManager as an argument, and this NamespaceManager adds the list of namespaces. So it is not a serialization problem: serialize only writes what is in the Graph.

@nicholascar
Member

OK, I think the best bet is to wind back the adding of 20 or so namespaces! We can revert to adding just RDF, RDFS, OWL & XSD (and XML for RDF/XML) and then think of some new function or graph init flag to add in others, such as the 20 in namespaces/.

I'll try and come up with a PR for this.

@wmelder
Author

wmelder commented Jan 22, 2022

Thanks, I would appreciate that.

Some SPARQL editors have a (reverse) lookup function that matches namespaces to prefixes and vice versa using prefix.cc. I guess that we don't want RDFLib to have an external dependency like that, but what if RDFLib provided (reverse) prefix lookup internally? I mean, just so that when RDF is parsed or added, the used prefixes/namespaces are added to the Graph automatically? It could use this context with all preferred namespace-prefix pairings, for example.

@niklasl
Member

niklasl commented Jan 22, 2022

> OK, I think the best bet is to wind back the adding of 20 or so namespaces! We can revert to adding just RDF, RDFS, OWL & XSD (and XML for RDF/XML) and then think of some new function or graph init flag to add in others, such as the 20 in namespaces/.
>
> I'll try and come up with a PR for this.

That sounds good!

One thing about the XML namespace: it "shouldn't" really be defined as an rdflib.Namespace, and certainly not be bound in the Graph. It is never declared in RDF/XML documents, nor any XML. While xml:lang is certainly used in RDF/XML, it is just core XML syntax (implicitly declared and a reserved prefix).

Given that it is used in RDFLib XML processing code for attribute access I'm not proposing to remove it entirely. But it ought to be clarified that it is not to be used as an "RDF namespace" (i.e. a vocabulary/ontology prefix), and thus I don't think it should ever be bound in a graph (unless intentionally, for reasons I don't understand but are not up to me to question).

@ashleysommer
Contributor

ashleysommer commented Jan 23, 2022

I'll chime in here too. Last week I found that the latest version of RDFLib (with the extra prefixes in the Namespaces) caused PySHACL to start failing a test. I tracked it down to a conflicting namespace (the test defined the prefix to be X, but RDFLib defined the prefix to be Z) that caused the test to no longer pass. However, it did expose a problem with the test (a misspelled ontology URI) and broken logic in the test (it shouldn't have passed to begin with). That helped me fix the test.

Anyway, I'm in favour of keeping the prefixes, and auto-removing them in the serializers.

@nicholascar
Member

I'm part way to solving this, see #1686

@mgberg
Contributor

mgberg commented Jun 1, 2022

Another scenario where I would personally find it useful to have JSON-LD auto_compact only show "used" prefixes (like the N3 serializer does) is when you are serializing one graph from a Dataset. The Dataset and Graphs obtained using the get_context or graph methods share a NamespaceManager, and one named graph will likely not contain usages of all the prefixes bound (automatically or manually) to the Dataset. Therefore, you can still end up with a large dictionary containing some unused prefixes for the context of a serialization of one named graph in a Dataset.

@cmungall
Contributor

cmungall commented Aug 8, 2023

In case anyone ends up here confused, as I often do: I believe the current status is that this was fixed in rdflib 6.2.0, but then regressed in later versions.

See #2103 for discussion


7 participants