Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH#2256 Introduce a query optimizer concept #2257

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

JervenBolleman
Copy link
Contributor

First implementation is to make it more likely that VALUES is used on the left side of a join.

Added a test, but the optimizer is not invoked by default at this time

Summary of changes

Introduced a way to add new generic query optimizers that work on the query algebra.

Checklist

  • Checked that there aren't other open pull requests for
    the same change.
  • Added tests for any changes that have a runtime impact.
  • Checked that all tests and type checking passes.
  • For changes that have a potential impact on users of this project:
    • Updated relevant documentation to avoid inaccuracies.
    • Considered adding additional documentation.
    • Considered adding an example in ./examples for new features.
    • Considered updating our changelog (CHANGELOG.md).
  • Considered granting push permissions to the PR branch,
    so maintainers can fix minor issues and keep your PR up to date.

First implementation is to make it more likely that VALUES is used on the left side of a join.

Added a test, but the optimizer is not invoked by default at this time
@@ -63,8 +64,9 @@ def update(self, strOrQuery, initBindings={}, initNs={}):


class SPARQLProcessor(Processor):
def __init__(self, graph):
def __init__(self, graph, optimizers: List[SPARQLOptimizer] = None):
Copy link
Member

@aucampia aucampia Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be best to just accept any callable that maps a query to a query:

So somewhere before this:

_QueryTranslatorType = Callable[[Query],Query]

Then:

Suggested change
def __init__(self, graph, optimizers: List[SPARQLOptimizer] = None):
def __init__(self, graph, query_translators: Optional[List[_QueryTranslatorType]] = None):

That way, users can pass methods or free functions, and even have multiple different translator methods on one class.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your tests should still work fine with that.

return query


class ValuesToTheLeftOfTheJoin(SPARQLOptimizer):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is valuable, I think it may be better to keep it in rdflib._contrib, as we don't necessarily want to offer the same level of compatibility guarantees as we do for other code.


class ValuesToTheLeftOfTheJoin(SPARQLOptimizer):

def optimize(self, query: Query) -> Query:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def optimize(self, query: Query) -> Query:
@classmethod
def optimize(cls, query: Query) -> Query:

As these methods don't use the class state and are side effect free it is best to make them class methods, that way it is clearer to users that they don't have to be concerned with concurrency issues.

query.algebra = self._optimize_node(main)
return query

def _optimize_node(self, cv: Any) -> Any:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _optimize_node(self, cv: Any) -> Any:
@classmethod
def _optimize_node(cls, cv: Any) -> Any:

@aucampia
Copy link
Member

aucampia commented Apr 9, 2023

@JervenBolleman I made a PR against your branch with some changes: JervenBolleman#5

One problem with taking translators as an argument to SPARQLProcessor.__init__ is that it is a bit awkward to call it, it may be better to take translators as an argument to SPARQLProcessor.query(), that way someone can just pass translators when calling Graph.query().

There may be reasons to not do this also, and keep it in the constructor, but I'm not sure that there is.

CC: @niklasl @westurner @RDFLib/core-reviewers

@JervenBolleman
Copy link
Contributor Author

@aucampia thank you for helping with this.

The reason why I did it via the constructor, is that stores normally know how to optimize queries the best. And if you just want to modify a query before sending it for evaluation, there is no need for the store to have any knowledge of the modification, or do it for the user. My natural habitat being RDF4j I think that the approach of query optimizer pipelines is natural, and they are store constructor based.

@westurner
Copy link
Contributor

westurner commented Apr 11, 2023 via email

@aucampia
Copy link
Member

I will have to double-check how exactly this will be called then by passing it to the store, using just graph with the default store, however, the current approach requires this as far as I can tell:

https://github.com/JervenBolleman/rdflib/blob/4dfaaec2b599748360613f58bd6734da225ec7c5/test/test_sparql/test_translators.py#L65-L94

def test_graph_query(rdfs_graph: Graph):
    requested_query_string = """
    PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?x {
        ?x rdfs:label "subClassOf".
    }
    """

    translated_query_string = """
    PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?x {
        ?x rdfs:label "subPropertyOf".
    }
    """

    requested_query = _prepare_query(requested_query_string)
    translated_query = _prepare_query(translated_query_string)

    translate = Mock(return_value=translated_query)

    processor = SPARQLProcessor(rdfs_graph, translators=[translate])
    result = rdfs_graph.query(requested_query, processor)
    rows = []
    for row in result:
        assert isinstance(row, ResultRow)
        rows.append(row.asdict())

    assert rows == [{"x": RDFS.subPropertyOf}]

    translate.assert_called_once_with(requested_query)

It may be worth adding an example of how users should be using this, and also adding a tests for the idiomatic end-to-end usage.

@JervenBolleman when you have a moment do check JervenBolleman#5 - happy to update it if there is something you want different.

@westurner
Copy link
Contributor

To not have to pass the processor as a positional arg on every query,:

result = rdfs_graph.query(requested_query, processor)

(If it isn't already) couldn't it be:

Store.__init__(*, processor:SPARQLProcessor=None)

store = Store(, processor=processor)
# ...
result = rdfs_graph.query(requested_query)

@aucampia
Copy link
Member

To not have to pass the processor as a positional arg on every query,:

result = rdfs_graph.query(requested_query, processor)

(If it isn't already) couldn't it be:

Store.__init__(*, processor:SPARQLProcessor=None)

store = Store(, processor=processor)
# ...
result = rdfs_graph.query(requested_query)

Will check this weekend if this will work or can be made to work.

@aucampia
Copy link
Member

PRs to V6 is closed until further notice. See this for more details:

@aucampia aucampia added the on hold Progress on this issue is blocked by something. label May 21, 2023
@aucampia
Copy link
Member

PRs to V6 is closed until further notice. See this for more details:

We will be open for PRs again once this is resolved:

@aucampia aucampia removed the on hold Progress on this issue is blocked by something. label Jun 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants