Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LuceneSailConnection breaks the semantics of a query with COUNT #1229

Closed
KMax opened this issue Dec 26, 2018 · 5 comments
Closed

LuceneSailConnection breaks the semantics of a query with COUNT #1229

KMax opened this issue Dec 26, 2018 · 5 comments
Labels
🐞 bug issue is a bug specification issues related to compliance to standards and external specs
Milestone

Comments

@KMax
Copy link
Contributor

KMax commented Dec 26, 2018

Looks like LuceneSailConnection.addBindingSets is too smart than it should be. Let me explain with an example problem which it causes.

LucenseSailConnection changes the semantics

Query:

SELECT  (COUNT(DISTINCT ?leUri) AS ?searchCount) {
    ?searchR  search:matches  _:b0 .
     _:b0 search:query    "1157847449121" ;
             search:score    ?score .
    {
        { ?searchR  rdf:type  fibo-fnd-org-fm:FormalOrganization }
        UNION
        { ?searchR  rdf:type  fibo-ru:MutualFund }
        BIND(?searchR AS ?leUri)
    }
    UNION
    { ?leUri  fibo-ru:hasRegistrationIdentifier  ?searchR }
    UNION
    { ?leUri  fibo-ru:hasVATRegistrationNumber  ?searchR }
    ?leUri fibo-be-corp-corp:hasDateOfRegistration/fibo-fnd-dt-fd:hasDateValue ?leOgrnDate .
}

Query plan before LuceneSailConnection:

QueryRoot
   Projection
      ProjectionElemList
         ProjectionElem "searchCount"
      Extension
         ExtensionElem (searchCount)
            Count (Distinct)
               Var (name=leUri)
         Group ()
            Join
               Join
                  Join
                     Join
                        Join
                           StatementPattern
                              Var (name=searchR)
                              Var (name=_const_232d65d1_uri, value=http://www.openrdf.org/contrib/lucenesail#matches, anonymous)
                              Var (name=_anon_1, anonymous)
                           StatementPattern
                              Var (name=_anon_1, anonymous)
                              Var (name=_const_802884e6_uri, value=http://www.openrdf.org/contrib/lucenesail#query, anonymous)
                              Var (name=_const_b114c3d0_lit_e2eec718_0, value="1157847449121", anonymous)
                        StatementPattern
                           Var (name=_anon_1, anonymous)
                           Var (name=_const_803caab0_uri, value=http://www.openrdf.org/contrib/lucenesail#score, anonymous)
                           Var (name=score)
                     Union
                        Extension
                           ExtensionElem (leUri)
                              Var (name=searchR)
                           Union
                              StatementPattern
                                 Var (name=searchR)
                                 Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
                                 Var (name=_const_d4c4b487_uri, value=https://spec.edmcouncil.org/fibo/ontology/FND/Organizations/FormalOrganization, anonymous)
                              StatementPattern
                                 Var (name=searchR)
                                 Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
                                 Var (name=_const_e18566fb_uri, value=https://w3id.org/datafabric.cc/ontologies/fibo-ru#MutualFund, anonymous)
                        Union
                           StatementPattern
                              Var (name=leUri)
                              Var (name=_const_3923275a_uri, value=https://w3id.org/datafabric.cc/ontologies/fibo-ru#hasRegistrationIdentifier, anonymous)
                              Var (name=searchR)
                           StatementPattern
                              Var (name=leUri)
                              Var (name=_const_5135dc53_uri, value=https://w3id.org/datafabric.cc/ontologies/fibo-ru#hasVATRegistrationNumber, anonymous)
                              Var (name=searchR)
                  StatementPattern
                     Var (name=leUri)
                     Var (name=_const_e10cc7af_uri, value=https://spec.edmcouncil.org/fibo/ontology/BE/Corporations/Corporations/hasDateOfRegistration, anonymous)
                     Var (name=_anon_a8d9694f_13f6_4705_ae28_ffab0c29e6e8, anonymous)
               StatementPattern
                  Var (name=_anon_a8d9694f_13f6_4705_ae28_ffab0c29e6e8, anonymous)
                  Var (name=_const_6df3b22b_uri, value=https://spec.edmcouncil.org/fibo/ontology/FND/DatesAndTimes/FinancialDates/hasDateValue, anonymous)
                  Var (name=leOgrnDate)
            GroupElem
               Count (Distinct)
                  Var (name=leUri)

Query plan after LuceneSailConnection:

QueryRoot
   Projection
      ProjectionElemList
         ProjectionElem "searchCount"
      Join
         BindingSetAssignment (org.eclipse.rdf4j.sail.lucene.BindingSetCollection@b7c21f5f)
         Extension
            ExtensionElem (searchCount)
               Count (Distinct)
                  Var (name=leUri)
            Group ()
               Join
                  Join
                     Join
                        Join
                           Join
                              SingletonSet
                              SingletonSet
                           SingletonSet
                        Union
                           Extension
                              ExtensionElem (leUri)
                                 Var (name=searchR)
                              Union
                                 StatementPattern
                                    Var (name=searchR)
                                    Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
                                    Var (name=_const_d4c4b487_uri, value=https://spec.edmcouncil.org/fibo/ontology/FND/Organizations/FormalOrganization, anonymous)
                                 StatementPattern
                                    Var (name=searchR)
                                    Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
                                    Var (name=_const_e18566fb_uri, value=https://w3id.org/datafabric.cc/ontologies/fibo-ru#MutualFund, anonymous)
                           Union
                              StatementPattern
                                 Var (name=leUri)
                                 Var (name=_const_3923275a_uri, value=https://w3id.org/datafabric.cc/ontologies/fibo-ru#hasRegistrationIdentifier, anonymous)
                                 Var (name=searchR)
                              StatementPattern
                                 Var (name=leUri)
                                 Var (name=_const_5135dc53_uri, value=https://w3id.org/datafabric.cc/ontologies/fibo-ru#hasVATRegistrationNumber, anonymous)
                                 Var (name=searchR)
                     StatementPattern
                        Var (name=leUri)
                        Var (name=_const_e10cc7af_uri, value=https://spec.edmcouncil.org/fibo/ontology/BE/Corporations/Corporations/hasDateOfRegistration, anonymous)
                        Var (name=_anon_a8d9694f_13f6_4705_ae28_ffab0c29e6e8, anonymous)
                  StatementPattern
                     Var (name=_anon_a8d9694f_13f6_4705_ae28_ffab0c29e6e8, anonymous)
                     Var (name=_const_6df3b22b_uri, value=https://spec.edmcouncil.org/fibo/ontology/FND/DatesAndTimes/FinancialDates/hasDateValue, anonymous)
                     Var (name=leOgrnDate)
               GroupElem
                  Count (Distinct)
                     Var (name=leUri)

As you can see LuceneSailConnection sets the bindings resulted from the full text search right after the projection, instead of just replacing the search pattern.

Search pattern from the query:

?searchR  search:matches  _:b0 .
 _:b0 search:query    "1157847449121" ;
        search:score    ?score .

It changes the semantics of the query. The projection produces more than a single solution if the full-text search produces any result.

QueryJoinOptimizer fixes the issue, but query become suboptimal

The QueryJoinOptimizer saves the situation by moving the binding set to the right side of the join. The query plan after the join optimizer:

QueryRoot
   Projection
      ProjectionElemList
         ProjectionElem "searchCount"
      Join
         Extension
            ExtensionElem (searchCount)
               Count (Distinct)
                  Var (name=leUri)
            Group ()
               Join
                  ...
               GroupElem
                  Count (Distinct)
                     Var (name=leUri)
         BindingSetAssignment (org.eclipse.rdf4j.sail.lucene.BindingSetCollection@b7c21f5f)

However, the plan becomes suboptimal, since the results of the full-text search aren't used in the joins as they must be.

What is the fix?

Is there any reason for LuceneSailConnection.addBindingSets being so "smart"? Can it just put the resulted binding sets in the place where the search pattern was?

@barthanssens barthanssens added 🐞 bug issue is a bug specification issues related to compliance to standards and external specs rdf4j-storage labels Dec 28, 2018
@abrokenjester
Copy link
Contributor

Thanks for reporting this @KMax , at first glance your analysis looks spot-on, and I can't immediately think we wouldn't do an in-place swap of the search pattern with its result.

@abrokenjester abrokenjester modified the milestones: 2.5.0, 2.4.3 Jan 8, 2019
@abrokenjester
Copy link
Contributor

I had a look at a quick fix, but this will require some significant refactoring unfortunately.

@abrokenjester abrokenjester removed this from the 2.4.3 milestone Jan 19, 2019
@KMax
Copy link
Contributor Author

KMax commented Mar 22, 2019

I coded (commit) a solution for QuerySpec (the rest of SearchQueryEvaluator subclasses should be refactored) and wrote a test.

Are there any tests for the rest of the subclasses (DistanceQuerySpec and GeoRelationQuerySpec), so I'd refactor them too? Maybe I looked at wrong places.

@abrokenjester
Copy link
Contributor

@KMax I had a look at your commit, looks legit. Would it be a lot of extra work to generalize your override in QuerySpec to all SearchQueryEvaluator implementations? The logic at first glance appears to be transferable.

As for tests: unfortunately unit test coverage for the lucene sail components is low. By all means introduce more unit tests, we'd be grateful :)

We do have a test suite that test the overall functionality and compliance of the lucene sail. But I just had a look and for some reason that abstract test suite is only executed for the Solr and ElasticSearch indexers, not for Lucene itself. Not your problem: I will do a quick fix around that and publish to the master branch.

@KMax
Copy link
Contributor Author

KMax commented Mar 23, 2019

Would it be a lot of extra work to generalize your override in QuerySpec to all SearchQueryEvaluator implementations?

Yea, I'm working on it. Currently, trying to write unit tests based on the docs, so far without success.

We do have a test suite that test the overall functionality and compliance of the lucene sail.

Cool, I'll take a look.

abrokenjester pushed a commit to eclipse-rdf4j/rdf4j-storage that referenced this issue Mar 23, 2019
hmottestad added a commit to eclipse-rdf4j/rdf4j-storage that referenced this issue Apr 6, 2019
* develop:
  eclipse-rdf4j/rdf4j#1384 bug fix
  eclipse-rdf4j/rdf4j#1384 optimize
  formatter
  faster uniqueLang on empty base sail
  eclipse-rdf4j/rdf4j#1384 test and fix
  don't activate quick profile on skipTests
  shacl tests use logback explicitly instead of slf4j-simple
  eclipse-rdf4j/rdf4j#1290 allow to read and delete shapes
  eclipse-rdf4j/rdf4j#1236 clean install succesful
  eclipse-rdf4j/rdf4j#1236 moved test suites for geosparql, serql, shacl, and sparql
  eclipse-rdf4j/rdf4j#1236 moved tests for shacl and spin
  eclipse-rdf4j/rdf4j#1236 moved store and lucene test suites
  eclipse-rdf4j/rdf4j#1236 cleanup
  eclipse-rdf4j/rdf4j#1236 migrating sailmodel compliance [wip]
  eclipse-rdf4j/rdf4j#1236 sail compliance tests to respective modules
  fixed formatting
  eclipse-rdf4j/rdf4j#1229 compliance tests for Lucene itself

Signed-off-by: Håvard Ottestad <hmottestad@gmail.com>

# Conflicts:
#	shacl/src/main/java/org/eclipse/rdf4j/sail/shacl/ShaclSailConnection.java
@abrokenjester abrokenjester added this to the 2.5.3 milestone Jun 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🐞 bug issue is a bug specification issues related to compliance to standards and external specs
Projects
None yet
Development

No branches or pull requests

3 participants