Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Equality comparison of two literals with incompatible (but known) datatypes should return false in standard evaluation mode #3947

Closed
jetztgradnet opened this issue May 31, 2022 · 10 comments · Fixed by #4539
Assignees
Labels
📶 enhancement issue is a new feature or improvement 📦 sparql affects the SPARQL parser / engine
Milestone

Comments

@jetztgradnet
Copy link

jetztgradnet commented May 31, 2022

Current Behavior

I'm porting an application originally written for Apache Jena to use RDF4J instead.

One unit test runs a query like this (combined example with data plus query):

prefix ex:      <http://example.com/> 
prefix xsd:        <http://www.w3.org/2001/XMLSchema#> 

SELECT * WHERE {
  # Sample correct data for testing
  VALUES (?sub ?obj) {
  	(ex:wrong1 "value1-en"@de) # 1 error
  	(ex:wrong2 "value1-de"@en)  # 1 error
  	(ex:wrong3 "value NOT")  # 1 error
  	(ex:wrong4 "111"^^xsd:integer)  # 1 error
  }
  FILTER( ?obj NOT IN ( 
        "1"^^xsd:integer ,
        "value1-de"@de ,
        "value1-en"@en , 		
        "value"^^xsd:string ))
 }

I would expect that four solutions are returned, but RDF4J only returns two (the two language strings). Interestingly/strangely, commenting the value with the xsd:integer makes the third value (plain xsd:string) also part of the result.

So it looks like there is some issue when comparing values of different data types in the NOT IN filter clause.

As a slightly related issue, running this as NOT IN (two or more whitespaces between NOT and IN produces an error (the query is generated so there is little control over whitespace). Is RDF4J rather strict here or does the grammar really not allow multiple whitespaces here?

prefix ex:      <http://example.com/> 
prefix xsd:        <http://www.w3.org/2001/XMLSchema#> 

SELECT * WHERE {
  # Sample correct data for testing
  VALUES (?sub ?obj) {
  	(ex:wrong1 "value1-en"@de) # 1 error
  }

  # two whitespace between NOT and IN
  FILTER( ?obj NOT   IN ( 
        "value1-en"@en , 
        "value"^^xsd:string ))
 }

Expected Behavior

  • I would expect that all four solutions are returned by the provided query
  • I would expect that I can also use NOT IN with multiple white space instead of just NOT IN

Steps To Reproduce

No response

Version

4.0.0

Are you interested in contributing a solution yourself?

Perhaps?

Anything else?

No response

@jetztgradnet jetztgradnet added the 🐞 bug issue is a bug label May 31, 2022
@jetztgradnet jetztgradnet changed the title Unexpectedly suppressed values in FILTER clause Unexpectedly suppressed values in FILTER NOT IN clause May 31, 2022
@jetztgradnet jetztgradnet changed the title Unexpectedly suppressed values in FILTER NOT IN clause Issues with FILTER NOT IN clause May 31, 2022
@hmottestad
Copy link
Contributor

hmottestad commented Jun 1, 2022

I wonder if this could be related to #3696 ?

The specification for NOT IN says that it should be equivalent to != so that ?a NOT IN (1,2,3) should be the same as writing (?a != 1) && (?a != 2) && (?a != 3). Does this produce the same result for you?

As for the whitespace I'm not sure how strict we should be.

Edit: According to the grammar specification the amount of white space between NOT and IN should not matter.

@jetztgradnet
Copy link
Author

jetztgradnet commented Jun 1, 2022

Regarding using !=: this is a generated query, so I didn't want to change the original templates producing this query...
But I added a version of this to the unit test below (testFilter_NotEquals_withMixedDatatypes()) and to my surprise it didn't change the behavior, it still only returns two solutions.

A unit test illustrating the two issues:

import static org.junit.Assert.assertEquals;

import org.eclipse.rdf4j.query.TupleQuery;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.sail.memory.MemoryStore;
import org.junit.Test;

public class FilterTest {

    @Test
    public void testFilter_NotIn_withMixedDatatypes() {
        Repository repo = new SailRepository(new MemoryStore());
        try (RepositoryConnection conn = repo.getConnection()) {
            String queryString = "prefix ex:      <http://example.com/> \n"
                    + "prefix xsd:        <http://www.w3.org/2001/XMLSchema#> \n"
                    + "\n"
                    + "SELECT * WHERE {\n"
                    + "  # Sample correct data for testing\n"
                    + "  VALUES (?sub ?obj) {\n"
                    + "    (ex:wrong1 \"value1-en\"@de) # 1 error\n"
                    + "    (ex:wrong2 \"value1-de\"@en)  # 1 error\n"
                    + "    (ex:wrong3 \"value NOT\")  # 1 error\n"
                    + "    (ex:wrong4 \"111\"^^xsd:integer)  # 1 error\n"
                    + "  }\n"
                    + "  FILTER( ?obj NOT IN ( \n"
                    + "        \"1\"^^xsd:integer ,\n"
                    + "        \"value1-de\"@de ,\n"
                    + "        \"value1-en\"@en ,        \n"
                    + "        \"value\"^^xsd:string ))\n"
                    + " }";
            TupleQuery tupleQuery = conn.prepareTupleQuery(queryString);
            try (TupleQueryResult result = tupleQuery.evaluate()) {
                int count = 0;
                while (result.hasNext()) { // iterate over the result
                    result.next();
                    count++;
                }

                assertEquals("Should have four solutions", 4, count);
            }
        }
    }
    
    @Test
    public void testFilter_NotEquals_withMixedDatatypes() {
        Repository repo = new SailRepository(new MemoryStore());
        try (RepositoryConnection conn = repo.getConnection()) {
            String queryString = "prefix ex:      <http://example.com/> \n"
                    + "prefix xsd:        <http://www.w3.org/2001/XMLSchema#> \n"
                    + "\n"
                    + "SELECT * WHERE {\n"
                    + "  # Sample correct data for testing\n"
                    + "  VALUES (?sub ?obj) {\n"
                    + "    (ex:wrong1 \"value1-en\"@de) # 1 error\n"
                    + "    (ex:wrong2 \"value1-de\"@en)  # 1 error\n"
                    + "    (ex:wrong3 \"value NOT\")  # 1 error\n"
                    + "    (ex:wrong4 \"111\"^^xsd:integer)  # 1 error\n"
                    + "  }\n"
                    + "  FILTER( (\n"
                    + "        (?obj != \"1\"^^xsd:integer)\n"
                    + "     && (?obj != \"value1-de\"@de)\n"
                    + "     && (?obj != \"value1-en\"@en)        \n"
                    + "     && (?obj != \"value\"^^xsd:string )))\n"
                    + " }";
            TupleQuery tupleQuery = conn.prepareTupleQuery(queryString);
            try (TupleQueryResult result = tupleQuery.evaluate()) {
                int count = 0;
                while (result.hasNext()) { // iterate over the result
                    result.next();
                    count++;
                }

                assertEquals("Should have four solutions", 4, count);
            }
        }
    }
    
    @Test
    public void testFilter_NotIn_withSingleWhitespace() {
        Repository repo = new SailRepository(new MemoryStore());
        try (RepositoryConnection conn = repo.getConnection()) {
            String queryString = ""
                    + "prefix ex:  <http://example.com/> \n"
                    + "prefix xsd: <http://www.w3.org/2001/XMLSchema#> \n"
                    + "\n"
                    + "SELECT * WHERE {\n"
                    + "  # Sample correct data for testing\n"
                    + "  VALUES (?sub ?obj) {\n"
                    + "    (ex:wrong1 \"value1-en\"@de) # 1 error\n"
                    + "  }\n"
                    + "\n"
                    + "  # two whitespace between NOT and IN\n"
                    + "  FILTER( ?obj NOT IN ( \n"
                    + "        \"value1-en\"@en , \n"
                    + "        \"value\"^^xsd:string ))\n"
                    + "}";
            TupleQuery tupleQuery = conn.prepareTupleQuery(queryString);
            try (TupleQueryResult result = tupleQuery.evaluate()) {
                int count = 0;
                while (result.hasNext()) { // iterate over the result
                    result.next();
                    count++;
                }

                assertEquals("Should have one solutions", 1, count);
            }
        }
    }
    
    @Test
    public void testFilter_NotIn_withMultipleWhitespace() {
        Repository repo = new SailRepository(new MemoryStore());
        try (RepositoryConnection conn = repo.getConnection()) {
            // the only differences to the previous test case
            // are additional whitespace between NOT and IN
            String queryString = ""
                    + "prefix ex:  <http://example.com/> \n"
                    + "prefix xsd: <http://www.w3.org/2001/XMLSchema#> \n"
                    + "\n"
                    + "SELECT * WHERE {\n"
                    + "  # Sample correct data for testing\n"
                    + "  VALUES (?sub ?obj) {\n"
                    + "    (ex:wrong1 \"value1-en\"@de) # 1 error\n"
                    + "  }\n"
                    + "\n"
                    + "  # two whitespace between NOT and IN\n"
                    + "  FILTER( ?obj NOT   IN ( \n"
                    + "        \"value1-en\"@en , \n"
                    + "        \"value\"^^xsd:string ))\n"
                    + "}";
            TupleQuery tupleQuery = conn.prepareTupleQuery(queryString);
            try (TupleQueryResult result = tupleQuery.evaluate()) {
                int count = 0;
                while (result.hasNext()) { // iterate over the result
                    result.next();
                    count++;
                }

                assertEquals("Should have one solutions", 1, count);
            }
        }
    }

}

@jetztgradnet
Copy link
Author

BTW, this behaves the same in GraphDB (which is based on RDF4J) and Blazegraph, but not in Stardog where all four solutions are returned.

@hmottestad hmottestad added the 📦 sparql affects the SPARQL parser / engine label Jun 1, 2022
@abrokenjester
Copy link
Contributor

Thanks for the thorough report @jetztgradnet . Unless someone else picks it up first, I'll try and take a closer look over the weekend.

@hmottestad hmottestad added this to the 4.0.2 milestone Jun 2, 2022
@abrokenjester abrokenjester self-assigned this Jun 4, 2022
@abrokenjester
Copy link
Contributor

Query execution plan for the original query:

Projection (resultSizeActual=2)
╠══ProjectionElemList
║     ProjectionElem "sub"
║     ProjectionElem "obj"
╚══Filter (resultSizeActual=2)
   ├──And
   │  ╠══Compare (!=)
   │  ║     Var (name=obj)
   │  ║     ValueConstant (value="value")
   │  ╚══Compare (!=)
   │        Var (name=obj)
   │        ValueConstant (value="value1-en"@en)
   └──Filter (resultSizeActual=3)
      ╠══And
      ║  ├──Compare (!=)
      ║  │     Var (name=obj)
      ║  │     ValueConstant (value="value1-de"@de)
      ║  └──Compare (!=)
      ║        Var (name=obj)
      ║        ValueConstant (value="1"^^<http://www.w3.org/2001/XMLSchema#integer>)
      ╚══BindingSetAssignment ([[sub=http://example.com/wrong1;obj="value1-en"@de], [sub=http://example.com/wrong2;obj="value1-de"@en], [sub=http://example.com/wrong3;obj="value NOT"], [sub=http://example.com/wrong4;obj="111"^^<http://www.w3.org/2001/XMLSchema#integer>]]) (resultSizeActual=4)

Looks fairly clear that the filter evaluation is somehow tripping up. I have also doublechecked that it is not the VALUES clause causing the problem: the same result occurs when adding triples to the store and using a basic graph pattern to query.

@abrokenjester
Copy link
Contributor

Stepping through it, it fails on comparisons between datatyped and untyped literals:

"org.eclipse.rdf4j.query.algebra.evaluation.ValueExprEvaluationException: Unable to compare strings with other supported types" This happens when comparing "value NOT" with "1"^^xsd:integer.

@abrokenjester
Copy link
Contributor

abrokenjester commented Jun 5, 2022

Oh here we go. It's one of those cases where the SPARQL spec become near-impenetrable.

The case we have here is that we are comparing two literals: "value NOT" with "1"^^xsd:integer, using the != comparison. Both are literals, but with incompatible datatypes (string, and integer). In the operator mapping table (see https://www.w3.org/TR/sparql11-query/#OperatorMapping), the only mapping that applies is therefore comparison using RDF term-equality (the last row in the table):

A != B RDF term RDF term fn:not(RDFterm-equal(A, B)) xsd:boolean

RDFterm-equal (see https://www.w3.org/TR/sparql11-query/#func-RDFterm-equal) is defined as follows:

Returns TRUE if term1 and term2 are the same RDF term as defined in Resource Description Framework (RDF): Concepts and Abstract Syntax [CONCEPTS]; produces a type error if the arguments are both literal but are not the same RDF term *; returns FALSE otherwise. term1 and term2 are the same if any of the following is true:

(emphasis mine)

Note that in our case, when applying the specs strictly, RDFterm-equal is supposed to return a type error: we have two literals, but they are not the same RDF term (as they are not equivalent literals as defined in the linked section in RDF Concepts).

So, term-equality under a strict interpretation results in a type error. You'd think that because we negate this that maybe then gets coerced to an actual true value (as in : they are not equal because we get a type error), but you'd be wrong. Going back to section 17.3 (operator mapping), we find this:

[...] Some of the operators are associated with nested function expressions, e.g. fn:not(op:numeric-equal(A, B)). Note that per the XPath definitions, fn:not and op:numeric-equal produce an error if their argument is an error. [...]

In other words the negation of a type error is also a type error.

So, in the strict interpretation of the SPARQL 1.1 spec, RDF4J is actually giving you the correct answer here.

Does that mean Jena and Stardog are wrong? Well no. As per section 17.3.1 (operator extensibility):

SPARQL language extensions may provide additional associations between operators and operator functions; this amounts to adding rows to the table above. No additional operator may yield a result that replaces any result other than a type error in the semantics defined above. The consequence of this rule is that SPARQL FILTERs will produce at least the same intermediate bindings after applying a FILTER as an unextended implementation.

So what Jena and Stardog do is extend the minimal compliance in a way that the spec allows for. We can do a similar thing in RDF4J (in fact we already do for some other cases). It's fine to extend operator behavior as long as the only cases you touch were type errors, previously), but I need to take a look at how to best fit this in: if we should tweak the strict evaluation strategy itself, or if we should consider this another case for the extended evaluation strategy to handle.

@jetztgradnet
Copy link
Author

Thanks, Jeen, for the thorough investigation!
My naive expectation here would be that I want to exclude a set of things, but the thing we are looking at is not that thing, so it may be returned after all. Whether the checked term is valid based in its datatype shouldn't matter here IMHO.

But anyway, whatever you consider the right path to follow for RDF4J is ok for me, I will then adjust the unit tests of the application ported from Jena accordingly.

Any suggestion about the error for multiple whitespaces between NOT and IN (i.e. "NOT IN" as opposed to "NOT IN" which comes from a generated query where the NOT is inserted conditionally and is wrapped in whitespace)?

@abrokenjester
Copy link
Contributor

Thanks, Jeen, for the thorough investigation! My naive expectation here would be that I want to exclude a set of things, but the thing we are looking at is not that thing, so it may be returned after all. Whether the checked term is valid based in its datatype shouldn't matter here IMHO.

Oh I fully agree that that is how it should work. It's just that a minimally-conforming implementation of the SPARQL spec doesn't :) . But minimally-conforming is kind of useless to stick to, beyond corner cases like strict validation and/or ensuring that you're 100% sure your query will work on any compliant SPARQL engine.

But anyway, whatever you consider the right path to follow for RDF4J is ok for me, I will then adjust the unit tests of the application ported from Jena accordingly.

Any suggestion about the error for multiple whitespaces between NOT and IN (i.e. "NOT IN" as opposed to "NOT IN" which comes from a generated query where the NOT is inserted conditionally and is wrapped in whitespace)?

I'll take a look at that. It's a bug (additional whitespace should be ignored), but I'll split it out from this ticket, since it's really a separate issue.

@abrokenjester
Copy link
Contributor

I think we may need to reclassify this as an improvement / feature request rather than a bug. I've picked up a related refactoring issue (GH-635 ) that aims to make choosing the mode in which the query engine runs a little easier. Haven't yet decided if I want to finish that first or in parallel just add this feature into the current (somewhat flawed) setup (probably in the ExtendedEvaluationStrategy).

@abrokenjester abrokenjester added 📶 enhancement issue is a new feature or improvement and removed 🐞 bug issue is a bug labels Jun 7, 2022
@abrokenjester abrokenjester modified the milestones: 4.0.3, 4.1.0 Jun 7, 2022
@hmottestad hmottestad added M1 Fixed in milestone 1 and removed M1 Fixed in milestone 1 labels Jul 2, 2022
@hmottestad hmottestad removed this from the 4.1.0 milestone Jul 31, 2022
@hmottestad hmottestad added this to the 4.2.0 milestone Jul 31, 2022
@hmottestad hmottestad modified the milestones: 4.2.0, 4.3.0 Sep 16, 2022
@abrokenjester abrokenjester changed the title Issues with FILTER NOT IN clause Equality comparison of two literals with incompatible (but known) datatypes should return false in standard evaluation mode May 6, 2023
abrokenjester added a commit that referenced this issue May 6, 2023
Instead of a type error, we now return false when comparing two literals
with incompatible (but known) datatypes.
abrokenjester added a commit that referenced this issue May 6, 2023
Instead of a type error, we now return false when comparing two literals
with incompatible (but known) datatypes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📶 enhancement issue is a new feature or improvement 📦 sparql affects the SPARQL parser / engine
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants