Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL: wrong query when LIKE and NOT LIKE are used together with aggregate functions #39931

Closed
broodfusion opened this issue Mar 11, 2019 · 5 comments · Fixed by #40260
Closed
Assignees
Labels

Comments

@broodfusion
Copy link

Elasticsearch version (bin/elasticsearch --version): 6.6.1

Plugins installed: []

JVM version (java -version):

java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

OS version (uname -a if on a Unix-like system):
macOS Mojave 10.14.3

Description of the problem including expected versus actual behavior:
Not sure if this is a current limitation as I could not find any documentation explicitly stating this issue as described below:

When performing an SQL query using aggregate functions such as COUNT(), AVG() with LIKE and NOT LIKE, the actual performed query is not as expected.

Using the translate API, we can see the issue more clearly in Kibana:

POST /_xpack/sql/translate
{
  "query": "SELECT COUNT(*), station.keyword as station FROM call WHERE station.keyword LIKE 'B 4E%' and station.keyword NOT LIKE 'B 4E D%' GROUP BY station.keyword"
}

In the below response, we see that the wildcard in must and wildcard in must_not are exactly the same. Therefore the expected query data will not be returned correctly.

Response:

{
  "size" : 0,
  "query" : {
    "bool" : {
      "must" : [
        {
          "wildcard" : {
            "station.keyword" : {
              "wildcard" : "B 4E*",
              "boost" : 1.0
            }
          }
        },
        {
          "bool" : {
            "must_not" : [
              {
                "wildcard" : {
                  "station.keyword" : {
                    "wildcard" : "B 4E*",
                    "boost" : 1.0
                  }
                }
              }
            ],
            "adjust_pure_negative" : true,
            "boost" : 1.0
          }
        }
      ],
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }
  },
  "_source" : false,
  "stored_fields" : "_none_",
  "aggregations" : {
    "groupby" : {
      "composite" : {
        "size" : 1000,
        "sources" : [
          {
            "21587" : {
              "terms" : {
                "field" : "station.keyword",
                "missing_bucket" : true,
                "order" : "asc"
              }
            }
          }
        ]
      }
    }
  }
}

Expected:

{
  "size" : 0,
  "query" : {
    "bool" : {
      "must" : [
        {
          "wildcard" : {
            "station.keyword" : {
              "wildcard" : "B 4E*",
              "boost" : 1.0
            }
          }
        },
        {
          "bool" : {
            "must_not" : [
              {
                "wildcard" : {
                  "station.keyword" : {
                    "wildcard" : "B 4E D*",
                    "boost" : 1.0
                  }
                }
              }
            ],
            "adjust_pure_negative" : true,
            "boost" : 1.0
          }
        }
      ],
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }
  },
  "_source" : false,
  "stored_fields" : "_none_",
  "aggregations" : {
    "groupby" : {
      "composite" : {
        "size" : 1000,
        "sources" : [
          {
            "21587" : {
              "terms" : {
                "field" : "station.keyword",
                "missing_bucket" : true,
                "order" : "asc"
              }
            }
          }
        ]
      }
    }
  }
}

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query etc. The easier you make for
us to reproduce it, the more likely that somebody will take the time to look at it.

  1. create an index named call and a field station that has mapping:
"station" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
  1. Seed some data for station (optional)
B 4E Ph 1
B 4E Ph 2
B 4E D Ph 1
  1. Using the translate API, check the result of using the below
POST /_xpack/sql/translate
{
  "query": "SELECT COUNT(*), station.keyword as station FROM call WHERE station.keyword LIKE 'B 4E%' and station.keyword NOT LIKE 'B 4E D%' GROUP BY station.keyword"
}

Provide logs (if relevant):

@colings86 colings86 added the :Analytics/SQL SQL querying label Mar 12, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

@costin
Copy link
Member

costin commented Mar 16, 2019

@broodfusion First off, thank you for the detailed issue report!
Can you please upgrade to 6.6.2? It contains two fixes related to both negation and LIKE (#39443) which might address your issue.

FWIW, I've tried your query against master and the query is generated correctly:

 "query" : {
    "bool" : {
      "must" : [
        {
          "query_string" : {
            "query" : "B 4E*",
            "fields" : [
              "first_name^1.0"
            ],
            "type" : "best_fields",
            "default_operator" : "or",
            "max_determinized_states" : 10000,
            "enable_position_increments" : true,
            "fuzziness" : "AUTO",
            "fuzzy_prefix_length" : 0,
            "fuzzy_max_expansions" : 50,
            "phrase_slop" : 0,
            "escape" : false,
            "auto_generate_synonyms_phrase_query" : true,
            "fuzzy_transpositions" : true,
            "boost" : 1.0
          }
        },
        {
          "bool" : {
            "must_not" : [
              {
                "query_string" : {
                  "query" : "B 4E D*",
                  "fields" : [
                    "last_name^1.0"
                  ],
                  "type" : "best_fields",
                  "default_operator" : "or",
                  "max_determinized_states" : 10000,
                  "enable_position_increments" : true,
                  "fuzziness" : "AUTO",
                  "fuzzy_prefix_length" : 0,
                  "fuzzy_max_expansions" : 50,
                  "phrase_slop" : 0,
                  "escape" : false,
                  "auto_generate_synonyms_phrase_query" : true,
                  "fuzzy_transpositions" : true,
                  "boost" : 1.0
                }
              }
            ],
            "adjust_pure_negative" : true,
            "boost" : 1.0
          }
        }
      ],
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }

Thank you.

@broodfusion
Copy link
Author

broodfusion commented Mar 16, 2019

Hi @costin thanks for the response. However it looks like you're not using it with aggregate functions in the above query.

This works fine without aggregate functions, but when used in combination with aggregate functions such as SUM(), COUNT(), etc, it breaks. I just tried it with version 6.6.2 and it still doesn't give the correct output.

Try doing something like this:

POST /_xpack/sql/translate
{
  "query": "SELECT COUNT(*) as count, station.keyword as station FROM call WHERE station.keyword like 'B 4E%' and station.keyword not like 'B 4E D%' group by station.keyword"
}

Output from Kibana

{
  "size" : 0,
  "query" : {
    "bool" : {
      "must" : [
        {
          "query_string" : {
            "query" : "B 4E*",
            "fields" : [
              "station.keyword^1.0"
            ],
            "type" : "best_fields",
            "default_operator" : "or",
            "max_determinized_states" : 10000,
            "enable_position_increments" : true,
            "fuzziness" : "AUTO",
            "fuzzy_prefix_length" : 0,
            "fuzzy_max_expansions" : 50,
            "phrase_slop" : 0,
            "escape" : false,
            "auto_generate_synonyms_phrase_query" : true,
            "fuzzy_transpositions" : true,
            "boost" : 1.0
          }
        },
        {
          "bool" : {
            "must_not" : [
              {
                "query_string" : {
                  "query" : "B 4E*",
                  "fields" : [
                    "station.keyword^1.0"
                  ],
                  "type" : "best_fields",
                  "default_operator" : "or",
                  "max_determinized_states" : 10000,
                  "enable_position_increments" : true,
                  "fuzziness" : "AUTO",
                  "fuzzy_prefix_length" : 0,
                  "fuzzy_max_expansions" : 50,
                  "phrase_slop" : 0,
                  "escape" : false,
                  "auto_generate_synonyms_phrase_query" : true,
                  "fuzzy_transpositions" : true,
                  "boost" : 1.0
                }
              }
            ],
            "adjust_pure_negative" : true,
            "boost" : 1.0
          }
        }
      ],
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }
  },
  "_source" : false,
  "stored_fields" : "_none_",
  "aggregations" : {
    "groupby" : {
      "composite" : {
        "size" : 1000,
        "sources" : [
          {
            "53" : {
              "terms" : {
                "field" : "station.keyword",
                "missing_bucket" : true,
                "order" : "asc"
              }
            }
          }
        ]
      }
    }
  }
}

Thanks for looking into this!

@astefan
Copy link
Contributor

astefan commented Mar 20, 2019

Actually, the presence of GROUP BY station.keyword is enough to trigger the behavior, and not necessarily the use of a COUNT or MAX.

@astefan
Copy link
Contributor

astefan commented Mar 21, 2019

master (8.0.0): 4e5d5af
7.x (7.1.0): 1a5ff05
7.0 (7.0.0) : 67a0116
6.7 (6.7.1): c1999c7

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants