Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for OR and IN operator #186

Merged
merged 19 commits into from
May 12, 2023

Conversation

osopardo1
Copy link
Member

@osopardo1 osopardo1 commented Apr 25, 2023

Description

This PR fixes #183 .

Type of change

It includes additional functionalities on the query package that processes the OR query operators, creating several query spaces to iterate and union.

The pipeline is the following:

  1. SparkFilters (a sequence of Expression) are passed to the OTreeIndex from Spark Query Plan.

  2. We create a QuerySpecBuilder.

  3. To filter the files, we load all the revisions from QbeastSnapshot and we build a QuerySpec for each of them.

  4. Now, we can have several QuerySpecs for a single Revision. Since the predicates contain OR's, our queries are divided into smaller filters that can be run independently. To create the QuerySpecs, we do:

    i. Split Query Filters and Weight Filters. Weight filters are those which indicates the size of the sample. Query Filters are those that involve any of the indexed columns.
    ii. Split the disjunctive and conjunctive predicates. The predicates are parsed as a sequence of ANDs. The OR's are parsed into a single Expression.
    iii. Create a single QuerySpec with the conjunctive predicates (AND).
    iv. For each disjunctive predicate (OR), create a different QuerySpec.

  5. We run each query and we union the results.

For the IN predicate, we pre-process the filters in QuerySpecBuilder as follows:

  1. Match the IN filter passed to the datasource.
  2. We can manage IN as a sequence of OR filtering, each one retrieving one slice of the space. But this is not scalable when we have too many values in the set.
  3. A solution is to find a range (min, max) and retrieve the blocks belonging to that QuerySpaceFromTo. When it comes to a String indexed column, finding a range could be difficult because Strings are hashed and, as far as I know, Murmur3Hash does not respect ordering. To solve that, we must:
    • Hash the values inside the IN set.
    • Select the min and max of those values.
    • Initialise the QuerySpaceFromTo.

Checklist:

Here is the list of things you should do before submitting this pull request:

  • New feature / bug fix has been committed following the Contribution guide.
  • Add comments to the code (make it easier for the community!).
  • Add tests.
  • Your branch is updated to the main branch (dependent changes have been merged).

How Has This Been Tested? (Optional)

This has been tested in a separate class io.qbeast.spark.index.query.DisjunctiveQuerySpecTest , but also on io.qbeast.spark.utils.QbeastFilterPushdownTest

Test Configuration:

  • Spark Version: 3.3.0
  • Hadoop Version: 3.3.4
  • Cluster or local? Local

@Adricu8
Copy link
Contributor

Adricu8 commented Apr 25, 2023

Does operator ISIN behave in the same way and is also pushed-down?
In theory, they should be treated in the same way under the hood.
We could add some test to verify it

Example:

SELECT *
FROM my_table
WHERE column1 = 'value1' OR column2 = 'value2' OR column3 = 'value3'

-- AND

SELECT *
FROM my_table
WHERE column1 IN ('value1', 'value2', 'value3')

should be equivalent

@osopardo1
Copy link
Member Author

Does operator ISIN behave in the same way and is also pushed-down? In theory, they should be treated in the same way under the hood. We could add some test to verify it

Example:

SELECT *
FROM my_table
WHERE column1 = 'value1' OR column2 = 'value2' OR column3 = 'value3'

-- AND

SELECT *
FROM my_table
WHERE column1 IN ('value1', 'value2', 'value3')

should be equivalent

Yep, thanks for the suggestion!

@osopardo1
Copy link
Member Author

After a quick test I can confirm that the IN predicate is also pushdown to the OTreeIndex. I would write a function to transform the IN to a AND/OR predicate before building the QuerySpec.

Copy link
Contributor

@Adricu8 Adricu8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes seem good overall! Thanks for the fast work @osopardo1 ! 💯

@codecov
Copy link

codecov bot commented May 2, 2023

Codecov Report

Merging #186 (1e4c87d) into main (06652c5) will decrease coverage by 0.29%.
The diff coverage is 87.17%.

❗ Current head 1e4c87d differs from pull request most recent head 4bd4848. Consider uploading reports for the commit 4bd4848 to get more accurate results

@@            Coverage Diff             @@
##             main     #186      +/-   ##
==========================================
- Coverage   94.07%   93.78%   -0.29%     
==========================================
  Files          84       85       +1     
  Lines        2058     2107      +49     
  Branches      170      175       +5     
==========================================
+ Hits         1936     1976      +40     
- Misses        122      131       +9     
Impacted Files Coverage Δ
.../main/scala/io/qbeast/spark/delta/OTreeIndex.scala 86.79% <ø> (ø)
...c/main/scala/io/qbeast/core/model/QuerySpace.scala 80.95% <42.85%> (-19.05%) ⬇️
...o/qbeast/spark/index/query/QueryFiltersUtils.scala 89.47% <89.47%> (ø)
.../main/scala/io/qbeast/core/model/QbeastBlock.scala 100.00% <100.00%> (ø)
...ala/io/qbeast/spark/delta/IndexStatusBuilder.scala 100.00% <100.00%> (ø)
...cala/io/qbeast/spark/delta/QbeastMetadataSQL.scala 100.00% <100.00%> (ø)
...la/io/qbeast/spark/index/query/QueryExecutor.scala 97.22% <100.00%> (+0.16%) ⬆️
...io/qbeast/spark/index/query/QuerySpecBuilder.scala 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@osopardo1
Copy link
Member Author

osopardo1 commented May 3, 2023

Just a quick summary of the last updates:

  • We encountered some bugs when doing pushdown on the IN predicate in case of String columns.
  • IN predicates are processed as follows:
    • Match the IN filter passed to the datasource.
    • We can manage IN as a sequence of OR filtering, each one retrieving one slice of the space. But this is not scalable when we have too many values in the set.
    • A solution is to find a range (min, max) and retrieve the blocks belonging to that QuerySpaceFromTo.

Problem: String Columns are Hashed before indexing. Hash do not preserve ordering on Strings, so when retrieving a space >= and/or <= than the column, we might miss some records.

Partial solution: pre-process the IN sequence by hashing before retrieving (min, max). This solves the problem with IN predicate, but does not guarantee that filters that explicitely involve >= and <= would retrieve the correct set of rows.

Let's see an example.

  1. domain IN (a, b, c) -> We do the transformation. Let's imagine that each hash corresponds to: (a = 1, b = 5, c = 2) and we choose min and max following the numbers. In this case would be a and b (1 and 5). It's ok because it also includes c.
  2. domain > a and domain < c -> We do not transform the space. Min and max in this case would be a and c (1 and 2). b is not included in the range values. A possible solution is to pre-process those predicates too.
  3. domain > b. In this case, we would search all the values > 5. This would not include c, which has a hash value of 2. I do not have a solution for that, except to return the whole set of files.

@osopardo1 osopardo1 changed the title Support for OR operator Support for OR and IN operator May 4, 2023
@osopardo1
Copy link
Member Author

osopardo1 commented May 4, 2023

Another quick update:

  • After discussion, we think that we should discard Range Predicates on Strings in the moment. We would filter out those Spark Filters that contain any range expression involving strings before initialising the QuerySpace. (Following the example above, filters domain > a and domain < c as well as domain > b would be filtered in memory).

  • We would provide an Spark Option to enhance and use those filters if the users can tolerate approximate results.

Since we cannot ensure all records are scanned in that situation, it is better to separate them for the moment
@osopardo1
Copy link
Member Author

Could any of you approve this PR? So we can move forward on #188

@alexeiakimov @Adricu8

Thank you!

@osopardo1 osopardo1 requested review from alexeiakimov and Adricu8 May 10, 2023 08:23
Copy link
Contributor

@Adricu8 Adricu8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont have more comments on this PR. Thanks Paola

@osopardo1 osopardo1 merged commit 21e2b47 into Qbeast-io:main May 12, 2023
@osopardo1 osopardo1 deleted the 183-support-or-operator branch August 2, 2023 06:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support OR operator to filter dataframes
3 participants