Filtering by equality of string leads to a wrong traversal of the tree #190
Comments
Thanks for creating the issue! After merging #189 we can debug this behaviour better, and I could create a reproducible example of it.

Example test after merging #189:

```scala
val data = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("src/test/resources/ecommerce100K_2019_Oct.csv")

data.write
  .format("qbeast")
  .option("columnsToIndex", "brand, user_id")
  .option("cubeSize", "10000")
  .save(tmpDir)

val qbeastSnapshot = DeltaQbeastSnapshot(DeltaLog.forTable(spark, tmpDir).snapshot)
val querySpecBuilder = new QuerySpecBuilder(Seq(expr("brand == 'versace'").expr))
val queryExecutor = new QueryExecutor(querySpecBuilder, qbeastSnapshot)
val files = queryExecutor.execute()

// scalastyle:off
files.foreach(block => println(block.cube))
```

The output shows that sibling cubes outside the target branch are also visited.
I've noticed that this happens when the index involves more than one dimension. When indexing only brand, the output selects the branch correctly (see the sketch below). Could this be the root cause of the issue? @alexeiakimov
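For reference, a minimal sketch of that single-dimension variant, reusing the `data` and `tmpDir` from the test above; only the `columnsToIndex` value changes, and nothing else here is taken from the original report:

```scala
// Hypothetical single-column variant of the reproducible example:
// indexing only brand. Per the comment above, the query then
// traverses the correct branch without visiting sibling cubes.
data.write
  .format("qbeast")
  .option("columnsToIndex", "brand")
  .option("cubeSize", "10000")
  .save(tmpDir)
```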
Is this still an issue, or should it be closed? @osopardo1 @Adricu8
This issue is no longer relevant for the current String Indexing implementation in #215. Closing it. |
What went wrong?
When we filter by equality on a string column indexed with qbeast-spark, we find that instead of traversing a single branch of the tree, the query also visits sibling cubes that should in theory be skipped. A query of the kind sketched below triggers the behaviour.
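A minimal sketch of such a query, assuming a qbeast table written as in the reproducible example above (the filter value 'versace' comes from that example; the table path is a placeholder):

```scala
// Hypothetical equality filter on an indexed string column. With correct
// pruning, only the branch whose cubes can contain 'versace' should be
// traversed; the reported bug is that sibling cubes are read as well.
val df = spark.read.format("qbeast").load("/path/to/qbeast/table")
df.filter("brand == 'versace'").show()
```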
1. Example use-case:
The previous query outputs 1 record, yet it reads 160 GB (out of 3.5 TB). For example, the sibling cubes CubeId(2, 4, gwwQ) and CubeId(2, 4, gwQw) are both visited.
2. Branch and commit id:
main branch
3. Spark version:
3.2.2
4. Hadoop version:
3.2.2
5. How are you running Spark?
On Kubernetes.