
Filtering by equality of string leads to a wrong traversal of the tree #190

Closed
Adricu8 opened this issue May 5, 2023 · 4 comments
Labels
type: bug Something isn't working

Comments

@Adricu8
Contributor

Adricu8 commented May 5, 2023

1. What went wrong?

When we filter on a string column indexed by qbeast-spark, instead of traversing a single branch of the tree, the query also visits sibling cubes that should in theory be skipped.

Example use-case

spark.sql("select domain from db.some_table where domain='www.example.domain.com'").show(false)
Output cubes: Vector(CubeId(2, 0, ), CubeId(2, 1, g), CubeId(2, 2, gw), CubeId(2, 3, gww), CubeId(2, 4, gwwQ), CubeId(2, 3, gwQ), CubeId(2, 4, gwQw), CubeId(2, 4, gwQQ), CubeId(2, 5, gwQQw), CubeId(2, 5, gwQQQ), CubeId(2, 2, gQ), CubeId(2, 3, gQw), CubeId(2, 4, gQww), CubeId(2, 5, gQwww), CubeId(2, 5, gQwwQ), CubeId(2, 4, gQwQ), CubeId(2, 5, gQwQw), CubeId(2, 5, gQwQQ), CubeId(2, 3, gQQ), CubeId(2, 4, gQQw), CubeId(2, 1, A))

The previous query outputs 1 record, yet it reads 160 GB (out of 3.5 TB).
For example, the sibling cubes CubeId(2, 4, gwwQ) and CubeId(2, 4, gwQw) are both visited.
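For context, here is a toy sketch (not qbeast code; the object name and the 'w'/'Q' labels are made up to echo the cube ids above, and the real encoding differs) of why an equality predicate should yield a single root-to-leaf path: at each level exactly one child cube contains the point, so every visited cube is a prefix of one path, and two siblings like gwwQ and gwQw can never both be prefixes of the same path.

```scala
// Toy sketch -- not qbeast code. The 'w'/'Q' labels are chosen only to
// echo the cube ids in the output above; the real encoding differs.
object PointQueryPath {
  // Descend `depth` levels of a binary split over [0, 1), appending 'w'
  // for the lower half and 'Q' for the upper half. An equality predicate
  // pins the point, so exactly one child is chosen per level.
  def path(x: Double, depth: Int): String = {
    var lo = 0.0
    var hi = 1.0
    val sb = new StringBuilder
    for (_ <- 0 until depth) {
      val mid = (lo + hi) / 2
      if (x < mid) { sb.append('w'); hi = mid }
      else { sb.append('Q'); lo = mid }
    }
    sb.toString
  }

  def main(args: Array[String]): Unit = {
    val p = path(0.3, 5)
    println(p) // prints wQwwQ
    // Every cube a point query should visit is a prefix of this one path:
    println((0 to 5).map(p.take).mkString(", "))
  }
}
```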

2. Branch and commit id:

main branch

3. Spark version:

3.2.2

4. Hadoop version:

3.2.2

5. How are you running Spark?

on kubernetes

@Adricu8 Adricu8 added the type: bug Something isn't working label May 5, 2023
@osopardo1
Member

osopardo1 commented May 5, 2023

Thanks for creating the issue!

After merging #189 we can debug this behaviour better, and I can create a reproducible example of it.

@osopardo1
Member

osopardo1 commented May 8, 2023

Example test after merging #189 :

        val data = spark.read
          .format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("src/test/resources/ecommerce100K_2019_Oct.csv")

        data.write
          .format("qbeast")
          .option("columnsToIndex", "brand, user_id")
          .option("cubeSize", "10000")
          .save(tmpDir)

        val qbeastSnapshot = DeltaQbeastSnapshot(DeltaLog.forTable(spark, tmpDir).snapshot)
        val querySpecBuilder = new QuerySpecBuilder(Seq(expr("brand == 'versace'").expr))
        val queryExecutor = new QueryExecutor(querySpecBuilder, qbeastSnapshot)
        val files = queryExecutor.execute()

        // scalastyle:off
        files.foreach(block => println(block.cube))

The output is:

Q
Qw
QwQ
QwA
Qg
A

I've noticed that this happens when the index involves more than one dimension. When only brand is indexed, the correct branch is selected.

Could it be the root cause of this issue? @alexeiakimov
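A possible explanation worth checking (a hypothetical sketch, not qbeast internals; the object name, intervals, and the normalized value for 'versace' are all made up): with two indexed columns, brand == 'versace' pins only one coordinate, so the query box spans the full user_id range and legitimately intersects two of the four children at each level. The suspect behaviour would then be crossing into the sibling brand interval, not the extra user_id children.

```scala
// Hypothetical sketch -- not qbeast internals. Cube geometry and the
// normalized value of 'versace' are made up for illustration.
object MultiDimPointQuery {
  final case class Interval(lo: Double, hi: Double) {
    def half(upper: Boolean): Interval = {
      val mid = (lo + hi) / 2
      if (upper) Interval(mid, hi) else Interval(lo, mid)
    }
    def contains(x: Double): Boolean = lo <= x && x < hi
  }

  // Count the children of the root cube intersecting the query box
  // {brand = brandPoint} x [0, 1): brand is constrained to a point,
  // while the predicate leaves user_id as the full range.
  def matchingChildren(brandPoint: Double): Int = {
    val children = for {
      b <- Seq(false, true) // brand half
      u <- Seq(false, true) // user_id half
    } yield (Interval(0.0, 1.0).half(b), Interval(0.0, 1.0).half(u))
    children.count { case (brandInterval, _) => brandInterval.contains(brandPoint) }
  }

  def main(args: Array[String]): Unit = {
    // Exactly one brand half matches, but BOTH user_id halves inside it do,
    // so 2 of 4 children are visited per level -- wider than the 1-D case,
    // yet still confined to a single brand subtree.
    println(matchingChildren(0.3)) // prints 2
  }
}
```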

@Jiaweihu08
Member

Is this still an issue, or should it be closed? @osopardo1 @Adricu8

@osopardo1
Member

This issue is no longer relevant to the current String Indexing implementation introduced in #215.

Closing it.
