
Filtering by equality of string leads to a wrong traversal of the tree #190

Closed
Adricu8 opened this issue May 5, 2023 · 4 comments
Labels
type: bug Something isn't working

Comments

@Adricu8
Contributor

Adricu8 commented May 5, 2023

1. What went wrong?

When we filter on a string column indexed by qbeast-spark, instead of traversing a single branch of the tree, the query also visits sibling cubes that should in theory be skipped.

Example use-case

spark.sql("select domain from db.some_table where domain='www.example.domain.com'").show(false)
Output cubes: Vector(CubeId(2, 0, ), CubeId(2, 1, g), CubeId(2, 2, gw), CubeId(2, 3, gww), CubeId(2, 4, gwwQ), CubeId(2, 3, gwQ), CubeId(2, 4, gwQw), CubeId(2, 4, gwQQ), CubeId(2, 5, gwQQw), CubeId(2, 5, gwQQQ), CubeId(2, 2, gQ), CubeId(2, 3, gQw), CubeId(2, 4, gQww), CubeId(2, 5, gQwww), CubeId(2, 5, gQwwQ), CubeId(2, 4, gQwQ), CubeId(2, 5, gQwQw), CubeId(2, 5, gQwQQ), CubeId(2, 3, gQQ), CubeId(2, 4, gQQw), CubeId(2, 1, A))

The previous query outputs 1 record, yet it reads 160 GB (out of 3.5 TB).
For example, the sibling cubes CubeId(2, 4, gwwQ) and CubeId(2, 4, gwQw) are both visited.
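For context, here is a toy sketch (not qbeast code; the object name and the 'w'/'Q' labels are made up to echo the cube ids above, and the real encoding differs) of why an equality predicate should yield a single root-to-leaf path: at each level exactly one child cube contains the point, so every visited cube is a prefix of one path, and two siblings like gwwQ and gwQw can never both be prefixes of the same path.

```scala
// Toy sketch -- not qbeast code. The 'w'/'Q' labels are chosen only to
// echo the cube ids in the output above; the real encoding differs.
object PointQueryPath {
  // Descend `depth` levels of a binary split over [0, 1), appending 'w'
  // for the lower half and 'Q' for the upper half. An equality predicate
  // pins the point, so exactly one child is chosen per level.
  def path(x: Double, depth: Int): String = {
    var lo = 0.0
    var hi = 1.0
    val sb = new StringBuilder
    for (_ <- 0 until depth) {
      val mid = (lo + hi) / 2
      if (x < mid) { sb.append('w'); hi = mid }
      else { sb.append('Q'); lo = mid }
    }
    sb.toString
  }

  def main(args: Array[String]): Unit = {
    val p = path(0.3, 5)
    println(p) // prints wQwwQ
    // Every cube a point query should visit is a prefix of this one path:
    println((0 to 5).map(p.take).mkString(", "))
  }
}
```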

2. Branch and commit id:

main branch

3. Spark version:

3.2.2

4. Hadoop version:

3.2.2

5. How are you running Spark?

on kubernetes

@Adricu8 Adricu8 added the type: bug Something isn't working label May 5, 2023
@osopardo1
Member

osopardo1 commented May 5, 2023

Thanks for creating the issue!

After merging #189 we can debug this behaviour better, and I can create a reproducible example of it.

@osopardo1
Member

osopardo1 commented May 8, 2023

Example test after merging #189 :

        val data = spark.read
          .format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("src/test/resources/ecommerce100K_2019_Oct.csv")

        data.write
          .format("qbeast")
          .option("columnsToIndex", "brand, user_id")
          .option("cubeSize", "10000")
          .save(tmpDir)

        val qbeastSnapshot = DeltaQbeastSnapshot(DeltaLog.forTable(spark, tmpDir).snapshot)
        val querySpecBuilder = new QuerySpecBuilder(Seq(expr("brand == 'versace'").expr))
        val queryExecutor = new QueryExecutor(querySpecBuilder, qbeastSnapshot)
        val files = queryExecutor.execute()

        // scalastyle:off
        files.foreach(block => println(block.cube))

The output is:

Q
Qw
QwQ
QwA
Qg
A

I've noticed that this happens when the index involves more than one dimension. When only brand is indexed, the correct branch is selected.

Could it be the root cause of this issue? @alexeiakimov
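A possible explanation worth checking (a hypothetical sketch, not qbeast internals; the object name, intervals, and the normalized value for 'versace' are all made up): with two indexed columns, brand == 'versace' pins only one coordinate, so the query box spans the full user_id range and legitimately intersects two of the four children at each level. The suspect behaviour would then be crossing into the sibling brand interval, not the extra user_id children.

```scala
// Hypothetical sketch -- not qbeast internals. Cube geometry and the
// normalized value of 'versace' are made up for illustration.
object MultiDimPointQuery {
  final case class Interval(lo: Double, hi: Double) {
    def half(upper: Boolean): Interval = {
      val mid = (lo + hi) / 2
      if (upper) Interval(mid, hi) else Interval(lo, mid)
    }
    def contains(x: Double): Boolean = lo <= x && x < hi
  }

  // Count the children of the root cube intersecting the query box
  // {brand = brandPoint} x [0, 1): brand is constrained to a point,
  // while the predicate leaves user_id as the full range.
  def matchingChildren(brandPoint: Double): Int = {
    val children = for {
      b <- Seq(false, true) // brand half
      u <- Seq(false, true) // user_id half
    } yield (Interval(0.0, 1.0).half(b), Interval(0.0, 1.0).half(u))
    children.count { case (brandInterval, _) => brandInterval.contains(brandPoint) }
  }

  def main(args: Array[String]): Unit = {
    // Exactly one brand half matches, but BOTH user_id halves inside it do,
    // so 2 of 4 children are visited per level -- wider than the 1-D case,
    // yet still confined to a single brand subtree.
    println(matchingChildren(0.3)) // prints 2
  }
}
```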

@Jiaweihu08
Member

Is this still an issue, or should it be closed? @osopardo1 @Adricu8

@osopardo1
Member

This issue is no longer relevant to the current String Indexing implementation introduced in #215.

Closing it.
