
Match Error on Filtering indexed String columns #58

Closed · osopardo1 opened this issue Dec 22, 2021 · 1 comment
Labels: type: bug Something isn't working
@osopardo1 (Member) commented:
What went wrong?
When querying a string-indexed column, the Spark internal type UTF8String is not recognized by the Transformation method, which throws a MatchError.

This is the result of filtering the e-commerce dataset, indexed with qbeast, by "brand == 'versace'":

scala.MatchError: versace (of class org.apache.spark.unsafe.types.UTF8String)
	at io.qbeast.core.transform.HashTransformation.transform(HashTransformation.scala:11)
	at io.qbeast.core.model.QuerySpaceFromTo$.$anonfun$apply$1(QuerySpace.scala:68)
	at scala.collection.immutable.List.map(List.scala:293)
	at io.qbeast.core.model.QuerySpaceFromTo$.apply(QuerySpace.scala:67)
	at io.qbeast.spark.index.query.QuerySpecBuilder.extractQuerySpace(QuerySpecBuilder.scala:107)
	at io.qbeast.spark.index.query.QuerySpecBuilder.build(QuerySpecBuilder.scala:144)
	at io.qbeast.spark.index.query.QueryExecutor.$anonfun$execute$1(QueryExecutor.scala:22)

The solution is to detect the Spark internal type before calling core functions and convert the value to its String representation.
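The proposed fix can be sketched as follows. This is a minimal sketch, not the actual patch: `SparkUTF8String` is a hypothetical stand-in for `org.apache.spark.unsafe.types.UTF8String` so the snippet runs without Spark on the classpath, and `transform` only mimics the shape of `HashTransformation.transform`, not its real hashing.

```scala
// Sketch of the proposed fix: normalize Spark's internal UTF8String into a
// plain java.lang.String before the value reaches core transformations.
// SparkUTF8String is a stand-in for org.apache.spark.unsafe.types.UTF8String.
final case class SparkUTF8String(bytes: Array[Byte]) {
  override def toString: String = new String(bytes, "UTF-8")
}

// Detect the Spark internal type and convert it to its String representation;
// all other values pass through unchanged.
def normalize(value: Any): Any = value match {
  case s: SparkUTF8String => s.toString // now a java.lang.String
  case other              => other
}

// A core transformation that matches on String (as HashTransformation does)
// no longer hits a MatchError once values are normalized first.
def transform(value: Any): Int = normalize(value) match {
  case s: String => s.hashCode
  case n: Number => n.intValue()
}
```

With this normalization in place, `transform(SparkUTF8String("versace".getBytes("UTF-8")))` takes the String branch instead of falling through to a MatchError.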

How to reproduce?

  1. Code that triggered the bug, or steps to reproduce:
    val tmpDir = "/tmp/qbeast"

    val data = spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("src/test/resources/ecommerce100K_2019_Oct.csv")
      .distinct()
      .na.drop()

    data.write
      .mode("overwrite")
      .format("qbeast")
      .options(Map("columnsToIndex" -> "brand,product_id", "cubeSize" -> "10000"))
      .save(tmpDir)

    val indexed = spark.read.format("qbeast").load(tmpDir)
    indexed.filter("brand == 'versace'").show()
  2. Branch and commit id:

    main on c182980
  3. Spark version:
    On the spark shell run spark.version.

    3.1.2

  4. Hadoop version:
    On the spark shell run org.apache.hadoop.util.VersionInfo.getVersion().

    3.2.0

  5. Are you running Spark inside a container? Are you launching the app on a remote K8s cluster? Or are you just running the tests on a local computer?

    On a local computer

  6. Stack trace: included above.
@osopardo1 osopardo1 added the type: bug Something isn't working label Dec 22, 2021
@osopardo1 osopardo1 added the high label Dec 22, 2021
@osopardo1 osopardo1 self-assigned this Dec 22, 2021
@eavilaes (Contributor) commented:

Closing per #59
