
Error on sampling when using <columnName>:<type> in columnsToIndex #352

Closed · osopardo1 opened this issue on Jul 23, 2024 · 0 comments · Fixed by #355
Labels: type: bug (Something isn't working)

osopardo1 (Member) commented on Jul 23, 2024

What went wrong?

When running a TABLESAMPLE query over a table with a histogram-typed column, the following error appears:


None.get
java.util.NoSuchElementException: None.get
	at scala.None$.get(Option.scala:529)
	at scala.None$.get(Option.scala:527)
	at io.qbeast.spark.internal.rules.SampleRule.$anonfun$transformSampleToFilter$1(SampleRule.scala:77)

The error is due to incorrect parsing of the columnsToIndex property in QbeastRelation:

object QbeastRelation {

  def unapply(plan: LogicalPlan): Option[(LogicalRelation, IndexedColumns)] = plan match {
    case l @ LogicalRelation(
          q @ HadoopFsRelation(o: DefaultFileIndex, _, _, _, _, parameters),
          _,
          _,
          _) =>
      val columnsToIndex = parameters("columnsToIndex")
      // Splitting only on "," leaves the ":<type>" suffix attached to the column name
      Some((l, columnsToIndex.split(",")))
    case _ => None
  }

}
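With columnsToIndex set to "id,name:histogram", split(",") yields Array("id", "name:histogram"), so SampleRule later looks up a transformation for a column literally named "name:histogram" and fails with None.get. A minimal sketch of a parser that strips the optional :<type> suffix (an illustration only, not necessarily the change made in #355):

// Sketch: drop an optional ":<type>" suffix from each entry, so that
// "id,name:histogram" parses to Seq("id", "name").
def parseColumnsToIndex(spec: String): Seq[String] =
  spec.split(",").toSeq.map(_.split(":").head.trim)

// parseColumnsToIndex("id,name:histogram") == Seq("id", "name")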

How to reproduce?

Steps to reproduce the problem.

1. Code that triggered the bug, or steps to reproduce:

val data = 1.to(10).map(i => (i, s"$i")).toDF("id", "name")
data.write
  .format("qbeast")
  .option("columnsToIndex", "id,name:histogram")
  .option("cubeSize", "100")
  .option("columnStats", s"""{"id_min": 0, "id_max": 100}""")
  .saveAsTable("qbeast")

spark.sql("SELECT * FROM qbeast TABLESAMPLE(10 PERCENT)").show(false)

2. Branch and commit id:

main at 9b47ef5

3. Spark version:

On the Spark shell, run spark.version.

3.5.0

4. Hadoop version:

On the Spark shell, run org.apache.hadoop.util.VersionInfo.getVersion().

3.3.4

5. How are you running Spark?

Are you running Spark inside a container? Are you launching the app on a remote K8s cluster? Or are you just running the tests on a local computer?

Local

6. Stack trace:

Trace of the log/error messages: see the None.get stack trace above. A possible workaround sketch follows below.
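Until the :<type> suffix is handled, a possible workaround is to index the columns without type annotations, assuming the default transformer is acceptable for name. A minimal sketch (run on the Spark shell, where spark.implicits are in scope):

// Workaround sketch: omit the ":histogram" annotation so columnsToIndex
// parses cleanly; "name" falls back to the default transformer.
val data = 1.to(10).map(i => (i, s"$i")).toDF("id", "name")
data.write
  .format("qbeast")
  .option("columnsToIndex", "id,name")
  .option("cubeSize", "100")
  .saveAsTable("qbeast_no_hist")

// Sampling now resolves both indexed columns and should not hit None.get
spark.sql("SELECT * FROM qbeast_no_hist TABLESAMPLE(10 PERCENT)").show(false)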
