
Data misplaced when reading a table that does not have the same field positions as the Spark schema #367

Closed · azaroui opened this issue Apr 16, 2021 · 5 comments

azaroui commented Apr 16, 2021

I am trying to create a Dataset from a BigQuery table. The table has the same fields as the case class, but not in the same order. When the Dataset is created, the columns are mapped to the wrong fields.

Given this table:
[screenshot of the table: a str column plus a record l1 with l1.int1 = 1, l1.int2 = 2, l1.int3 = 3]

When loading the dataset:


import org.apache.spark.sql.Encoders
import spark.implicits._

case class NestedClass(
  int3: Int,
  int1: Int,
  int2: Int)

case class ClassA(str: String, l1: NestedClass)

val schema = Encoders.product[ClassA].schema

val ds2 = spark.read
  .schema(schema)
  .option("table", "customers_sale.test_table")
  .format("com.google.cloud.spark.bigquery")
  .load()
  .as[ClassA]

ds2.map(_.l1).show(false)

NB: NestedClass has the same fields as the table, but in a different order: (int3, int1, int2) instead of (int1, int2, int3).

We get this:

+----+----+----+
|int1|int2|int3|
+----+----+----+
|2   |3   |1   |
+----+----+----+

We expect to get this:

+----+----+----+
|int1|int2|int3|
+----+----+----+
|1   |2   |3   |
+----+----+----+

=> The connector does not use field names to assign values; it correlates each field's position in the case class with the field at the same position in the table. That is why int3 receives 1, int1 receives 2, and int2 receives 3 in the output above.
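
A possible workaround until the connector handles this (a minimal sketch, not from the thread: it reads without a user-supplied schema and rebuilds the nested column by field name, casting BigQuery's INT64 down to Int to match the case class):

import org.apache.spark.sql.functions.{col, struct}

val fixed = spark.read
  .option("table", "customers_sale.test_table")
  .format("com.google.cloud.spark.bigquery")
  .load()
  .select(
    col("str"),
    struct(
      col("l1.int3").cast("int").as("int3"),
      col("l1.int1").cast("int").as("int1"),
      col("l1.int2").cast("int").as("int2")).as("l1"))
  .as[ClassA]

fixed.map(_.l1).show(false)

Because each nested field is selected by name, the result no longer depends on the positional layout of the table.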

davidrabinowitz (Member) commented

It seems that the data has been written correctly into the table. What happens when you read the table without specifying the schema? Also notice that the order of the l1 fields you've specified is int3, int1, int2, which correlates with the values. Can you please fix the order and try again?

@davidrabinowitz davidrabinowitz self-assigned this Apr 21, 2021
@davidrabinowitz davidrabinowitz added the question Further information is requested label Apr 21, 2021
azaroui (Author) commented Apr 21, 2021

> It seems that the data has been written correctly into the table. What happens when you read the table without specifying the schema? Also notice that the order of the l1 fields you've specified is int3, int1, int2, which correlates with the values. Can you please fix the order and try again?

I purposely changed the order of the attributes to reproduce the problem I had on our project: the connector does not use field names to assign values, but rather their positions.

Without specifying the schema, we get this:

 Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `l1`.`int3` from bigint to int as it may truncate
 The type path of the target object is:
 - field (class: "scala.Int", name: "int3")
 - field (class: "com.carrefour.phenix.transactions.offline.persister.NestedClass", name: "l1")
 - root class: "com.carrefour.phenix.transactions.offline.persister.ClassA"
 You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
 	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2643)
 	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$38$$anonfun$applyOrElse$11.applyOrElse(Analyzer.scala:2659)
 	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$38$$anonfun$applyOrElse$11.applyOrElse(Analyzer.scala:2654)
 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:258)
 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:258)
 	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:257)

I have updated the question to make it more precise.
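
Side note on the error above: BigQuery INT64 surfaces in Spark as LongType, which is why Spark refuses to down-cast to Int. A minimal sketch of one way around it (assuming Long is acceptable in the model, and that Spark resolves nested fields by name when no user schema is supplied) is to declare the fields as Long:

case class NestedClassL(int3: Long, int1: Long, int2: Long)
case class ClassAL(str: String, l1: NestedClassL)

val dsLong = spark.read
  .option("table", "customers_sale.test_table")
  .format("com.google.cloud.spark.bigquery")
  .load()
  .as[ClassAL]  // no user-supplied schema: the table keeps its own layout

dsLong.map(_.l1).show(false)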

emkornfield (Collaborator) commented

If the schema here is a nested struct, then I think we have a gap in the conversion logic:

namesInOrder.stream()
    .map(root::getVector)
    .map(ArrowSchemaConverter::new)
    .toArray(ColumnVector[]::new);

only rearranges top-level columns. I'm not sure whether the right metadata is actually passed down by Spark to correct this problem, though.
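
To illustrate the gap (a conceptual Scala sketch, not the connector's actual Java code; all names are illustrative): a name-based reordering would have to recurse into struct children instead of stopping at the top level:

import org.apache.spark.sql.types.{StructField, StructType}

// Rearrange bqSchema's fields to match userSchema, matching by name
// at every nesting level rather than by position.
def reorderByName(bqSchema: StructType, userSchema: StructType): StructType =
  StructType(userSchema.fields.map { userField =>
    val bqField = bqSchema(userField.name) // look up by name, not position
    (bqField.dataType, userField.dataType) match {
      case (b: StructType, u: StructType) =>
        bqField.copy(dataType = reorderByName(b, u)) // recurse into nested structs
      case _ => bqField
    }
  })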

LaurentValdenaire commented

We indeed have an issue with nested structs.

@davidrabinowitz davidrabinowitz removed the question Further information is requested label May 4, 2021
himanshukohli09 added a commit to himanshukohli09/spark-bigquery-connector that referenced this issue May 13, 2021
… the column order of struct variables need not be same as that of BQ schema
davidrabinowitz pushed a commit that referenced this issue May 13, 2021
Column order of struct variables doesn't need to be the same as that of BigQuery schema
davidrabinowitz (Member) commented

Fixed by PR #391
