
Data misplaced when reading a table that does not have the same field positions as the Spark schema #367

Closed · azaroui opened this issue Apr 16, 2021 · 5 comments

azaroui commented Apr 16, 2021

I am trying to create a Dataset from a BigQuery table. The table has the same fields as the case class, but not in the same order. When the Dataset is created, the columns are mapped to the wrong fields.

Given this table:
[screenshot of the table: a str column plus a record l1 with l1.int1 = 1, l1.int2 = 2, l1.int3 = 3]

When loading the dataset:


import org.apache.spark.sql.Encoders
import spark.implicits._

case class NestedClass(
  int3: Int,
  int1: Int,
  int2: Int)

case class ClassA(str: String, l1: NestedClass)

val schema = Encoders.product[ClassA].schema

val ds2 = spark.read
  .schema(schema)
  .option("table", "customers_sale.test_table")
  .format("com.google.cloud.spark.bigquery")
  .load()
  .as[ClassA]

ds2.map(_.l1).show(false)

NB: NestedClass has the same fields as the table, but in a different order: (int3, int1, int2) instead of (int1, int2, int3).

We get this:

+----+----+----+
|int1|int2|int3|
+----+----+----+
|2   |3   |1   |
+----+----+----+

We expect to get this:

+----+----+----+
|int1|int2|int3|
+----+----+----+
|1   |2   |3   |
+----+----+----+

=> The connector does not use field names to assign values; it correlates each field's position in the case class with the field at the same position in the table. That is why int3 receives 1, int1 receives 2, and int2 receives 3 in the output above.
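
A possible workaround until the connector handles this (a minimal sketch, not from the thread: it reads without a user-supplied schema and rebuilds the nested column by field name, casting BigQuery's INT64 down to Int to match the case class):

import org.apache.spark.sql.functions.{col, struct}

val fixed = spark.read
  .option("table", "customers_sale.test_table")
  .format("com.google.cloud.spark.bigquery")
  .load()
  .select(
    col("str"),
    struct(
      col("l1.int3").cast("int").as("int3"),
      col("l1.int1").cast("int").as("int1"),
      col("l1.int2").cast("int").as("int2")).as("l1"))
  .as[ClassA]

fixed.map(_.l1).show(false)

Because each nested field is selected by name, the result no longer depends on the positional layout of the table.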

davidrabinowitz (Member) commented

It seems that the data has been written correctly into the table. What happens when you read the table without specifying the schema? Also notice that the order of the l1 fields you've specified is int3, int1, int2, which correlates with the values. Can you please fix the order and try again?

@davidrabinowitz davidrabinowitz self-assigned this Apr 21, 2021
@davidrabinowitz davidrabinowitz added the question Further information is requested label Apr 21, 2021
azaroui (Author) commented Apr 21, 2021

> It seems that the data has been written correctly into the table. What happens when you read the table without specifying the schema? Also notice that the order of the l1 fields you've specified is int3, int1, int2, which correlates with the values. Can you please fix the order and try again?

I purposely changed the order of the attributes to reproduce the problem I had on our project: the connector does not use field names to assign values, but rather their positions.

Without specifying the schema, we get this:

 Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `l1`.`int3` from bigint to int as it may truncate
 The type path of the target object is:
 - field (class: "scala.Int", name: "int3")
 - field (class: "com.carrefour.phenix.transactions.offline.persister.NestedClass", name: "l1")
 - root class: "com.carrefour.phenix.transactions.offline.persister.ClassA"
 You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
 	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2643)
 	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$38$$anonfun$applyOrElse$11.applyOrElse(Analyzer.scala:2659)
 	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$38$$anonfun$applyOrElse$11.applyOrElse(Analyzer.scala:2654)
 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:258)
 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:258)
 	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:257)

I have updated the question to make it more precise.
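
Side note on the error above: BigQuery INT64 surfaces in Spark as LongType, which is why Spark refuses to down-cast to Int. A minimal sketch of one way around it (assuming Long is acceptable in the model, and that Spark resolves nested fields by name when no user schema is supplied) is to declare the fields as Long:

case class NestedClassL(int3: Long, int1: Long, int2: Long)
case class ClassAL(str: String, l1: NestedClassL)

val dsLong = spark.read
  .option("table", "customers_sale.test_table")
  .format("com.google.cloud.spark.bigquery")
  .load()
  .as[ClassAL]  // no user-supplied schema: the table keeps its own layout

dsLong.map(_.l1).show(false)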

emkornfield (Collaborator) commented

If the schema here is a nested struct, then I think we have a gap in the conversion logic:

namesInOrder.stream()
    .map(root::getVector)
    .map(ArrowSchemaConverter::new)
    .toArray(ColumnVector[]::new);

only rearranges top-level columns. I'm not sure whether the right metadata is actually passed down by Spark to correct this problem, though.
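
To illustrate the gap (a conceptual Scala sketch, not the connector's actual Java code; all names are illustrative): a name-based reordering would have to recurse into struct children instead of stopping at the top level:

import org.apache.spark.sql.types.{StructField, StructType}

// Rearrange bqSchema's fields to match userSchema, matching by name
// at every nesting level rather than by position.
def reorderByName(bqSchema: StructType, userSchema: StructType): StructType =
  StructType(userSchema.fields.map { userField =>
    val bqField = bqSchema(userField.name) // look up by name, not position
    (bqField.dataType, userField.dataType) match {
      case (b: StructType, u: StructType) =>
        bqField.copy(dataType = reorderByName(b, u)) // recurse into nested structs
      case _ => bqField
    }
  })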

LaurentValdenaire commented

We indeed have an issue with nested structs.

@davidrabinowitz davidrabinowitz removed the question Further information is requested label May 4, 2021
himanshukohli09 added a commit to himanshukohli09/spark-bigquery-connector that referenced this issue May 13, 2021
… the column order of struct variables need not be same as that of BQ schema
davidrabinowitz pushed a commit that referenced this issue May 13, 2021
Column order of struct variables doesn't need to be the same as that of BigQuery schema
davidrabinowitz (Member) commented

Fixed by PR #391
