REPLACE COLUMNS unsupported? #702

Closed
AFFogarty opened this issue Jun 19, 2021 · 8 comments

Comments

@AFFogarty
Contributor

AFFogarty commented Jun 19, 2021

The Delta Lake 1.0.0 docs contain an example for replacing columns using ALTER TABLE table_name REPLACE COLUMNS.... However, when I try to run this, I'm getting an exception from DeltaCatalog.

ALTER TABLE table_name REPLACE COLUMNS (col_1 string, col_2 double)

throws:

java.lang.UnsupportedOperationException: Unrecognized column change class org.apache.spark.sql.connector.catalog.TableChange$DeleteColumn. You may be running an out of date Delta Lake version.
	at org.apache.spark.sql.delta.catalog.DeltaCatalog.$anonfun$alterTable$9(DeltaCatalog.scala:504)
	at org.apache.spark.sql.delta.catalog.DeltaCatalog.$anonfun$alterTable$9$adapted(DeltaCatalog.scala:476)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at org.apache.spark.sql.delta.catalog.DeltaCatalog.$anonfun$alterTable$2(DeltaCatalog.scala:476)
	at scala.collection.immutable.Map$Map2.foreach(Map.scala:159)
	at org.apache.spark.sql.delta.catalog.DeltaCatalog.alterTable(DeltaCatalog.scala:431)
	at org.apache.spark.sql.delta.catalog.DeltaCatalog.alterTable(DeltaCatalog.scala:57)
	at org.apache.spark.sql.execution.datasources.v2.AlterTableExec.run(AlterTableExec.scala:37)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:40)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:40)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:46)
	at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:615)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:610)
        ...

I took a look at DeltaCatalog and confirmed that alterTable() doesn't handle DeleteColumn.

Is this actually a supported scenario? I took a look through the unit tests and there seem to be no tests covering this.
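
For readers following along, the failure mode can be sketched without Spark or Delta Lake. This is a simplified model, not the actual DeltaCatalog source: `Change`, `AddColumn`, `DeleteColumn`, `Schema`, and `applyChange` are hypothetical stand-ins, and the only assumption taken from the stack trace is that a change kind with no handler ends in an UnsupportedOperationException.

```scala
object AlterTableSketch {
  // Hypothetical stand-ins for Spark's TableChange subclasses (not the real API).
  sealed trait Change
  final case class AddColumn(name: String, dataType: String) extends Change
  final case class DeleteColumn(name: String) extends Change

  // A schema is modelled as an ordered list of (column name, type name) pairs.
  type Schema = Vector[(String, String)]

  // Applies a single change. Only AddColumn has a handler; every other
  // change kind is rejected, similar to the error reported above.
  def applyChange(schema: Schema, change: Change): Schema = change match {
    case AddColumn(name, dataType) => schema :+ (name -> dataType)
    case other =>
      throw new UnsupportedOperationException(
        s"Unrecognized column change ${other.getClass}")
  }

  def main(args: Array[String]): Unit = {
    val schema: Schema = Vector("id" -> "string")
    println(applyChange(schema, AddColumn("col_1", "string")))
    try {
      println(applyChange(schema, DeleteColumn("id")))
    } catch {
      case e: UnsupportedOperationException =>
        println(s"rejected: ${e.getMessage}")
    }
  }
}
```

In this model any statement that produces a delete change fails, no matter which columns are added alongside it.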

@tdas
Contributor

tdas commented Jun 22, 2021

Delta Lake does not support deleting a column. This is an opinionated approach we have taken. We believe that deleting and renaming columns in tables lead to a lot of downstream confusion, and it's easy for folks to shoot themselves in the foot with it - incorrect results, data loss, etc. Hence we do not support it as of now.

@AFFogarty
Contributor Author

AFFogarty commented Jun 22, 2021

> Delta Lake does not support deleting a column. This is an opinionated approach we have taken. We believe that deleting and renaming columns in tables lead to a lot of downstream confusion, and it's easy for folks to shoot themselves in the foot with it - incorrect results, data loss, etc. Hence we do not support it as of now.

Ah, ok. Thanks @tdas for clarifying. I misunderstood the ‘REPLACE COLUMNS’ example in the docs. I thought it was deleting ‘colA’ but it was actually just reordering it.

@AFFogarty
Contributor Author

AFFogarty commented Jun 29, 2021

Apologies if I'm missing something basic, but I'm reopening this because I still haven't gotten REPLACE COLUMNS to work.

I tried creating a unit test that uses REPLACE COLUMNS to add some columns to a table. The test is based on the REPLACE COLUMNS example from the docs.

  ddlTest("REPLACE COLUMNS - simple") {
    withDeltaTable(Seq(("a"), ("b")).toDF("colA")) { tableName =>
      sql(s"ALTER TABLE $tableName REPLACE COLUMNS (colC STRING, colB STRUCT<field2:STRING, nested:STRING, field1:STRING>, colA STRING)")
    }
  }

The test keeps throwing:

org.apache.spark.sql.AnalysisException: Found duplicate column(s) in adding columns: cola
	at org.apache.spark.sql.delta.schema.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:213)
	at org.apache.spark.sql.delta.commands.AlterTableAddColumnsDeltaCommand.$anonfun$run$7(alterDeltaTableCommands.scala:208)
	at com.databricks.spark.util.DatabricksLogging.recordOperation(DatabricksLogging.scala:77)
	at com.databricks.spark.util.DatabricksLogging.recordOperation$(DatabricksLogging.scala:67)
	at org.apache.spark.sql.delta.commands.AlterTableAddColumnsDeltaCommand.recordOperation(alterDeltaTableCommands.scala:163)
	at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation(DeltaLogging.scala:106)
	at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation$(DeltaLogging.scala:91)
	at org.apache.spark.sql.delta.commands.AlterTableAddColumnsDeltaCommand.recordDeltaOperation(alterDeltaTableCommands.scala:163)
	at org.apache.spark.sql.delta.commands.AlterTableAddColumnsDeltaCommand.run(alterDeltaTableCommands.scala:170)
	at org.apache.spark.sql.delta.catalog.DeltaCatalog.$anonfun$alterTable$2(DeltaCatalog.scala:442)
	...

It seems that AlterTableAddColumnsDeltaCommand is being invoked and treats every column in REPLACE COLUMNS (...) as an add, so the duplicate colA name causes checkColumnNameDuplication to throw.
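
A rough sketch of that failure, assuming the duplicate check runs over the existing columns plus the requested ones and compares names case-insensitively (the lower-cased `cola` in the error message suggests this). `duplicateColumns` is a hypothetical helper written for illustration, not Delta's checkColumnNameDuplication.

```scala
import java.util.Locale

object DuplicateCheckSketch {
  // Returns the lower-cased names that occur more than once, in first-seen order.
  def duplicateColumns(names: Seq[String]): Seq[String] = {
    val lowered = names.map(_.toLowerCase(Locale.ROOT))
    lowered.distinct.filter(name => lowered.count(_ == name) > 1)
  }

  def main(args: Array[String]): Unit = {
    val existing = Seq("colA")                  // the table's current schema
    val requested = Seq("colC", "colB", "colA") // the REPLACE COLUMNS list, treated as adds
    val dups = duplicateColumns(existing ++ requested)
    if (dups.nonEmpty) {
      println(s"Found duplicate column(s) in adding columns: ${dups.mkString(", ")}")
    }
  }
}
```

Under this assumption, any REPLACE COLUMNS list that repeats an existing column name would trip the check, even when the intent is only to reorder or extend the schema.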

Does anyone have a concrete example of REPLACE COLUMNS that works correctly in release 1.0.0?

AFFogarty reopened this Jun 29, 2021
@jaceklaskowski
Contributor

Wouldn't what TD said earlier explain the behaviour in 1.0.0?

> Delta Lake does not support deleting a column.

@AFFogarty
Contributor Author

AFFogarty commented Jun 29, 2021

> Wouldn't what TD said earlier explain the behaviour in 1.0.0?
>
> > Delta Lake does not support deleting a column.

Hey @jaceklaskowski , this example isn't deleting colA, it's just adding 2 new columns colB and colC.

@jaceklaskowski
Contributor

I see the following in the code:

withDeltaTable(Seq(("a"), ("b")).toDF("colA")) { tableName =>
  sql(s"ALTER TABLE $tableName REPLACE COLUMNS (colC STRING, colB STRUCT<field2:STRING, nested:STRING, field1:STRING>, colA STRING)")
}

My understanding is that the single-column Delta table (colA) is altered with REPLACE COLUMNS, and the new column list also includes colA, hence the delete before the add (which amounts to replacing the column).


What's very interesting though is that ALTER TABLE $tableName REPLACE COLUMNS ends up as AlterTableAddColumnsDeltaCommand (not AlterTableReplaceColumnsDeltaCommand)?!

@jaceklaskowski
Contributor

Found it! See this comment in Spark SQL itself (before Delta Lake can do anything to alter / augment the behaviour):

// REPLACE COLUMNS deletes all the existing columns and adds new columns specified.
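
To make that comment concrete, here is a minimal model of the translation it describes: one delete per existing column, followed by one add per column in the new list. The types and the `replaceColumns` function are hypothetical and do not use Spark's TableChange API. The sketch only shows why a REPLACE COLUMNS statement that keeps colA still yields a delete for colA.

```scala
object ReplaceColumnsSketch {
  // Hypothetical stand-ins for Spark's TableChange subclasses (not the real API).
  sealed trait Change
  final case class DeleteColumn(name: String) extends Change
  final case class AddColumn(name: String, dataType: String) extends Change

  // Models "deletes all the existing columns and adds new columns specified".
  def replaceColumns(existing: Seq[String], replacement: Seq[(String, String)]): Seq[Change] = {
    val deletes: Seq[Change] = existing.map(name => DeleteColumn(name))
    val adds: Seq[Change] = replacement.map { case (name, dataType) => AddColumn(name, dataType) }
    deletes ++ adds
  }

  def main(args: Array[String]): Unit = {
    val changes = replaceColumns(
      existing = Seq("colA"),
      replacement = Seq("colC" -> "STRING", "colB" -> "STRUCT<...>", "colA" -> "STRING"))
    changes.foreach(println)
  }
}
```

If the catalog then handles the adds but has no handler for the deletes, both errors reported in this thread would follow from the same change list.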

@dennyglee
Contributor

Quick note, we currently have issue #732 to support column drop and rename. Closing this issue for now - thanks!
