Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[1105] Change Data Feed - PR 4 - UPDATE command #1146

Closed

Conversation

scottsand-db
Copy link
Collaborator

See the project plan at #1105.

This PR adds CDF to the UPDATE command, during which we generate both preimage and postimage CDF data.

This PR also adds UpdateCDCSuite which adds basic tests for these CDF changes.

As a high-level overview of how this CDF-update operation is performed, when we find a row that satisfies the update condition, we explode an array containing the pre-image, post-image, and main-table updated rows.

The pre-image and post-image rows are appropriately typed with the corresponding CDF_TYPE, and the main-table updated row has CDF_TYPE null. Thus, the first two rows will be written to the cdf parquet file, with the latter is written to standard main-table data parquet file.

- add CDF logic into UpdateCommand.scala
- add UpdateCDCSuite.scala

GitOrigin-RevId: 97540779859d0c1f3a6207b68b5a929fdc030256
@scottsand-db scottsand-db self-assigned this May 19, 2022
tableName
.map(readDeltaTable(_))
.getOrElse(readDeltaTableByPath(tempPath))
.select("key", "value"),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some new tests write key, value, part (partition), yet we just want to assert on the key and value for the answer.

@scottsand-db scottsand-db requested review from tdas and vkorukanti May 19, 2022 23:45
@scottsand-db scottsand-db removed their assignment May 19, 2022
@scottsand-db scottsand-db self-assigned this May 19, 2022
@scottsand-db scottsand-db added the enhancement New feature or request label May 20, 2022
Copy link
Collaborator

@vkorukanti vkorukanti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

* @param target target we are updating into
* @param updateExpressions the update transformation to perform on the input DataFrame
* @param metrics transaction metrics
* @param dfWithEvaluatedCondition source DataFrame on which we will apply the update condition
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this DataFrame already has CONDITION_COLUMN_NAME? Why does the argument description say "on which we will apply the update condition with an additional column"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll change it to "on which we will apply the update expression". the evaluated condition is there (true or false), but the actual update on the row has not been performed.

target,
updateExpressions,
condition,
targetDf.withColumn(CONDITION_COLUMN_NAME, new Column(condition)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are adding the new column here but dropping the column in withUpdatedColumns. Is it cleaner to just pass the targetDf and let the withUpdatedColumns build on the targetDf?

@vkorukanti vkorukanti added this to the 2.0 milestone Jun 28, 2022
jbguerraz pushed a commit to jbguerraz/delta that referenced this pull request Jul 6, 2022
See the project plan at delta-io#1105.

This PR adds CDF to the UPDATE command, during which we generate both preimage and postimage CDF data.

This PR also adds UpdateCDCSuite which adds basic tests for these CDF changes.

As a high-level overview of how this CDF-update operation is performed, when we find a row that satisfies the update condition, we `explode` an array containing the pre-image, post-image, and main-table updated rows.

The pre-image and post-image rows are appropriately typed with the corresponding CDF_TYPE, and the main-table updated row has CDF_TYPE `null`. Thus, the first two rows will be written to the cdf parquet file, with the latter is written to standard main-table data parquet file.

Closes delta-io#1146

GitOrigin-RevId: 47413c5345bb97c0e1303a7f4d4d06b89c35ab7a
jbguerraz pushed a commit to jbguerraz/delta that referenced this pull request Jul 6, 2022
See the project plan at delta-io#1105.

This PR adds CDF to the UPDATE command, during which we generate both preimage and postimage CDF data.

This PR also adds UpdateCDCSuite which adds basic tests for these CDF changes.

As a high-level overview of how this CDF-update operation is performed, when we find a row that satisfies the update condition, we `explode` an array containing the pre-image, post-image, and main-table updated rows.

The pre-image and post-image rows are appropriately typed with the corresponding CDF_TYPE, and the main-table updated row has CDF_TYPE `null`. Thus, the first two rows will be written to the cdf parquet file, with the latter is written to standard main-table data parquet file.

Closes delta-io#1146

GitOrigin-RevId: 47413c5345bb97c0e1303a7f4d4d06b89c35ab7a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request waiting for merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants