-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[1105] Change Data Feed - PR 4 - UPDATE command #1146
Conversation
- add CDF logic into UpdateCommand.scala - add UpdateCDCSuite.scala GitOrigin-RevId: 97540779859d0c1f3a6207b68b5a929fdc030256
tableName | ||
.map(readDeltaTable(_)) | ||
.getOrElse(readDeltaTableByPath(tempPath)) | ||
.select("key", "value"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some new tests write key
, value
, part
(partition), yet we just want to assert on the key and value for the answer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* @param target target we are updating into | ||
* @param updateExpressions the update transformation to perform on the input DataFrame | ||
* @param metrics transaction metrics | ||
* @param dfWithEvaluatedCondition source DataFrame on which we will apply the update condition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this DataFrame already has CONDITION_COLUMN_NAME
? Why does the argument description say "on which we will apply the update condition with an additional column"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'll change it to "on which we will apply the update expression". the evaluated condition is there (true or false), but the actual update on the row has not been performed.
target, | ||
updateExpressions, | ||
condition, | ||
targetDf.withColumn(CONDITION_COLUMN_NAME, new Column(condition)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are adding the new column here but dropping the column in withUpdatedColumns
. Is it cleaner to just pass the targetDf and let the withUpdatedColumns
build on the targetDf
?
See the project plan at delta-io#1105. This PR adds CDF to the UPDATE command, during which we generate both preimage and postimage CDF data. This PR also adds UpdateCDCSuite which adds basic tests for these CDF changes. As a high-level overview of how this CDF-update operation is performed, when we find a row that satisfies the update condition, we `explode` an array containing the pre-image, post-image, and main-table updated rows. The pre-image and post-image rows are appropriately typed with the corresponding CDF_TYPE, and the main-table updated row has CDF_TYPE `null`. Thus, the first two rows will be written to the cdf parquet file, with the latter is written to standard main-table data parquet file. Closes delta-io#1146 GitOrigin-RevId: 47413c5345bb97c0e1303a7f4d4d06b89c35ab7a
See the project plan at delta-io#1105. This PR adds CDF to the UPDATE command, during which we generate both preimage and postimage CDF data. This PR also adds UpdateCDCSuite which adds basic tests for these CDF changes. As a high-level overview of how this CDF-update operation is performed, when we find a row that satisfies the update condition, we `explode` an array containing the pre-image, post-image, and main-table updated rows. The pre-image and post-image rows are appropriately typed with the corresponding CDF_TYPE, and the main-table updated row has CDF_TYPE `null`. Thus, the first two rows will be written to the cdf parquet file, with the latter is written to standard main-table data parquet file. Closes delta-io#1146 GitOrigin-RevId: 47413c5345bb97c0e1303a7f4d4d06b89c35ab7a
See the project plan at #1105.
This PR adds CDF to the UPDATE command, during which we generate both preimage and postimage CDF data.
This PR also adds UpdateCDCSuite which adds basic tests for these CDF changes.
As a high-level overview of how this CDF-update operation is performed, when we find a row that satisfies the update condition, we
explode
an array containing the pre-image, post-image, and main-table updated rows.The pre-image and post-image rows are appropriately typed with the corresponding CDF_TYPE, and the main-table updated row has CDF_TYPE
null
. Thus, the first two rows will be written to the cdf parquet file, with the latter is written to standard main-table data parquet file.