forked from delta-io/delta
update fork #7
Merged
- Most of the diff here is whitespace
- New method `doCommitRetryIteratively` which calls `doCommit`; if `doCommit` fails, we separately call `checkAndRetry` and loop until the commit succeeds or `DELTA_MAX_RETRY_COMMIT_ATTEMPTS` (a new conf) is exceeded
- `doCommit` no longer catches `FileAlreadyExistsException`
- `checkAndRetry` returns the next commit version to retry at instead of calling `doCommit` directly
- Added new error to `DeltaErrors`: `maxCommitRetriesExceededException`
- New test suite `TransactionRetrySuite` that creates a fake log store that just throws an error. It verifies that `DELTA_MAX_RETRY_COMMIT_ATTEMPTS` works as intended.

Author: Pranav Anand <anandpranavv@gmail.com>

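The retry flow described in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the commit: `doCommit` and `checkAndRetry` are passed in as plain functions, `maxRetryAttempts` stands in for the value of the `DELTA_MAX_RETRY_COMMIT_ATTEMPTS` conf, and the `MaxCommitRetriesExceeded` exception class and the exact counting of attempts are assumptions made for the example.

```scala
import java.nio.file.FileAlreadyExistsException
import scala.annotation.tailrec

object CommitRetrySketch {
  // Hypothetical stand-in for the error added to DeltaErrors.
  final case class MaxCommitRetriesExceeded(retries: Int)
    extends RuntimeException(s"Commit still failing after $retries retries")

  /**
   * Calls `doCommit(version)`. On a write conflict it asks `checkAndRetry`
   * for the next version to try and loops, until the commit succeeds or
   * `maxRetryAttempts` retries have been used. Returns the committed version.
   */
  def doCommitRetryIteratively(
      firstVersion: Long,
      maxRetryAttempts: Int,
      doCommit: Long => Unit,
      checkAndRetry: Long => Long): Long = {
    @tailrec
    def loop(version: Long, retriesUsed: Int): Long = {
      // doCommit no longer swallows the conflict, so it is caught here.
      val succeeded =
        try { doCommit(version); true }
        catch { case _: FileAlreadyExistsException => false }

      if (succeeded) version
      else if (retriesUsed >= maxRetryAttempts) throw MaxCommitRetriesExceeded(retriesUsed)
      else loop(checkAndRetry(version), retriesUsed + 1)
    }
    loop(firstVersion, 0)
  }
}
```

Keeping conflict checking (`checkAndRetry`) separate from the write (`doCommit`) is what allows the loop to be iterative rather than the two methods calling each other recursively.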
## What changes were proposed in this pull request?

Implement a new checkpoint format in Delta that stores partition values as parsed columns in the checkpoint. This helps avoid potential casting bugs when performing partition filters.

This new code path will be enabled by default for tables that are upgraded to writer protocol 3. The related code for upgrading the protocol will be introduced in a follow-up PR.

## How was this patch tested?

Unit tests

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

#11864 is resolved by brkyvz/checkpointV2New.

GitOrigin-RevId: ad74d3e3b15efc58ce432e56bd197db4411f6c7b

…maps.

## What changes were proposed in this pull request?

Don't allow NOT NULL constraints inside arrays and maps, since they can't be enforced there. Includes a fallback flag to allow the constraint (but properly unset the nullability flag) since this has been incorrectly allowed for a while.

## How was this patch tested?

New unit tests

Author: Jose Torres <joseph.torres@databricks.com>

#11448 is resolved by jose-torres/notnullarr.

GitOrigin-RevId: 3e84b5b8d7d5cd5e1755fa441d58aaee28ab1677

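As a rough illustration of the kind of check this commit describes, the sketch below walks a schema and reports NOT NULL markers that sit inside an array or map. The data types here are simplified, hypothetical stand-ins for Spark SQL's types (not the real classes), and the path naming (`element`, `key`, `value`) is an assumption made for the example.

```scala
object NestedNotNullSketch {
  // Simplified, hypothetical stand-ins for Spark SQL's data types.
  sealed trait DataType
  case object IntType extends DataType
  final case class ArrayType(element: DataType, containsNull: Boolean) extends DataType
  final case class MapType(key: DataType, value: DataType, valueContainsNull: Boolean)
    extends DataType
  final case class StructType(fields: List[Field]) extends DataType
  final case class Field(name: String, dataType: DataType, nullable: Boolean)

  /** Paths of NOT NULL markers that are nested inside an array or a map. */
  def nestedNotNullPaths(dt: DataType, path: String, inCollection: Boolean): List[String] =
    dt match {
      case ArrayType(element, containsNull) =>
        val elementPath = s"$path.element"
        val here = if (containsNull) Nil else List(elementPath)
        here ++ nestedNotNullPaths(element, elementPath, inCollection = true)
      case MapType(key, value, valueContainsNull) =>
        val valuePath = s"$path.value"
        val here = if (valueContainsNull) Nil else List(valuePath)
        nestedNotNullPaths(key, s"$path.key", inCollection = true) ++
          here ++
          nestedNotNullPaths(value, valuePath, inCollection = true)
      case StructType(fields) =>
        fields.flatMap { f =>
          val fieldPath = if (path.isEmpty) f.name else s"$path.${f.name}"
          // Top-level NOT NULL fields are fine; only nested ones are reported.
          val here = if (inCollection && !f.nullable) List(fieldPath) else Nil
          here ++ nestedNotNullPaths(f.dataType, fieldPath, inCollection)
        }
      case IntType => Nil
    }
}
```

A caller could reject the schema when the returned list is non-empty, or, under the fallback flag mentioned above, accept it after resetting those nullability flags.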
- Pass in operation metrics to `commitLarge` Author: Pranav Anand <anandpranavv@gmail.com> GitOrigin-RevId: 809dbc5d7a7ffc36a656c68756ef17b28863bb76
…ableV2

## What changes were proposed in this pull request?

Leverage DeltaTableV2 as the constructor of DeltaTable, so that we can understand if the table was created through the `forPath` or `forName` code path.

## How was this patch tested?

Existing unit tests

Author: Burak Yavuz <brkyvz@gmail.com>

#12058 is resolved by brkyvz/cloneStats.

GitOrigin-RevId: 9477ae98467ae115363c66a2ffcce886a0436fd6

…traints.

## What changes were proposed in this pull request?

Just includes the check, reporting that the current version can't enforce constraints for forward compatibility. No actual implementation yet.

## How was this patch tested?

New unit tests

Author: Jose Torres <joseph.torres@databricks.com>

#12115 is resolved by jose-torres/invariants2.

GitOrigin-RevId: b0eb9dfb87694e2ab9ad329fbeba36210b246ebe

Small changes in MergeInto tests Author: Tathagata Das <tathagata.das1565@gmail.com> Author: Zach Schuermann <zachary.zvs@gmail.com> GitOrigin-RevId: 8e5a072c2ae93efc582ed0af061b813347b6ced2
…uce upgrade API

## What changes were proposed in this pull request?

Introduces the `DeltaTable.upgradeTableProtocol` method for upgrading a Delta table and also bumps the writer version to 3. With version 3, writers are required to:

- Write the new checkpoint format
- Respect CHECK constraints when writing to Delta tables

The SQL API will be added in a subsequent PR.

## How was this patch tested?

Adds new tests for the checkpoint version and API tests

Author: Burak Yavuz <brkyvz@gmail.com>

#12143 is resolved by brkyvz/protUpgrade.

GitOrigin-RevId: 1cf8504306b2e26e4f1a7038f7de80aeabce932c

## What changes were proposed in this pull request? Remove another unused configuration. ## How was this patch tested? Existing tests Author: Burak Yavuz <brkyvz@gmail.com> #8658 is resolved by brkyvz/chkSize. GitOrigin-RevId: 68665da57b716ea7721cfde03898d0b3687237ea
…elete()

This small PR adds a `@since 0.3.0` annotation to `DeltaMergeMatchedActionBuilder.delete()`, which seems to be the only public API function missing this.

I also noticed that this function is missing a `@Evolving` tag. Is this intentional?

CC: @rapoth @suhsteve @imback82

Closes #493

Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Andrew Fogarty <andrew.f.fogarty@gmail.com>

#12210 is resolved by tdas/8wg7e9nq.

GitOrigin-RevId: e55fd064c923186d349c8c6078a3d5fc4ed375b4

## What changes were proposed in this pull request?

Add enforcement of CHECK constraints specified in table properties. These are merged with column-level invariants extracted from the table schema (including both NOT NULL and some single-column CHECK constraints, although those single-column constraints were never exposed in a public API), and then enforced through the existing invariant checker with some modifications.

This doesn't yet include the API to create CHECK constraints.

## How was this patch tested?

New unit tests

Author: Jose Torres <joseph.torres@databricks.com>

#12111 is resolved by jose-torres/invariantstorage.

GitOrigin-RevId: f9cd927127d6432b4bdd8a67acea32ee00fa73ca

## What changes were proposed in this pull request?

* Change to a protocol version check rather than a hard exception in `requiredMinimumProtocol`, so other metadata changes can be made to a table which has had a constraint added by future versions. (There's still no API to add constraints currently.)
* Unfold `InvariantViolationException`s from their nesting inside `SparkException` trees for usability.
* Comment the codegen column values extraction.

## How was this patch tested?

Existing tests with slight tweaks

Author: Jose Torres <joseph.torres@databricks.com>

#12246 is resolved by jose-torres/constraintfollow.

GitOrigin-RevId: 20b15c79b628b8d0a1a6ddeff87e93875c7da375

## What changes were proposed in this pull request? Now that we have the protocol upgrade, we don't need this flag anymore. ## How was this patch tested? Existing tests Author: Burak Yavuz <brkyvz@gmail.com> #12244 is resolved by brkyvz/enableV2. GitOrigin-RevId: 81022965a3870ae32f168f5c249db1d3c7c5da98
## What changes were proposed in this pull request? There are some unused variables in commitLarge. Clean them up. ## How was this patch tested? Minor refactor, no new tests needed. Author: Burak Yavuz <brkyvz@gmail.com> #12254 is resolved by brkyvz/optClone. GitOrigin-RevId: e5100308abc6c3602a88aa00cded5acb8a3ed915
…atches are unconditionally deleted

```scala
def multipleSourceRowMatchingTargetRowInMergeException(spark: SparkSession): Throwable = {
  new UnsupportedOperationException(
    s"""Cannot perform MERGE as multiple source rows matched and attempted to update the same
       |target row in the Delta table. By SQL semantics of merge, when multiple source rows match
       |on the same target row, the update operation is ambiguous as it is unclear which source
       |should be used to update the matching target row.
       |You can preprocess the source table to eliminate the possibility of multiple matches.
       |Please refer to
       |${generateDocsLink(spark.sparkContext.getConf, "/delta-update.html#upsert-into-a-table-using-merge")}""".stripMargin
  )
}
```

The check for multiple matching rows exists to avoid updating the same target row in the Delta table more than once. So for a delete-only clause (without an update clause), we should not throw this exception.

Closes #434

Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Alan Jin <jinlantao@gmail.com>

#12212 is resolved by tdas/wrhspo2s.

GitOrigin-RevId: 32047e408a7cf2734b23449e9016d405444fba9f

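The reasoning in this commit message can be sketched as a small decision function. This is only an illustration under stated assumptions: the `MatchedClause` types, the `(sourceRowId, targetRowId)` representation of matches, and the choice to exempt exactly one unconditional DELETE clause are all hypothetical and are not taken from the Delta code.

```scala
object MergeMatchCheckSketch {
  // Hypothetical model of the WHEN MATCHED clauses of a MERGE.
  sealed trait MatchedClause
  final case class Update(condition: Option[String]) extends MatchedClause
  final case class Delete(condition: Option[String]) extends MatchedClause

  /** True when the only matched clause is a single unconditional DELETE. */
  def isOnlyUnconditionalDelete(clauses: Seq[MatchedClause]): Boolean = clauses match {
    case Seq(Delete(None)) => true
    case _ => false
  }

  /**
   * `matches` holds (sourceRowId, targetRowId) pairs. The merge should fail
   * only if some target row is matched by several source rows AND the result
   * would be ambiguous, which is not the case when every match is deleted.
   */
  def shouldFail(matches: Seq[(Int, Int)], clauses: Seq[MatchedClause]): Boolean = {
    val multipleMatches = matches.groupBy(_._2).exists { case (_, ms) => ms.size > 1 }
    multipleMatches && !isOnlyUnconditionalDelete(clauses)
  }
}
```

Deleting a target row twice yields the same result as deleting it once, which is why the ambiguity argument for updates does not carry over.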
changes in MergeIntoSuiteBase Author: Maryann Xue <maryann.xue@gmail.com> GitOrigin-RevId: a1ec6d604ee9ad0d38ea07d1e7db8eaadee7ba7b
Small fix in DeltaCommand and Snapshot Author: Yuanjian Li <yuanjian.li@databricks.com>
Removing unused imports from the Python files to keep everything nice and tidy. Cleaning up of the imports that aren't used, and suppressing the imports that are used as references to other modules, preserving backward compatibility. Authored-by: HyukjinKwon <gurwls223@apache.org>
…es}ErrorMessage

Resolves: https://databricks.atlassian.net/browse/SC-46515

For the errors `DeltaSourceIgnoreChangesErrorMessage` and `DeltaSourceIgnoreDeleteErrorMessage` we add to the exception message:

- the RemoveFile Action path that caused the exception
- the commit version

The 2 error messages now look like this:

```
s"Detected deleted data (for example, file $removedFile) from streaming source at " +
  s"version $version. This is currently not supported. If you'd like to ignore deletes, " +
  "set the option 'ignoreDeletes' to 'true'."
```

and

```
s"Detected a data update (for example, file $removedFile) in the source table at version " +
  s"$version. This is currently not supported. If you'd like to ignore updates, set the " +
  "option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, " +
  "please restart this query with a fresh checkpoint directory."
```

Inside of `DeltaSourceSuite` added 2 unit tests:

- `SC-46515: deltaSourceIgnoreDeleteError contains removeFile, version`. This covers the scenario where there is an AddFile action & RemoveFile action & `ignoreChanges == false`.
- `SC-46515: deltaSourceIgnoreChangesError contains removeFile, version`. This covers the scenario where there is NO AddFile action & is a RemoveFile action & `ignoreDeletes == false`.

Author: Scott Sandre <scott.sandre@databricks.com>

GitOrigin-RevId: de403937f68bc299b362ab8f9d2f5617c7be12c8

…ng source

## What changes were proposed in this pull request?

Currently, when we look up the version for `startingTimestamp`, we search for the latest commit that's **before or at** the timestamp. However, `startingTimestamp` should return all changes happening **at or after** the timestamp. So we should search for the earliest commit that's at or after the timestamp.

Another change is that we no longer require `startingVersion`/`startingTimestamp` to point to a recreatable version. If the JSON file exists, we should be able to read changes.

## How was this patch tested?

Refactoring the existing tests for `startingVersion` and `startingTimestamp` to cover more edge cases.

## **IMPORTANT** Warmfix instructions

- [X] Other (please describe): Fix an incorrect behavior of a new feature

Author: Shixiong Zhu <zsxwing@gmail.com>

#12333 is resolved by zsxwing/SC-47216.

GitOrigin-RevId: 7918cc10365c95d79f544beb3670f8d7121a79f3

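The difference between the old and new lookup for `startingTimestamp` can be illustrated with a small sketch. It assumes a simplified, hypothetical `Commit(version, timestampMs)` record and a commit list sorted by version with non-decreasing timestamps; it is not the Delta history-lookup code.

```scala
object StartingTimestampSketch {
  // Hypothetical, simplified view of a commit in the Delta log.
  final case class Commit(version: Long, timestampMs: Long)

  /** Old behavior: latest commit that is before or at `ts`. */
  def latestAtOrBefore(commits: Seq[Commit], ts: Long): Option[Long] =
    commits.takeWhile(_.timestampMs <= ts).lastOption.map(_.version)

  /** New behavior: earliest commit that is at or after `ts`. */
  def earliestAtOrAfter(commits: Seq[Commit], ts: Long): Option[Long] =
    commits.find(_.timestampMs >= ts).map(_.version)
}
```

For commits at timestamps 100, 200 and 300 (versions 0, 1, 2) and a starting timestamp of 150, the old lookup picks version 0, which would replay changes made before the requested time, while the new lookup picks version 1.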
…sAsJson field is not written

We were creating the `stats_parsed` column in V2 checkpoints from the "stats" column in AddFile. However, V2 checkpoints written with `writeStatsAsJson = false` do not contain the "stats" column, so this information is no longer available. This PR fixes the checkpoint write code to leverage the "stats_parsed" field from the previous checkpoint when writing the new checkpoint.

GitOrigin-RevId: 0aeccb076e899c6d8b8399dfdb2cb210ec657983

…n and writeStatsAsStruct for the checkpoint format

We introduce the table properties `delta.checkpoint.writeStatsAsJson` and `delta.checkpoint.writeStatsAsStruct` to decide what to include as part of the checkpoint data. We used to treat the protocol version as a requirement for writing the new checkpoint columns. However, it is a better design to use these table properties instead and to have the protocol upgrade act as a way to enforce the selection of these table properties.

We also introduce a SQL conf so that users can test whether they want to opt in to the new format, instead of having to make a table property change that can cause transaction conflicts.

Author: Burak Yavuz <brkyvz@gmail.com>

GitOrigin-RevId: 0d7873424ce5aa7c351e3f9a2ccead47118cef37

- added new `DeltaSQLConf` value `DELTA_WRITE_CHECKSUM_ENABLED`
- if that conf is set to false when `Checksum.scala::writeChecksumFile` is called, then we return right away

GitOrigin-RevId: 5849a60ebf3a5ddbf2a145b9f070e86713be950e

## What changes were proposed in this pull request?

Add a "startingVersion" = "latest" option in Delta. We translate "latest" to the version immediately after the most recently committed version, that is, the latest committed version plus one.

## How was this patch tested?

New unit tests

Author: Jose Torres <joseph.torres@databricks.com>

#12474 is resolved by jose-torres/deltalatest.

GitOrigin-RevId: 1bf158ac3f3376478f955fb0cec63079732bf218

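The translation of the "latest" value can be sketched as below. The `resolve` helper and its signature are hypothetical and only illustrate the arithmetic described in the commit message; treating any other value as an explicit version number is an assumption.

```scala
object StartingVersionSketch {
  /**
   * Resolves the `startingVersion` option to a concrete version.
   * "latest" maps to the version right after the most recent commit, so a
   * stream started this way only sees changes committed from then on.
   */
  def resolve(startingVersion: String, latestCommittedVersion: Long): Long =
    if (startingVersion == "latest") latestCommittedVersion + 1
    else startingVersion.toLong
}
```

With a table whose most recent commit is version 41, `resolve("latest", 41L)` yields 42, a version that does not exist yet, so nothing already in the table is replayed.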
…streaming source

## What changes were proposed in this pull request?

Add a Trigger.Once test that runs a second batch for Delta streaming source.

Author: Shixiong Zhu <zsxwing@gmail.com>

#12471 is resolved by zsxwing/SC-44942.

GitOrigin-RevId: aca0d22f6e2ef2d96a3c9f3b63f4a5930af505d0

No description provided.