update fork #7

Merged
merged 26 commits into from
Sep 24, 2020

Conversation

JassAbidi
Owner

No description provided.

pranavanand and others added 26 commits September 17, 2020 10:49
 - Most of the diff here is whitespace
 - New method `doCommitRetryIteratively`, which calls `doCommit`. If `doCommit` fails, it calls `checkAndRetry` separately and loops until the commit succeeds or `DELTA_MAX_RETRY_COMMIT_ATTEMPTS` (a new conf) is exceeded
 - `doCommit` no longer catches `FileAlreadyExistsException`
 - `checkAndRetry` returns the next commit version to retry at instead of calling `doCommit` directly
 - Added new error to `DeltaErrors` - `maxCommitRetriesExceededException`

 - New test suite `TransactionRetrySuite` that creates a fake log store that just throws an error. It verifies that `DELTA_MAX_RETRY_COMMIT_ATTEMPTS` works as intended.
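
The retry loop described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function signatures, the `MaxCommitRetriesExceeded` class, and the use of `java.nio.file.FileAlreadyExistsException` as the conflict signal are assumptions made for the example.

```scala
import java.nio.file.FileAlreadyExistsException

object CommitRetrySketch {
  // Hypothetical stand-in for the error added to DeltaErrors.
  final class MaxCommitRetriesExceeded(attempts: Int)
    extends RuntimeException(s"Commit failed after $attempts retries")

  /**
   * Calls `doCommit(version)`. On a version conflict, asks `checkAndRetry`
   * for the next version to try and loops. Allows the initial attempt plus
   * at most `maxRetryAttempts` retries. Returns the committed version.
   */
  def doCommitRetryIteratively(
      startVersion: Long,
      maxRetryAttempts: Int,
      doCommit: Long => Unit,
      checkAndRetry: Long => Long): Long = {
    var version = startVersion
    var attempts = 0
    var committed = false
    while (!committed) {
      try {
        doCommit(version)
        committed = true
      } catch {
        case _: FileAlreadyExistsException =>
          // doCommit no longer swallows the conflict, so it is handled here.
          if (attempts >= maxRetryAttempts) throw new MaxCommitRetriesExceeded(attempts)
          attempts += 1
          version = checkAndRetry(version)
      }
    }
    version
  }
}
```

Keeping the loop outside `doCommit` avoids the nested `doCommit` -> `checkAndRetry` -> `doCommit` call chain, so the stack depth does not grow with the number of retries.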

Author: Pranav Anand <anandpranavv@gmail.com>
## What changes were proposed in this pull request?

Implement a new checkpoint format in Delta that stores partition values as parsed columns. This helps avoid potential casting bugs when applying partition filters. The new code path will be enabled by default for tables that are upgraded to writer protocol 3. The related code for upgrading the protocol will be introduced in a follow-up PR.

## How was this patch tested?

Unit tests

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

#11864 is resolved by brkyvz/checkpointV2New.

GitOrigin-RevId: ad74d3e3b15efc58ce432e56bd197db4411f6c7b
…maps.

## What changes were proposed in this pull request?

Don't allow NOT NULL constraints inside arrays and maps, since they can't be enforced there.

Includes a fallback flag to allow the constraint (but properly unset the nullability flag) since this has been incorrectly allowed for a while.
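
To illustrate the kind of check involved, here is a small sketch over a toy schema model. The types below are hypothetical stand-ins, not Spark's `StructType` classes, and the sketch assumes that "NOT NULL inside arrays and maps" means a non-nullable struct field nested under an array element, map key, or map value.

```scala
object NotNullCheckSketch {
  // Toy schema model, defined only for this example.
  sealed trait DataType
  case object IntType extends DataType
  case object StringType extends DataType
  final case class ArrayType(element: DataType) extends DataType
  final case class MapType(key: DataType, value: DataType) extends DataType
  final case class StructType(fields: Seq[Field]) extends DataType
  final case class Field(name: String, dataType: DataType, nullable: Boolean)

  /** Returns the paths of NOT NULL fields that sit inside an array or map. */
  def notNullInsideCollections(schema: StructType): Seq[String] = {
    def visit(dt: DataType, path: String, inCollection: Boolean): Seq[String] = dt match {
      case StructType(fields) =>
        fields.flatMap { f =>
          val fieldPath = if (path.isEmpty) f.name else s"$path.${f.name}"
          val here = if (inCollection && !f.nullable) Seq(fieldPath) else Seq.empty
          here ++ visit(f.dataType, fieldPath, inCollection)
        }
      case ArrayType(element) =>
        visit(element, s"$path.element", inCollection = true)
      case MapType(key, value) =>
        visit(key, s"$path.key", inCollection = true) ++
          visit(value, s"$path.value", inCollection = true)
      case _ => Seq.empty
    }
    visit(schema, "", inCollection = false)
  }
}
```

A caller could reject the schema when the returned list is non-empty, or, under the fallback flag, accept it after resetting `nullable` on those fields.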

## How was this patch tested?

new unit tests

Author: Jose Torres <joseph.torres@databricks.com>

#11448 is resolved by jose-torres/notnullarr.

GitOrigin-RevId: 3e84b5b8d7d5cd5e1755fa441d58aaee28ab1677
 - Pass in operation metrics to `commitLarge`

Author: Pranav Anand <anandpranavv@gmail.com>

GitOrigin-RevId: 809dbc5d7a7ffc36a656c68756ef17b28863bb76
…ableV2

## What changes were proposed in this pull request?

Leverage DeltaTableV2 as the constructor of DeltaTable, so that we can understand if the table was created through the `forPath` or `forName` code path.

## How was this patch tested?

Existing unit tests

Author: Burak Yavuz <brkyvz@gmail.com>

#12058 is resolved by brkyvz/cloneStats.

GitOrigin-RevId: 9477ae98467ae115363c66a2ffcce886a0436fd6
…traints.

## What changes were proposed in this pull request?

Includes only the check, which reports that the current version can't enforce constraints, for forward compatibility. There is no actual implementation yet.

## How was this patch tested?

new unit tests

Author: Jose Torres <joseph.torres@databricks.com>

#12115 is resolved by jose-torres/invariants2.

GitOrigin-RevId: b0eb9dfb87694e2ab9ad329fbeba36210b246ebe
Small changes in MergeInto tests

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Zach Schuermann <zachary.zvs@gmail.com>

GitOrigin-RevId: 8e5a072c2ae93efc582ed0af061b813347b6ced2
…uce upgrade API

## What changes were proposed in this pull request?

Introduces the `DeltaTable.upgradeTableProtocol` method for upgrading a Delta table and also bumps the writer version to version 3. With version 3, writers are required to:
 - Write the new checkpoint format
 - Respect Check Constraints when writing to Delta tables

SQL API will be added in a subsequent PR.

## How was this patch tested?

Adds new tests for the checkpoint version and API tests

Author: Burak Yavuz <brkyvz@gmail.com>

#12143 is resolved by brkyvz/protUpgrade.

GitOrigin-RevId: 1cf8504306b2e26e4f1a7038f7de80aeabce932c
## What changes were proposed in this pull request?

Remove another unused configuration.

## How was this patch tested?

Existing tests

Author: Burak Yavuz <brkyvz@gmail.com>

#8658 is resolved by brkyvz/chkSize.

GitOrigin-RevId: 68665da57b716ea7721cfde03898d0b3687237ea
…elete()

This small PR adds a `@since 0.3.0` annotation to `DeltaMergeMatchedActionBuilder.delete()`, which seems to be the only public API function missing this.

I also noticed that this function is missing a `@Evolving` tag. Is this intentional?

CC: @rapoth @suhsteve @imback82

Closes #493

Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Andrew Fogarty <andrew.f.fogarty@gmail.com>

#12210 is resolved by tdas/8wg7e9nq.

GitOrigin-RevId: e55fd064c923186d349c8c6078a3d5fc4ed375b4
## What changes were proposed in this pull request?

Add enforcement of CHECK constraints specified in table properties. These are merged with column-level invariants extracted from the table schema (including both NOT NULL and some single-column CHECK constraints, although those single-column constraints were never exposed in a public API), and then enforced through the existing invariant checker with some modifications.

This doesn't yet include the API to create CHECK constraints.

## How was this patch tested?

new unit tests

Author: Jose Torres <joseph.torres@databricks.com>

#12111 is resolved by jose-torres/invariantstorage.

GitOrigin-RevId: f9cd927127d6432b4bdd8a67acea32ee00fa73ca
## What changes were proposed in this pull request?

* Change to a protocol version check rather than a hard exception in requiredMinimumProtocol, so other metadata changes can be made to a table which has had a constraint added by future versions. (There's still no API to add constraints currently.)

* Unfold InvariantViolationExceptions from their nesting inside SparkException trees for usability.

* Comment the codegen column values extraction.
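
The unfolding step can be pictured as a walk down the exception cause chain. The sketch below uses hypothetical stand-in classes (`WrapperException`, `InvariantViolation`) rather than the real `SparkException` and `InvariantViolationException`, and it is not the code from the commit.

```scala
import scala.annotation.tailrec

object UnwrapSketch {
  // Hypothetical stand-ins for the Spark wrapper and the Delta exception.
  class WrapperException(msg: String, cause: Throwable) extends RuntimeException(msg, cause)
  class InvariantViolation(msg: String) extends RuntimeException(msg)

  /** Walks the cause chain and returns the first InvariantViolation, if any. */
  @tailrec
  def findInvariantViolation(t: Throwable): Option[InvariantViolation] = t match {
    case null => None
    case violation: InvariantViolation => Some(violation)
    case other => findInvariantViolation(other.getCause)
  }
}
```

A caller would rethrow the found violation directly and fall back to the original exception when nothing is found.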

## How was this patch tested?

Existing tests with slight tweaks

Author: Jose Torres <joseph.torres@databricks.com>

#12246 is resolved by jose-torres/constraintfollow.

GitOrigin-RevId: 20b15c79b628b8d0a1a6ddeff87e93875c7da375
## What changes were proposed in this pull request?

Now that we have the protocol upgrade, we don't need this flag anymore.

## How was this patch tested?

Existing tests

Author: Burak Yavuz <brkyvz@gmail.com>

#12244 is resolved by brkyvz/enableV2.

GitOrigin-RevId: 81022965a3870ae32f168f5c249db1d3c7c5da98
## What changes were proposed in this pull request?

There are some unused variables in commitLarge. Clean them up.

## How was this patch tested?

Minor refactor, no new tests needed.

Author: Burak Yavuz <brkyvz@gmail.com>

#12254 is resolved by brkyvz/optClone.

GitOrigin-RevId: e5100308abc6c3602a88aa00cded5acb8a3ed915
…atches are unconditionally deleted

```scala
  def multipleSourceRowMatchingTargetRowInMergeException(spark: SparkSession): Throwable = {
    new UnsupportedOperationException(
      s"""Cannot perform MERGE as multiple source rows matched and attempted to update the same
         |target row in the Delta table. By SQL semantics of merge, when multiple source rows match
         |on the same target row, the update operation is ambiguous as it is unclear which source
         |should be used to update the matching target row.
         |You can preprocess the source table to eliminate the possibility of multiple matches.
         |Please refer to
         |${generateDocsLink(spark.sparkContext.getConf,
        "/delta-update.html#upsert-into-a-table-using-merge")}""".stripMargin
    )
  }
```

The check for multiple matching source rows exists to avoid ambiguous updates to the same target row in the Delta table. So when the merge has only a delete clause (without an update clause), we should not throw this exception.
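
A minimal sketch of that decision is shown below. The clause types are hypothetical, defined only for this example, and the sketch assumes the exemption applies only when the single matched clause is a delete with no condition.

```scala
object MergeMatchCheckSketch {
  // Hypothetical model of the WHEN MATCHED clauses of a MERGE.
  sealed trait MatchedClause { def condition: Option[String] }
  final case class Update(condition: Option[String]) extends MatchedClause
  final case class Delete(condition: Option[String]) extends MatchedClause

  /**
   * Decides whether the multiple-match error should be raised.
   * Deleting a target row several times is not ambiguous, so a lone
   * unconditional delete clause is exempt from the check.
   */
  def shouldFailOnMultipleMatches(
      hasMultipleMatches: Boolean,
      matchedClauses: Seq[MatchedClause]): Boolean = {
    val onlyUnconditionalDelete = matchedClauses match {
      case Seq(Delete(None)) => true
      case _ => false
    }
    hasMultipleMatches && !onlyUnconditionalDelete
  }
}
```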

Closes #434

Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Alan Jin <jinlantao@gmail.com>

#12212 is resolved by tdas/wrhspo2s.

GitOrigin-RevId: 32047e408a7cf2734b23449e9016d405444fba9f
…urce (#12269)

Resolves #505

Signed-off-by: Scott Sandre <scott.sandre@databricks.com>

Closes #506

Signed-off-by: Scott Sandre <scott.sandre@databricks.com>

GitOrigin-RevId: 62b4353ef9e8e4f2e52f5f73349bc801d3474693
changes in MergeIntoSuiteBase

Author: Maryann Xue <maryann.xue@gmail.com>

GitOrigin-RevId: a1ec6d604ee9ad0d38ea07d1e7db8eaadee7ba7b
Small fix in DeltaCommand and Snapshot

Author: Yuanjian Li <yuanjian.li@databricks.com>
Removing unused imports from the Python files to keep everything nice and tidy.

Cleaning up of the imports that aren't used, and suppressing the imports that are used as references to other modules, preserving backward compatibility.

Authored-by: HyukjinKwon <gurwls223@apache.org>
…es}ErrorMessage

Resolves: https://databricks.atlassian.net/browse/SC-46515

For the errors `DeltaSourceIgnoreChangesErrorMessage` and `DeltaSourceIgnoreDeleteErrorMessage` we add to the exception message:
- the RemoveFile Action path that caused the exception
- the commit version

The 2 error messages now look like this:
```
s"Detected deleted data (for example, file $removedFile) from streaming source at " +
        s"version $version. This is currently not supported. If you'd like to ignore deletes, " +
        "set the option 'ignoreDeletes' to 'true'."
```
and
```
s"Detected a data update (for example, file $removedFile) in the source table at version " +
        s"$version. This is currently not supported. If you'd like to ignore updates, set the " +
        "option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, " +
        "please restart this query with a fresh checkpoint directory."
```

Inside of `DeltaSourceSuite` added 2 unit tests:
- `SC-46515: deltaSourceIgnoreDeleteError contains removeFile, version`. This covers the scenario where there is NO AddFile action, there is a RemoveFile action, and `ignoreDeletes == false`.
- `SC-46515: deltaSourceIgnoreChangesError contains removeFile, version`. This covers the scenario where there is an AddFile action and a RemoveFile action, and `ignoreChanges == false`.
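
For orientation, the per-commit check behind these two errors might look like the sketch below. The action types and the function are hypothetical, and the sketch assumes that each error is tied to the option named in its message: the delete error to `ignoreDeletes`, the update error to `ignoreChanges`.

```scala
object SourceChangeCheckSketch {
  // Hypothetical, simplified versions of the Delta log actions.
  sealed trait Action { def path: String }
  final case class AddFile(path: String) extends Action
  final case class RemoveFile(path: String) extends Action

  /** Returns an error message for the commit, or None if it can be streamed. */
  def verifyCommit(
      version: Long,
      actions: Seq[Action],
      ignoreDeletes: Boolean,
      ignoreChanges: Boolean): Option[String] = {
    val sawAdd = actions.exists(_.isInstanceOf[AddFile])
    // The first removed file is reported as the example in the message.
    val firstRemoved = actions.collectFirst { case r: RemoveFile => r.path }
    firstRemoved.flatMap { removedFile =>
      if (sawAdd && !ignoreChanges) {
        Some(s"Detected a data update (for example, file $removedFile) " +
          s"in the source table at version $version.")
      } else if (!sawAdd && !ignoreDeletes) {
        Some(s"Detected deleted data (for example, file $removedFile) " +
          s"from streaming source at version $version.")
      } else {
        None
      }
    }
  }
}
```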

Author: Scott Sandre <scott.sandre@databricks.com>

GitOrigin-RevId: de403937f68bc299b362ab8f9d2f5617c7be12c8
…ng source

## What changes were proposed in this pull request?

Currently, when we look up the version for `startingTimestamp`, we search for the latest commit that's **before or at** the timestamp. However, `startingTimestamp` should return all changes happening **at or after** the timestamp. So we should search for the earliest commit that's at or after the timestamp.

Another change is that we no longer require `startingVersion/startingTimestamp` to point to a recreatable version. If the json file exists, we should be able to read changes.
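
The corrected lookup can be sketched as follows. `Commit` and `startingVersionFor` are hypothetical names for this example, and the sketch assumes the commits are sorted by version with non-decreasing timestamps. It returns `None` when every commit is older than the timestamp, which is a choice made for the example, not necessarily what the source does.

```scala
object StartingTimestampSketch {
  // Hypothetical view of a commit: its version and commit timestamp (ms).
  final case class Commit(version: Long, timestamp: Long)

  /**
   * Returns the version of the earliest commit at or after `timestamp`,
   * so that all changes happening at or after the timestamp are read.
   */
  def startingVersionFor(commits: Seq[Commit], timestamp: Long): Option[Long] =
    commits.find(_.timestamp >= timestamp).map(_.version)
}
```

With commits at timestamps 100, 200 and 300 (versions 0, 1 and 2), a `startingTimestamp` of 150 resolves to version 1. The previous "before or at" lookup would have picked version 0 and replayed changes that happened before the requested time.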

## How was this patch tested?

Refactoring the existing tests for `startingVersion` and `startingTimestamp` to cover more edge cases.

## **IMPORTANT** Warmfix instructions

- [X] Other (please describe): Fix an incorrect behavior of a new feature

Author: Shixiong Zhu <zsxwing@gmail.com>

#12333 is resolved by zsxwing/SC-47216.

GitOrigin-RevId: 7918cc10365c95d79f544beb3670f8d7121a79f3
…sAsJson field is not written

We were creating the stats_parsed column in V2 checkpoints from the "stats" column in AddFile. However, V2 checkpoints written with `writeStatsAsJson = false` do not contain the "stats" column, so this information was no longer available. This PR fixes the checkpoint write code to use the "stats_parsed" field from the previous checkpoint when writing the new checkpoint.

GitOrigin-RevId: 0aeccb076e899c6d8b8399dfdb2cb210ec657983
…n and writeStatsAsStruct for the checkpoint format

We introduce the table properties `delta.checkpoint.writeStatsAsJson` and `delta.checkpoint.writeStatsAsStruct` to decide what to include in the checkpoint data. We used to treat the protocol version as the requirement for writing the new checkpoint columns. However, it is a better design to use these table properties and make the protocol upgrade a way to enforce their selection.

We also introduce a SQL conf for users to test whether they would want to opt-in to the new format or not, instead of having to make a table property change that can cause transaction conflicts.
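
As a rough illustration of how the two properties could drive the checkpoint contents, consider the sketch below. The function is hypothetical, and the defaults used here (JSON stats on, struct stats off) are assumptions for the example, not values stated in this PR.

```scala
object CheckpointStatsSketch {
  val WriteStatsAsJson = "delta.checkpoint.writeStatsAsJson"
  val WriteStatsAsStruct = "delta.checkpoint.writeStatsAsStruct"

  /** Returns the stats columns a checkpoint would carry for these table properties. */
  def statsColumns(tableProperties: Map[String, String]): Seq[String] = {
    // Reads a boolean property, falling back to an assumed default.
    def flag(key: String, default: Boolean): Boolean =
      tableProperties.get(key).map(_.toBoolean).getOrElse(default)

    val json = if (flag(WriteStatsAsJson, default = true)) Seq("stats") else Seq.empty
    val struct = if (flag(WriteStatsAsStruct, default = false)) Seq("stats_parsed") else Seq.empty
    json ++ struct
  }
}
```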

Author: Burak Yavuz <brkyvz@gmail.com>

GitOrigin-RevId: 0d7873424ce5aa7c351e3f9a2ccead47118cef37
- added new `DeltaSQLConf` value `DELTA_WRITE_CHECKSUM_ENABLED`
- if that conf is set to false when `Checksum.scala::writeChecksumFile` is called, then we return right away

GitOrigin-RevId: 5849a60ebf3a5ddbf2a145b9f070e86713be950e
## What changes were proposed in this pull request?

Add a "startingVersion" = "latest" option in Delta. We translate "latest" to the version immediately after the most recently committed version, that is, the latest committed version plus 1.
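
The translation amounts to the following sketch. `resolve` is a hypothetical helper, and matching the literal string "latest" exactly is an assumption of the example.

```scala
object StartingVersionSketch {
  /**
   * Resolves the "startingVersion" option to a concrete version.
   * "latest" maps to the version after the most recent commit, so the
   * stream reads only changes committed after it starts.
   */
  def resolve(startingVersion: String, latestCommittedVersion: Long): Long =
    if (startingVersion == "latest") latestCommittedVersion + 1
    else startingVersion.toLong
}
```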

## How was this patch tested?

new unit tests

Author: Jose Torres <joseph.torres@databricks.com>

#12474 is resolved by jose-torres/deltalatest.

GitOrigin-RevId: 1bf158ac3f3376478f955fb0cec63079732bf218
…streaming source

## What changes were proposed in this pull request?

Add a Trigger.Once test that runs a second batch for Delta streaming source.

Author: Shixiong Zhu <zsxwing@gmail.com>

#12471 is resolved by zsxwing/SC-44942.

GitOrigin-RevId: aca0d22f6e2ef2d96a3c9f3b63f4a5930af505d0
@JassAbidi JassAbidi merged commit 4289d09 into JassAbidi:master Sep 24, 2020