update fork branch #6

Merged
merged 34 commits into JassAbidi:master on Sep 3, 2020
Conversation

JassAbidi
Owner

No description provided.

zsxwing and others added 30 commits June 16, 2020 17:54
Add a usage log when we detect an inconsistent list. We need this usage log to set up an alert job for this.

Jenkins

Author: Shixiong Zhu <zsxwing@gmail.com>

GitOrigin-RevId: 7b2a8ed5b56eacff1de820e83714a23d8f057000
- Added support for the AdmissionControl trait

new unit test

Author: Tathagata Das <tathagata.das1565@gmail.com>

GitOrigin-RevId: 4d96f3bc4c90a7aedf5594527da88c23f3954483
## What changes were proposed in this pull request?

- as the title says

## How was this patch tested?
all tests

Author: Tathagata Das <tathagata.das1565@gmail.com>

#10658 is resolved by tdas/SC-37900.

GitOrigin-RevId: 8a8994dfd7979185294697bfdd65642a7a230653
…hon/scala

- Modified old tests to work with Spark 3.0 by adding the right spark conf
- Added a new test to test SQL on metastore tables and paths
- Updated integration testing script to run the examples in a fine-grained manner.

Originally done by @rahulsmahadev in #426

Closes #451

Co-authored-by: Rahul Mahadev <rahul.mahadev@databricks.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

Author: Tathagata Das <tathagata.das1565@gmail.com>

#10659 is resolved by tdas/qfwc2lq2.

GitOrigin-RevId: 4247f17fc78e741b01f5ee47cc8b96f776f4b250
 - Move `streamWrite` method to `DeltaCommand` so that it can be used elsewhere as a utility

Author: Pranav Anand <anandpranavv@gmail.com>
Author: Burak Yavuz <brkyvz@gmail.com>

#9806 is resolved by pranavanand/cloneDelta.

GitOrigin-RevId: 68179f9f08b619ade564eb727cc0ee5ba1960a5e
As the title says

Author: Ali Afroozeh <ali.afroozeh@databricks.com>

GitOrigin-RevId: 390f0ba30287117256d441b23ea2c0cf472c90fa
Follow-up to #413

Closes #455

Signed-off-by: Burak Yavuz <brkyvz@gmail.com>

Author: Jacek Laskowski <jacek@japila.pl>

#10720 is resolved by brkyvz/s4v1b75w.

GitOrigin-RevId: 29491b7c64f0ee773bdee5c6721a25b0ff9df4b9
## What changes were proposed in this pull request?

Remove the `isValid` check in DeltaLog creation as it is unnecessary. We already call `update()` while running any Delta query or operation, and that call accomplishes the same thing as `isValid`: it invalidates the cached snapshot when the underlying table has been recreated.

This change would save us a couple of `listFrom` calls to the underlying storage system, saving users both money and time.

We should be okay after this removal, because (see the sketch after the list below):
 1. update() is called during analysis when running queries
 2. update() is called right before an OptimisticTransaction and we also have a check to ensure that our commit version has not gone backwards after a transaction
 3. Streaming queries always call `listFrom` from the latest version, and don't need isValid in the first place
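
A hedged sketch of the access pattern these points rely on, using Delta's internal `DeltaLog` API (the path is illustrative, and this is a sketch rather than code from the diff):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.delta.DeltaLog

// update() re-lists the transaction log and refreshes the cached snapshot, so
// a dropped-and-recreated table is picked up without a separate isValid check.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val deltaLog = DeltaLog.forTable(spark, "/tmp/delta/events")
val snapshot = deltaLog.update() // triggers a listFrom against the storage system
println(s"Snapshot version after refresh: ${snapshot.version}")
```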

## How was this patch tested?

New unit tests

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

#10603 is resolved by brkyvz/isValid.

GitOrigin-RevId: f5955e43fb2d2d6d53d9b213c18e4fc7b4f1f71a
Removes some archaic terminology from the codebase in the interest of making Delta more inclusive. There is more work to be done as in some cases no alternative APIs exist in Spark, but this is a small step forward.

Closes #458

Signed-off-by: Burak Yavuz <brkyvz@gmail.com>

Author: Michael Armbrust <michael@databricks.com>

#10728 is resolved by brkyvz/wimk4v1l.

GitOrigin-RevId: 489d95dee28ba32d2ee831b6292ccc52453ff672
…and version.sbt

The main goal of this Pull Request is to update the Delta Lake version to 0.7.0 in the README file and version.sbt.

Closes #457

Co-authored-by: mahmoud MEHDI <mahmoud.mehdi@believedigital.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

Author: mahmoud mahdi <mahmoudmahdi24@gmail.com>

#10767 is resolved by tdas/p04zztzp.

GitOrigin-RevId: db895dcf4d04136c43c5259ca056944307ed27c5
The CONVERT TO DELTA command uses the Parquet file sink's MetadataLogFileIndex to read files instead of recursively listing files via HDFS.

Current test coverage was sufficient: https://github.com/databricks/runtime/blob/master/sql/core/src/test/scala/com/databricks/sql/transaction/tahoe/ConvertToDeltaSuiteBase.scala#L234

Author: kian-db <kian.ghodoussi@databricks.com>

GitOrigin-RevId: a79f7b185e546c76118ee851de27ea2ad10d513a
## What changes were proposed in this pull request?
The use of the Parquet MetadataLog during a CONVERT TO DELTA is now controlled by a user flag, as opposed to the metadata log being used by default.

## How was this patch tested?
A unit test was written to ensure all data is read (not just the streamed data in the metadata log) when this flag is set to false.
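
A hedged usage sketch, assuming an active `spark` session; the conf key `spark.databricks.delta.convert.useMetadataLog` is my assumption for the flag introduced here, and the path is illustrative:

```scala
// Assumed flag name: when false, CONVERT TO DELTA lists all files under the
// path instead of trusting the Parquet sink's _spark_metadata log.
spark.conf.set("spark.databricks.delta.convert.useMetadataLog", "false")
spark.sql("CONVERT TO DELTA parquet.`/tmp/parquet-sink-table`")
```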

Author: kian-db <kian.ghodoussi@databricks.com>

#10802 is resolved by kian-db/convert_option.

GitOrigin-RevId: 4f12d0488adc123bae353b9fcf8e59d3f2fd1add
… a view.

## What changes were proposed in this pull request?

DESCRIBE DETAIL assumes it's been provided a table, and when given a view name it will simply report that there's no table by that name. This is a bit confusing, and it's easy enough for us to just check before throwing the error.

## How was this patch tested?

new unit test

Author: Jose Torres <joseph.torres@databricks.com>

#10854 is resolved by jose-torres/fixerrmsg2.

GitOrigin-RevId: 08d6872bf5322afaa77e5df924ebd115a8be17d0
…lta table by path

When a Delta table is accessed by path, the state of the table at analysis time is used during execution; however, we should be using the latest available state. Otherwise, if some external writer changes the table, we will simply hit an error.

Added a unit test

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

GitOrigin-RevId: 8a2a8c0f53d43c8ea6240c61669217c07e23bec1
…without launching a spark job

Author: liwensun <liwen.sun@databricks.com>

GitOrigin-RevId: 0733f71b09091872d42b75fcab13b5655f1744ac
…ary version

## What changes were proposed in this pull request?

Refactors commitLarge to be able to commit the next version with arbitrary actions. `commitLarge` won't retry and will simply throw an exception if a commit for that version already exists.

## How was this patch tested?

YOLO

Author: Burak Yavuz <brkyvz@gmail.com>

#10887 is resolved by brkyvz/cloneReplace.

GitOrigin-RevId: 4bd6bd2b95fe585a6e81d25505407dacd685fe3f
This introduces a new feature to query the version of the last commit written by the current SparkSession across **any** Delta table.

The API is a new SQLConf field in the `SparkSession`’s `SessionState`, `spark.databricks.delta.lastCommitVersionInSession`, which the user can query from SQL or any supported language to find an up-to-date (optional) String encoding of the version of the last commit written by that `SparkSession`.

If no commits have been made by the Spark session, querying the key will return `None`; after commits, it will return `Some(<last_commit_version_in_session>)`.
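
A usage sketch, assuming an active `spark` session (the write path is illustrative; the conf key is taken from the description above):

```scala
// Any Delta commit made by this SparkSession updates the conf afterwards.
spark.range(10).write.format("delta").mode("append").save("/tmp/delta/t1")

// None before any commit in this session, Some("0") after the first one.
val lastVersion =
  spark.conf.getOption("spark.databricks.delta.lastCommitVersionInSession")
println(lastVersion)

// SQL equivalent:
spark.sql("SET spark.databricks.delta.lastCommitVersionInSession").show()
```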

How much do we care about atomicity for this operation? Currently, it's possible that the following occurs:
- A pair of commits are being written concurrently (say `commit(0)` and `commit(1)`)
- The ordering of the commit => update SQLConf could be:
 1. `commit(0)`
 2. `commit(1)`
 3. `setSQLConf(0)`
 4. `setSQLConf(1)`
- A read between steps 3 and 4 would yield an incorrect result. In order to mitigate this behavior, locks would likely need to attach the update operation to the commit.

We have moved the `setSQLConf` as close as possible to the code performing the deltaLog write, thereby shrinking the window in which we could produce invalid reads to a negligible size.

Unit tests

Author: Zach Schuermann <zachary.zvs@gmail.com>

GitOrigin-RevId: f816852145480297ea7244368f5c4c02d3b15dd9
…plicated query

Author: Shixiong Zhu <zsxwing@gmail.com>

GitOrigin-RevId: 3d1d45b6ab7f30dbd7b07841862fec07de319231
…thon tests

This PR creates a temp dir for the Spark warehouse dir so that we don't leak files into the project directory when tests crash or get interrupted.

- Manually ran `delta.tests.test_sql` locally and confirmed the spark warehouse dir changed to a temp dir.

Author: Shixiong Zhu <zsxwing@gmail.com>

GitOrigin-RevId: e8f4d5c0814b9ef57270c326078e205bab43d6e2
…le case

This fixes an obvious omission in the unit test `testNullCaseMatchedOnly`: `isPartitioned` was not set correctly.

Closes #475

Signed-off-by: Rahul Mahadev <rahul.mahadev@databricks.com>

Author: Alan Jin <jinlantao@gmail.com>

#11095 is resolved by rahulsmahadev/ip8ggg3x.

GitOrigin-RevId: b847c41a43591fe30715723d91945b4f813a1c39
…nd NOT MATCHED clauses in MERGE

Added a few more tests to verify current merge clause limits, in
preparation for unlimited merge clause support in Apache Spark.

Author: Wenchen Fan <wenchen@databricks.com>

GitOrigin-RevId: 9416708cf5e0a0e1a590e46499118b166083f3a0
Author: David Lewis <david.lewis@databricks.com>
 - `logInfo` and `recordDeltaOperation` added to `checkAndRetry`

Author: Pranav Anand <anandpranavv@gmail.com>

GitOrigin-RevId: 93573dd6d9aa8991cbbb0aab72d84ef7bb595492
## What changes were proposed in this pull request?

`com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper` is deprecated in [jackson-module-scala 2.10.0](https://github.com/FasterXML/jackson-module-scala/blob/jackson-module-scala-2.10.0/src/main/scala/com/fasterxml/jackson/module/scala/experimental/ScalaObjectMapper.scala#L9). All of the code has been moved to `com.fasterxml.jackson.module.scala.ScalaObjectMapper`. `com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper` is now just an empty trait extending `com.fasterxml.jackson.module.scala.ScalaObjectMapper`.

This PR updates our code to use the new package name `com.fasterxml.jackson.module.scala`, as `com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper` is removed in jackson-module-scala 2.11.0.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <zsxwing@gmail.com>

#11464 is resolved by zsxwing/ScalaObjectMapper.

GitOrigin-RevId: 622c216b5b5bfe7d988208d359a890375b3e9940
Currently, Delta’s MERGE only supports two MATCHED clauses and one NOT MATCHED clause.

Since the Spark 3.0 SQL parser supports any number of MATCHED and NOT MATCHED clauses in MERGE, this functionality is extended to Delta to allow for any number of MATCHED and NOT MATCHED clauses in Delta’s Merge.

The API largely remains the same, with the removal of the limitation on the number of MATCHED/NOT MATCHED clauses:
```sql
MERGE INTO [db_name.]target_table [AS target_alias]
USING [db_name.]source_table [<time_travel_version>] [AS source_alias]
ON <merge_condition>
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
...
[ WHEN NOT MATCHED [ AND <condition> ]  THEN <not_matched_action> ]
[ WHEN NOT MATCHED [ AND <condition> ]  THEN <not_matched_action> ]
...

with:

<matched_action>  =
  DELETE  |
  UPDATE SET *  |
  UPDATE SET column1 = value1 [, column2 = value2 ...]

<not_matched_action>  =
  INSERT *  |
  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])

<time_travel_version>  =
  TIMESTAMP AS OF timestamp_expression |
  VERSION AS OF version
```
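
The Scala builder API gains the same flexibility; a hedged sketch, assuming an active `spark` session (paths, columns, and the `op` discriminator are illustrative):

```scala
import io.delta.tables.DeltaTable

val target = DeltaTable.forPath(spark, "/tmp/delta/target")
val source = spark.read.format("delta").load("/tmp/delta/source")

target.as("t")
  .merge(source.as("s"), "t.key = s.key")
  .whenMatched("s.op = 'delete'").delete()                      // 1st MATCHED clause
  .whenMatched("s.op = 'update'").updateExpr(Map("value" -> "s.value"))
  .whenMatched().updateAll()                                    // 3rd MATCHED clause, now allowed
  .whenNotMatched("s.op = 'upsert'").insertAll()
  .whenNotMatched().insertExpr(Map("key" -> "s.key", "value" -> "0"))
  .execute()
```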

Old unit tests were modified to accommodate the new behavior, and new tests were added. Additionally, quicksilver benchmarks were used to ensure no performance regressions occurred.

Benchmark comparison with master shows no significant regression in most non-trivial merge cases (cases that take more than 10 seconds).

Author: Zach Schuermann <zachary.zvs@gmail.com>

GitOrigin-RevId: 63fc1628cd9b3009f0208d54661fc923960aa563
## What changes were proposed in this pull request?

commitLarge shouldn't change the protocol; otherwise it causes conflicts with other concurrent transactions.
## How was this patch tested?

Unit test

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

#11493 is resolved by brkyvz/cloneRocks.

GitOrigin-RevId: f8b6fe904775be7041ab2324add0d3ec540d17f0
Make `DeltaTable` Serializable so that it can be sent to executors without throwing `NotSerializableException`. However, methods of `DeltaTable` should not be allowed to run on the executors, so they should throw a clear exception explaining this.
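
One plausible shape of that guard, as a hedged sketch (not necessarily the actual implementation; `ensureOnDriver` is a hypothetical helper):

```scala
import org.apache.spark.TaskContext

class DeltaTable(/* ... */) extends Serializable {
  // TaskContext.get() is non-null only inside a task running on an executor,
  // so it can be used to fail fast there with a clear message.
  private def ensureOnDriver(): Unit = {
    if (TaskContext.get() != null) {
      throw new IllegalStateException(
        "DeltaTable cannot be used inside executors; " +
          "its methods must be called on the driver.")
    }
  }

  def toDF: org.apache.spark.sql.DataFrame = {
    ensureOnDriver()
    ??? // real implementation elided
  }
}
```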

Closes #485 #486

Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>

Author: Shixiong Zhu <zsxwing@gmail.com>
Author: Wesley Hoffman <wesleyhoffman109@gmail.com>

#11575 is resolved by zsxwing/jdmirx4o.

GitOrigin-RevId: bb7dd868cb4782a7c09e46d8ad4a20405edbab3d
…th identifiers

## What changes were proposed in this pull request?

Some users connect to Glue as their metastore (catalog). When Spark needs to resolve path-based identifiers, it first needs to check with the catalog whether the user-provided table exists, and only then does it try to resolve the provided identifier as a path identifier. When accessing Glue though, Glue can easily throw a permission error. This shouldn't block users from accessing path-based Delta tables.

## How was this patch tested?

Unit tests in DeltaSuite and ConvertToDeltaSuite

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

#11573 is resolved by brkyvz/glueErrors.

GitOrigin-RevId: b3b49cd38406d1fe0a94864b430b71177a2949c7
…resolution in self-merge

## What changes were proposed in this pull request?
Merge operations through the Scala/Python APIs can cause data corruption when the source DataFrame was built using the same DeltaTable instance as the one used for invoking merge, i.e., something like this:
```scala
val source = deltaTable.toDF.filter(...)
deltaTable.as("t").merge(source.as("s"), "t.key = s.key")
```
This is because `deltaTable.toDF` always returns the same DF with resolved attribute refs, and any source DF generated from it will get the same attribute references. Therefore, when the merge plan is generated by the `DeltaMergeBuilder`, both children (`source` and `target`) have the same refs. All expressions in the merge plan may resolve successfully but still create an ambiguous resolved plan, something like this:

```
'DeltaMergeInto ('t.key = 's.key), [Delete [, ]]
:- SubqueryAlias t
:  +- Relation[key#257L,value#258L] parquet
+- SubqueryAlias s
   +- Filter (key#257L = cast(4 as bigint))
      +- Relation[key#257L,value#258L] parquet
```

This can lead to incorrect results because the expressions may bind to the incorrect columns when executing the merge in MergeIntoCommand. For example, expressions meant to copy unmodified target rows can instead pick up values from correspondingly-named columns in the source row (i.e., pick up the value of the source's `key#257L` instead of the target's `key#257L`).

The solution chosen in this PR is to rewrite all the attribute references in the target plan when duplicate attribute refs are detected between source and target. Here are the details of when this rewrite occurs.

- Rewrite target instead of source - Rewriting refs is a non-trivial challenge in general plans, as shown [here](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L1167). So we choose to rewrite the target plan, which is guaranteed to be simple and contains only a few known plan items.

- Only when duplicate refs are detected - We do not want to always rewrite the references because there may be expressions provided by the user that are pre-resolved on the source and target (i.e., when using the `target("colName")` format to generate expressions), and arbitrary reference rewriting would cause unnecessary resolution errors in unambiguous queries in existing workloads.

When the duplicated refs in the target plan are rewritten, if there are pre-resolved merge expressions (conditions and actions) that use the same refs, then those are likely to be ambiguous. Hence we unresolve those refs and let the standard resolution logic check whether they are ambiguous or not. **This is the part where there is a slight chance of regression: some existing self-merges with pre-resolved expressions that used to accidentally produce correct results will now start failing. However, the chance of such regression is very low.** More specifically, here is the precise characterization of which queries are affected.

- Non-self-merge queries - Not affected, as this PR rewrites plans only when duplicate refs are detected.
- Self-merge queries - there are two cases:
  - Without pre-resolved exprs - There are unambiguous self-merge queries that would generate ambiguous plans and therefore most likely produce incorrect results. With this PR, those unambiguous queries will always produce unambiguous plans and therefore correct results. No risk of failing those workloads, just making them correct.
  - With pre-resolved exprs - The small set of queries that produced ambiguous plans will now throw a resolution error. A very small subset of these queries that passed earlier (but now fail) could have accidentally produced correct results. This is the tiny risk. See the sketch after this list for what a pre-resolved self-merge looks like.
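
For concreteness, a hedged sketch of a self-merge with a pre-resolved condition, assuming an active `spark` session (paths and columns illustrative):

```scala
import io.delta.tables.DeltaTable

val deltaTable = DeltaTable.forPath(spark, "/tmp/delta/t")
val targetDf = deltaTable.toDF
val source = targetDf.filter("value > 0") // shares attribute refs with the target

deltaTable.as("t")
  .merge(source.as("s"), targetDf("key") === source("key")) // pre-resolved condition
  .whenMatched().updateAll()
  .execute()
// With this PR, the duplicate refs trigger a rewrite of the target plan and the
// pre-resolved condition is unresolved again, so standard resolution decides
// whether it is ambiguous rather than silently mis-binding.
```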

## How was this patch tested?

- Added more tests covering the presence of pre-resolved refs in all merge expressions
- Added tests for self merge, without and with pre-resolved refs.

Author: Tathagata Das <tathagata.das1565@gmail.com>

#11610 is resolved by tdas/SC-44184.

GitOrigin-RevId: 6809a073f3342b83d6591281f40565ad513f49f3
…formation

## What changes were proposed in this pull request?

In this PR we add sections to the Delta Protocol doc around the checkpoint format, essentially describing what format the checkpoint should be in, how the data should be distributed if this is a multi-part checkpoint, and what the schema of the checkpoint file should be. We also note the new columns that must exist in the new protocol, when we upgrade to writer version 3.

I also noticed that the action reconciliation section was missing, therefore added that as well.

## How was this patch tested?

N/A

Author: Burak Yavuz <brkyvz@gmail.com>

#11673 is resolved by brkyvz/protDoc.

GitOrigin-RevId: e6a4bb90875a1a09b24b123eb928993bee838556
brkyvz and others added 4 commits August 18, 2020 16:18
## What changes were proposed in this pull request?

In this PR, we introduce the concept of a default protocol version for Delta tables. As we build new features for Delta, we may have to make protocol breaking changes. Protocol upgrades could be a high-friction process as all readers and/or writers to a Delta table may have to be upgraded after a protocol change.

With the new default configurations, users will be able to create tables that can be accessed as widely as possible. Only if they decide to use a new feature will the table protocol require an upgrade. If the table is created with the respective table configuration, then we will choose the minimum required protocol version.

During this process, I found some other irregularities, specifically around table creation. REPLACE wasn't really "creating" a new table as it should in the sense that:
 - Nullability information would be dropped (tests for this will be added separately)
 - default configurations for a table through SQL configs were not being picked up
Therefore I introduced a new `updateMetadataForNewTable` method that will set the protocol for the new table as well. I've also moved protocol verification checks into `updateMetadata` for fail-fast behavior and into `prepareCommit` as a last check.

We also remove the "Upgrade your protocol" message as it can become annoying (it was initially meant to be).
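
A hedged sketch of what this might look like to a user, assuming an active `spark` session; the conf keys are my assumptions, patterned on Delta's `delta.minReaderVersion`/`delta.minWriterVersion` table properties:

```scala
// Assumed session defaults: new tables get the most widely accessible
// protocol unless a feature requires more.
spark.conf.set("spark.databricks.delta.properties.defaults.minReaderVersion", "1")
spark.conf.set("spark.databricks.delta.properties.defaults.minWriterVersion", "2")

// Pinning a specific table to a newer protocol via table properties.
spark.sql("""
  CREATE TABLE events (id BIGINT, ts TIMESTAMP) USING delta
  TBLPROPERTIES ('delta.minWriterVersion' = '3')
""")
```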

## How was this patch tested?

Tests in DeltaProtocolVersionSuite

Author: Burak Yavuz <brkyvz@gmail.com>

#11723 is resolved by brkyvz/protocolConf.

GitOrigin-RevId: df07ae9b8c183f4c9732632d8a7cfcf70522f038
…ckpoint format

## What changes were proposed in this pull request?

Clarify that the field `partitionValues_parsed` is required when the table is partitioned. If it is not partitioned, it may be empty.

## How was this patch tested?

Doc change

Author: Burak Yavuz <brkyvz@gmail.com>

#11899 is resolved by brkyvz/protocolDoc2.

GitOrigin-RevId: 22a36fcf9bad33ab1691446b57ae5e0c61e31d62
## What changes were proposed in this pull request?

Fix logging for Delta updates: we log `Some(...)` today, which makes the logs less readable.

Author: Burak Yavuz <burak@databricks.com>
Author: Burak Yavuz <brkyvz@gmail.com>

#11904 is resolved by brkyvz/fixLogging.

GitOrigin-RevId: df69dfc01d267796988b0f0c8abe8ace66eda425
**Goal:** Add the ability to start structured streaming from a user provided version as outlined in #474

**Desc:** This change will introduce `startingVersion` and `startingTimestamp` to the Delta Stream Reader so that users can define a starting point of their stream source without processing the entire table.

**What's the behavior if both startingVersion and startingTimestamp are set?** Both options cannot be provided together. An error will be thrown telling you to choose only one.

**When a query restarts, how do the new options work?** If a query restarts and a checkpoint location is provided, the stream will start from the checkpoint's offset as opposed to the provided option.

**If the user uses an old Delta Lake version to read the checkpoint, what will happen?** If we use a newer version of Delta with this option, we will store an offset that says: process up to this version and this index. If the stream fails and someone reverts to an older version of Delta, it would try to process ALL the data in the snapshot up to the stored end offset.
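
A usage sketch of the new options, assuming an active `spark` session (the path is illustrative; the option names are taken from the description above):

```scala
// Start reading changes from a known table version...
val fromVersion = spark.readStream
  .format("delta")
  .option("startingVersion", "5")
  .load("/tmp/delta/events")

// ...or from a timestamp. The two options are mutually exclusive; setting both
// throws. On restart, a checkpointed offset takes precedence over either option.
val fromTimestamp = spark.readStream
  .format("delta")
  .option("startingTimestamp", "2020-09-01")
  .load("/tmp/delta/events")
```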

Credit to brkyvz for a large percentage of the work.

Closes #478

Signed-off-by: Burak Yavuz <brkyvz@gmail.com>

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Wesley Hoffman <wesleyhoffman109@gmail.com>

GitOrigin-RevId: c654217a70b9f6bbee6e96a079223a693dd85e8a
JassAbidi merged commit cf7a9c9 into JassAbidi:master on Sep 3, 2020