update fork branch #6

Merged
merged 34 commits into JassAbidi:master on Sep 3, 2020
Conversation

JassAbidi
Owner

No description provided.

zsxwing and others added 30 commits June 16, 2020 17:54
Add a usage log when we detect an inconsistent list. We need this usage log to set up an alert job for this.

Jenkins

Author: Shixiong Zhu <zsxwing@gmail.com>

GitOrigin-RevId: 7b2a8ed5b56eacff1de820e83714a23d8f057000
- Added support for the AdmissionControl trait

new unit test

Author: Tathagata Das <tathagata.das1565@gmail.com>

GitOrigin-RevId: 4d96f3bc4c90a7aedf5594527da88c23f3954483
## What changes were proposed in this pull request?

- as the title says

## How was this patch tested?
all tests

Author: Tathagata Das <tathagata.das1565@gmail.com>

#10658 is resolved by tdas/SC-37900.

GitOrigin-RevId: 8a8994dfd7979185294697bfdd65642a7a230653
…hon/scala

- Modified old tests to work with Spark 3.0 by adding the right spark conf
- Added a new test to test SQL on metastore tables and paths
- Updated integration testing script to run the examples in a fine-grained manner.

Originally done by @rahulsmahadev in #426

Closes #451

Co-authored-by: Rahul Mahadev <rahul.mahadev@databricks.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

Author: Tathagata Das <tathagata.das1565@gmail.com>

#10659 is resolved by tdas/qfwc2lq2.

GitOrigin-RevId: 4247f17fc78e741b01f5ee47cc8b96f776f4b250
 - Move `streamWrite` method to `DeltaCommand` so that it can be used elsewhere as a utility

Author: Pranav Anand <anandpranavv@gmail.com>
Author: Burak Yavuz <brkyvz@gmail.com>

#9806 is resolved by pranavanand/cloneDelta.

GitOrigin-RevId: 68179f9f08b619ade564eb727cc0ee5ba1960a5e
As the title says

Author: Ali Afroozeh <ali.afroozeh@databricks.com>

GitOrigin-RevId: 390f0ba30287117256d441b23ea2c0cf472c90fa
Follow-up to #413

Closes #455

Signed-off-by: Burak Yavuz <brkyvz@gmail.com>

Author: Jacek Laskowski <jacek@japila.pl>

#10720 is resolved by brkyvz/s4v1b75w.

GitOrigin-RevId: 29491b7c64f0ee773bdee5c6721a25b0ff9df4b9
## What changes were proposed in this pull request?

Remove the `isValid` check in DeltaLog creation as it is unnecessary. We already call `update()` while running any Delta query or operation, and that call accomplishes the same thing as `isValid`: it invalidates the cached snapshot when the underlying table has been recreated.

This change would save us a couple of `listFrom` calls to the underlying storage system, saving users both money and time.

We should be okay after this removal, because (see the sketch after the list below):
 1. update() is called during analysis when running queries
 2. update() is called right before an OptimisticTransaction and we also have a check to ensure that our commit version has not gone backwards after a transaction
 3. Streaming queries always call `listFrom` from the latest version, and don't need isValid in the first place
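
A hedged sketch of the access pattern these points rely on, using Delta's internal `DeltaLog` API (the path is illustrative, and this is a sketch rather than code from the diff):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.delta.DeltaLog

// update() re-lists the transaction log and refreshes the cached snapshot, so
// a dropped-and-recreated table is picked up without a separate isValid check.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val deltaLog = DeltaLog.forTable(spark, "/tmp/delta/events")
val snapshot = deltaLog.update() // triggers a listFrom against the storage system
println(s"Snapshot version after refresh: ${snapshot.version}")
```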

## How was this patch tested?

New unit tests

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

#10603 is resolved by brkyvz/isValid.

GitOrigin-RevId: f5955e43fb2d2d6d53d9b213c18e4fc7b4f1f71a
Removes some archaic terminology from the codebase in the interest of making Delta more inclusive. There is more work to be done as in some cases no alternative APIs exist in Spark, but this is a small step forward.

Closes #458

Signed-off-by: Burak Yavuz <brkyvz@gmail.com>

Author: Michael Armbrust <michael@databricks.com>

#10728 is resolved by brkyvz/wimk4v1l.

GitOrigin-RevId: 489d95dee28ba32d2ee831b6292ccc52453ff672
…and version.sbt

The main goal of this Pull Request is to update the Delta Lake version to 0.7.0 in the README file and version.sbt.

Closes #457

Co-authored-by: mahmoud MEHDI <mahmoud.mehdi@believedigital.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

Author: mahmoud mahdi <mahmoudmahdi24@gmail.com>

#10767 is resolved by tdas/p04zztzp.

GitOrigin-RevId: db895dcf4d04136c43c5259ca056944307ed27c5
The CONVERT TO DELTA command uses the Parquet file sink's MetadataLogFileIndex to read files instead of recursively listing files via HDFS.

Current test coverage was sufficient: https://github.com/databricks/runtime/blob/master/sql/core/src/test/scala/com/databricks/sql/transaction/tahoe/ConvertToDeltaSuiteBase.scala#L234

Author: kian-db <kian.ghodoussi@databricks.com>

GitOrigin-RevId: a79f7b185e546c76118ee851de27ea2ad10d513a
## What changes were proposed in this pull request?
The use of the Parquet MetadataLog during a CONVERT TO DELTA is now controlled by a user flag, as opposed to the metadata log being used by default.

## How was this patch tested?
A unit test was written to ensure all data is read (not just the streamed data in the metadata log) when this flag is set to false.
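
A hedged usage sketch, assuming an active `spark` session; the conf key `spark.databricks.delta.convert.useMetadataLog` is my assumption for the flag introduced here, and the path is illustrative:

```scala
// Assumed flag name: when false, CONVERT TO DELTA lists all files under the
// path instead of trusting the Parquet sink's _spark_metadata log.
spark.conf.set("spark.databricks.delta.convert.useMetadataLog", "false")
spark.sql("CONVERT TO DELTA parquet.`/tmp/parquet-sink-table`")
```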

Author: kian-db <kian.ghodoussi@databricks.com>

#10802 is resolved by kian-db/convert_option.

GitOrigin-RevId: 4f12d0488adc123bae353b9fcf8e59d3f2fd1add
… a view.

## What changes were proposed in this pull request?

DESCRIBE DETAIL assumes it's been provided a table, and when given a view name it will simply report that there's no table by that name. This is a bit confusing, and it's easy enough for us to just check before throwing the error.

## How was this patch tested?

new unit test

Author: Jose Torres <joseph.torres@databricks.com>

#10854 is resolved by jose-torres/fixerrmsg2.

GitOrigin-RevId: 08d6872bf5322afaa77e5df924ebd115a8be17d0
…lta table by path

When a Delta table is accessed by path, the state of the table at analysis time is used during execution; however, we should be using the latest available state. Otherwise, if some external writer changes the table, we will simply hit an error.

Added a unit test

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

GitOrigin-RevId: 8a2a8c0f53d43c8ea6240c61669217c07e23bec1
…without launching a spark job

Author: liwensun <liwen.sun@databricks.com>

GitOrigin-RevId: 0733f71b09091872d42b75fcab13b5655f1744ac
…ary version

## What changes were proposed in this pull request?

Refactors commitLarge to be able to commit the next version with arbitrary actions. `commitLarge` won't retry and will simply throw an exception if a commit for that version already exists.

## How was this patch tested?

YOLO

Author: Burak Yavuz <brkyvz@gmail.com>

#10887 is resolved by brkyvz/cloneReplace.

GitOrigin-RevId: 4bd6bd2b95fe585a6e81d25505407dacd685fe3f
This introduces a new feature to query the version of the last commit written by the current SparkSession across **any** Delta table.

The API is a new SQLConf field in the `SparkSession`’s `SessionState`, `spark.databricks.delta.lastCommitVersionInSession`, which the user can query from SQL or any supported language to find an up-to-date (optional) String encoding of the version of the last commit written by that `SparkSession`.

If no commits have been made by the Spark session, querying the key will return `None`; after commits, it will return `Some(<last_commit_version_in_session>)`.
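
A usage sketch, assuming an active `spark` session (the write path is illustrative; the conf key is taken from the description above):

```scala
// Any Delta commit made by this SparkSession updates the conf afterwards.
spark.range(10).write.format("delta").mode("append").save("/tmp/delta/t1")

// None before any commit in this session, Some("0") after the first one.
val lastVersion =
  spark.conf.getOption("spark.databricks.delta.lastCommitVersionInSession")
println(lastVersion)

// SQL equivalent:
spark.sql("SET spark.databricks.delta.lastCommitVersionInSession").show()
```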

How much do we care about atomicity for this operation? Currently, it's possible that the following occurs:
- A pair of commits are being written concurrently (say `commit(0)` and `commit(1)`)
- The ordering of the commit => update SQLConf could be:
 1. `commit(0)`
 2. `commit(1)`
 3. `setSQLConf(0)`
 4. `setSQLConf(1)`
- A read between steps 3 and 4 would yield an incorrect result. In order to mitigate this behavior, locks would likely need to attach the update operation to the commit.

We have moved the `setSQLConf` as close as possible to the code performing the deltaLog write, thereby shrinking the window in which we could produce invalid reads to a negligible size.

Unit tests

Author: Zach Schuermann <zachary.zvs@gmail.com>

GitOrigin-RevId: f816852145480297ea7244368f5c4c02d3b15dd9
…plicated query

Author: Shixiong Zhu <zsxwing@gmail.com>

GitOrigin-RevId: 3d1d45b6ab7f30dbd7b07841862fec07de319231
…thon tests

This PR creates a temp dir for the Spark warehouse dir so that we don't leak files into the project directory when tests crash or get interrupted.

- Manually ran `delta.tests.test_sql` locally and confirmed the spark warehouse dir changed to a temp dir.

Author: Shixiong Zhu <zsxwing@gmail.com>

GitOrigin-RevId: e8f4d5c0814b9ef57270c326078e205bab43d6e2
…le case

This fixes an obvious omission in the unit test `testNullCaseMatchedOnly`: `isPartitioned` was not set correctly.

Closes #475

Signed-off-by: Rahul Mahadev <rahul.mahadev@databricks.com>

Author: Alan Jin <jinlantao@gmail.com>

#11095 is resolved by rahulsmahadev/ip8ggg3x.

GitOrigin-RevId: b847c41a43591fe30715723d91945b4f813a1c39
…nd NOT MATCHED clauses in MERGE

Added a few more tests to verify current merge clause limits, in
preparation for unlimited merge clause support in Apache Spark.

Author: Wenchen Fan <wenchen@databricks.com>

GitOrigin-RevId: 9416708cf5e0a0e1a590e46499118b166083f3a0
Author: David Lewis <david.lewis@databricks.com>
 - `logInfo` and `recordDeltaOperation` added to `checkAndRetry`

Author: Pranav Anand <anandpranavv@gmail.com>

GitOrigin-RevId: 93573dd6d9aa8991cbbb0aab72d84ef7bb595492
## What changes were proposed in this pull request?

`com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper` is deprecated in [jackson-module-scala 2.10.0](https://github.com/FasterXML/jackson-module-scala/blob/jackson-module-scala-2.10.0/src/main/scala/com/fasterxml/jackson/module/scala/experimental/ScalaObjectMapper.scala#L9). All of the code has been moved to `com.fasterxml.jackson.module.scala.ScalaObjectMapper`. `com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper` is now just an empty trait extending `com.fasterxml.jackson.module.scala.ScalaObjectMapper`.

This PR updates our code to use the new package name `com.fasterxml.jackson.module.scala`, as `com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper` is removed in jackson-module-scala 2.11.0.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <zsxwing@gmail.com>

#11464 is resolved by zsxwing/ScalaObjectMapper.

GitOrigin-RevId: 622c216b5b5bfe7d988208d359a890375b3e9940
Currently, Delta’s MERGE only supports two MATCHED clauses and one NOT MATCHED clause.

Since the Spark 3.0 SQL parser supports any number of MATCHED and NOT MATCHED clauses in MERGE, this functionality is extended to Delta to allow for any number of MATCHED and NOT MATCHED clauses in Delta’s Merge.

The API largely remains the same, with the removal of the limitation on the number of MATCHED/NOT MATCHED clauses:
```sql
MERGE INTO [db_name.]target_table [AS target_alias]
USING [db_name.]source_table [<time_travel_version>] [AS source_alias]
ON <merge_condition>
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
...
[ WHEN NOT MATCHED [ AND <condition> ]  THEN <not_matched_action> ]
[ WHEN NOT MATCHED [ AND <condition> ]  THEN <not_matched_action> ]
...

with:

<matched_action>  =
  DELETE  |
  UPDATE SET *  |
  UPDATE SET column1 = value1 [, column2 = value2 ...]

<not_matched_action>  =
  INSERT *  |
  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])

<time_travel_version>  =
  TIMESTAMP AS OF timestamp_expression |
  VERSION AS OF version
```
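
The Scala builder API gains the same flexibility; a hedged sketch, assuming an active `spark` session (paths, columns, and the `op` discriminator are illustrative):

```scala
import io.delta.tables.DeltaTable

val target = DeltaTable.forPath(spark, "/tmp/delta/target")
val source = spark.read.format("delta").load("/tmp/delta/source")

target.as("t")
  .merge(source.as("s"), "t.key = s.key")
  .whenMatched("s.op = 'delete'").delete()                      // 1st MATCHED clause
  .whenMatched("s.op = 'update'").updateExpr(Map("value" -> "s.value"))
  .whenMatched().updateAll()                                    // 3rd MATCHED clause, now allowed
  .whenNotMatched("s.op = 'upsert'").insertAll()
  .whenNotMatched().insertExpr(Map("key" -> "s.key", "value" -> "0"))
  .execute()
```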

Old unit tests were modified to accommodate the new behavior, and new tests were added. Additionally, quicksilver benchmarks were used to ensure no performance regressions occurred.

Benchmark comparison with master shows no significant regression in most non-trivial merge cases (cases that take more than 10 seconds).

Author: Zach Schuermann <zachary.zvs@gmail.com>

GitOrigin-RevId: 63fc1628cd9b3009f0208d54661fc923960aa563
## What changes were proposed in this pull request?

commitLarge shouldn't change the protocol; otherwise it causes conflicts with other concurrent transactions.
## How was this patch tested?

Unit test

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

#11493 is resolved by brkyvz/cloneRocks.

GitOrigin-RevId: f8b6fe904775be7041ab2324add0d3ec540d17f0
Make `DeltaTable` Serializable so that it can be sent to executors without throwing `NotSerializableException`. However, methods of `DeltaTable` should not be allowed to run on the executors, so they should throw a clear exception explaining this.
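
One plausible shape of that guard, as a hedged sketch (not necessarily the actual implementation; `ensureOnDriver` is a hypothetical helper):

```scala
import org.apache.spark.TaskContext

class DeltaTable(/* ... */) extends Serializable {
  // TaskContext.get() is non-null only inside a task running on an executor,
  // so it can be used to fail fast there with a clear message.
  private def ensureOnDriver(): Unit = {
    if (TaskContext.get() != null) {
      throw new IllegalStateException(
        "DeltaTable cannot be used inside executors; " +
          "its methods must be called on the driver.")
    }
  }

  def toDF: org.apache.spark.sql.DataFrame = {
    ensureOnDriver()
    ??? // real implementation elided
  }
}
```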

Closes #485 #486

Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>

Author: Shixiong Zhu <zsxwing@gmail.com>
Author: Wesley Hoffman <wesleyhoffman109@gmail.com>

#11575 is resolved by zsxwing/jdmirx4o.

GitOrigin-RevId: bb7dd868cb4782a7c09e46d8ad4a20405edbab3d
…th identifiers

## What changes were proposed in this pull request?

Some users connect to Glue as their metastore (catalog). When Spark needs to resolve path-based identifiers, it first needs to check with the catalog whether the user-provided table exists, and only then does it try to resolve the provided identifier as a path identifier. When accessing Glue though, Glue can easily throw a permission error. This shouldn't block users from accessing path-based Delta tables.

## How was this patch tested?

Unit tests in DeltaSuite and ConvertToDeltaSuite

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

#11573 is resolved by brkyvz/glueErrors.

GitOrigin-RevId: b3b49cd38406d1fe0a94864b430b71177a2949c7
…resolution in self-merge

## What changes were proposed in this pull request?
Merge operations through the Scala/Python APIs can cause data corruption when the source DataFrame was built using the same DeltaTable instance as the one used for invoking merge, i.e., something like this:
```scala
val source = deltaTable.toDF.filter(...)
deltaTable.as("t").merge(source.as("s"), "t.key = s.key")
```
This is because `deltaTable.toDF` always returns the same DF with resolved attribute refs, and any source DF generated from it will get the same attribute references. Therefore, when the merge plan is generated by the `DeltaMergeBuilder`, both children (`source` and `target`) have the same refs. All expressions in the merge plan may resolve successfully but still create an ambiguous resolved plan, something like this:

```
'DeltaMergeInto ('t.key = 's.key), [Delete [, ]]
:- SubqueryAlias t
:  +- Relation[key#257L,value#258L] parquet
+- SubqueryAlias s
   +- Filter (key#257L = cast(4 as bigint))
      +- Relation[key#257L,value#258L] parquet
```

This can lead to incorrect results because the expressions may bind to the incorrect columns when executing the merge in MergeIntoCommand. For example, expressions meant to copy unmodified target rows can instead pick up values from correspondingly-named columns in the source row (i.e., pick up the value of the source's `key#257L` instead of the target's `key#257L`).

The solution chosen in this PR is to rewrite all the attribute references in the target plan when duplicate attribute refs are detected between source and target. Here are the details of when this rewrite occurs.

- Rewrite target instead of source - Rewriting refs is a non-trivial challenge in general plans, as shown [here](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L1167). So we choose to rewrite the target plan, which is guaranteed to be simple and contains only a few known plan items.

- Only when duplicate refs are detected - We do not want to always rewrite the references because there may be expressions provided by the user that are pre-resolved on the source and target (i.e., when using the `target("colName")` format to generate expressions), and arbitrary reference rewriting would cause unnecessary resolution errors in unambiguous queries in existing workloads.

When the duplicated refs in the target plan are rewritten, if there are pre-resolved merge expressions (conditions and actions) that use the same refs, then those are likely to be ambiguous. Hence we unresolve those refs and let the standard resolution logic check whether they are ambiguous or not. **This is the part where there is a slight chance of regression: some existing self-merges with pre-resolved expressions that used to accidentally produce correct results will now start failing. However, the chance of such regression is very low.** More specifically, here is the precise characterization of which queries are affected.

- Non-self-merge queries - Not affected, as this PR rewrites plans only when duplicate refs are detected.
- Self-merge queries - there are two cases:
  - Without pre-resolved exprs - There are unambiguous self-merge queries that would generate ambiguous plans and therefore most likely produce incorrect results. With this PR, those unambiguous queries will always produce unambiguous plans and therefore correct results. No risk of failing those workloads, just making them correct.
  - With pre-resolved exprs - The small set of queries that produced ambiguous plans will now throw a resolution error. A very small subset of these queries that passed earlier (but now fail) could have accidentally produced correct results. This is the tiny risk. See the sketch after this list for what a pre-resolved self-merge looks like.
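
For concreteness, a hedged sketch of a self-merge with a pre-resolved condition, assuming an active `spark` session (paths and columns illustrative):

```scala
import io.delta.tables.DeltaTable

val deltaTable = DeltaTable.forPath(spark, "/tmp/delta/t")
val targetDf = deltaTable.toDF
val source = targetDf.filter("value > 0") // shares attribute refs with the target

deltaTable.as("t")
  .merge(source.as("s"), targetDf("key") === source("key")) // pre-resolved condition
  .whenMatched().updateAll()
  .execute()
// With this PR, the duplicate refs trigger a rewrite of the target plan and the
// pre-resolved condition is unresolved again, so standard resolution decides
// whether it is ambiguous rather than silently mis-binding.
```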

## How was this patch tested?

- Added more tests covering the presence of pre-resolved refs in all merge expressions
- Added tests for self merge, without and with pre-resolved refs.

Author: Tathagata Das <tathagata.das1565@gmail.com>

#11610 is resolved by tdas/SC-44184.

GitOrigin-RevId: 6809a073f3342b83d6591281f40565ad513f49f3
…formation

## What changes were proposed in this pull request?

In this PR we add sections to the Delta Protocol doc around the checkpoint format, essentially describing what format the checkpoint should be in, how the data should be distributed if this is a multi-part checkpoint, and what the schema of the checkpoint file should be. We also note the new columns that must exist in the new protocol, when we upgrade to writer version 3.

I also noticed that the action reconciliation section was missing, therefore added that as well.

## How was this patch tested?

N/A

Author: Burak Yavuz <brkyvz@gmail.com>

#11673 is resolved by brkyvz/protDoc.

GitOrigin-RevId: e6a4bb90875a1a09b24b123eb928993bee838556
brkyvz and others added 4 commits August 18, 2020 16:18
## What changes were proposed in this pull request?

In this PR, we introduce the concept of a default protocol version for Delta tables. As we build new features for Delta, we may have to make protocol breaking changes. Protocol upgrades could be a high-friction process as all readers and/or writers to a Delta table may have to be upgraded after a protocol change.

With the new default configurations, users will be able to create tables that can be accessed as widely as possible. Only if they decide to use a new feature will the table protocol require an upgrade. If the table is created with the respective table configuration, then we will choose the minimum required protocol version.

During this process, I found some other irregularities, specifically around table creation. REPLACE wasn't really "creating" a new table as it should in the sense that:
 - Nullability information would be dropped (tests for this will be added separately)
 - default configurations for a table through SQL configs were not being picked up
Therefore I introduced a new `updateMetadataForNewTable` method that will set the protocol for the new table as well. I've also moved protocol verification checks into `updateMetadata` for fail-fast behavior and into `prepareCommit` as a last check.

We also remove the "Upgrade your protocol" message as it can become annoying (it was initially meant to be).
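
A hedged sketch of what this might look like to a user, assuming an active `spark` session; the conf keys are my assumptions, patterned on Delta's `delta.minReaderVersion`/`delta.minWriterVersion` table properties:

```scala
// Assumed session defaults: new tables get the most widely accessible
// protocol unless a feature requires more.
spark.conf.set("spark.databricks.delta.properties.defaults.minReaderVersion", "1")
spark.conf.set("spark.databricks.delta.properties.defaults.minWriterVersion", "2")

// Pinning a specific table to a newer protocol via table properties.
spark.sql("""
  CREATE TABLE events (id BIGINT, ts TIMESTAMP) USING delta
  TBLPROPERTIES ('delta.minWriterVersion' = '3')
""")
```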

## How was this patch tested?

Tests in DeltaProtocolVersionSuite

Author: Burak Yavuz <brkyvz@gmail.com>

#11723 is resolved by brkyvz/protocolConf.

GitOrigin-RevId: df07ae9b8c183f4c9732632d8a7cfcf70522f038
…ckpoint format

## What changes were proposed in this pull request?

Clarify that the field `partitionValues_parsed` is required when the table is partitioned. If it is not partitioned, it may be empty.

## How was this patch tested?

Doc change

Author: Burak Yavuz <brkyvz@gmail.com>

#11899 is resolved by brkyvz/protocolDoc2.

GitOrigin-RevId: 22a36fcf9bad33ab1691446b57ae5e0c61e31d62
## What changes were proposed in this pull request?

Fix logging for Delta updates: we log `Some(...)` today, which makes the logs less readable.

Author: Burak Yavuz <burak@databricks.com>
Author: Burak Yavuz <brkyvz@gmail.com>

#11904 is resolved by brkyvz/fixLogging.

GitOrigin-RevId: df69dfc01d267796988b0f0c8abe8ace66eda425
**Goal:** Add the ability to start structured streaming from a user provided version as outlined in #474

**Desc:** This change will introduce `startingVersion` and `startingTimestamp` to the Delta Stream Reader so that users can define a starting point of their stream source without processing the entire table.

**What's the behavior if both startingVersion and startingTimestamp are set?** Both options cannot be provided together. An error will be thrown telling you to choose only one.

**When a query restarts, how do the new options work?** If a query restarts and a checkpoint location is provided, the stream will start from the checkpoint's offset as opposed to the provided option.

**If the user uses an old Delta Lake version to read the checkpoint, what will happen?** If we use a newer version of Delta with this option, we will store an offset that says: process up to this version and this index. If the stream fails and someone reverts to an older version of Delta, it would try to process ALL the data in the snapshot up to the stored end offset.
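
A usage sketch of the new options, assuming an active `spark` session (the path is illustrative; the option names are taken from the description above):

```scala
// Start reading changes from a known table version...
val fromVersion = spark.readStream
  .format("delta")
  .option("startingVersion", "5")
  .load("/tmp/delta/events")

// ...or from a timestamp. The two options are mutually exclusive; setting both
// throws. On restart, a checkpointed offset takes precedence over either option.
val fromTimestamp = spark.readStream
  .format("delta")
  .option("startingTimestamp", "2020-09-01")
  .load("/tmp/delta/events")
```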

Credit to brkyvz for a large percentage of the work.

Closes #478

Signed-off-by: Burak Yavuz <brkyvz@gmail.com>

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Wesley Hoffman <wesleyhoffman109@gmail.com>

GitOrigin-RevId: c654217a70b9f6bbee6e96a079223a693dd85e8a
JassAbidi merged commit cf7a9c9 into JassAbidi:master on Sep 3, 2020