update fork #9
Merged
Conversation
Author: Linhong Liu <linhong.liu@databricks.com> (cherry picked from commit 3d79e78ee2fd05936ffb87b67ec1039e26257ba5) Signed-off-by: Ubuntu <ubuntu@ip-10-110-16-101.us-west-2.compute.internal> GitOrigin-RevId: 6524e412ea882e46e3d15db4a9ba22eee6eec125
This PR adds a test so that we can detect #618 Author: Tathagata Das <tathagata.das1565@gmail.com> GitOrigin-RevId: 1af03b03f64c607c8f61b41eef678f3a72355ad7
Author: herman <herman@databricks.com> GitOrigin-RevId: b03bd9be547625516b0cef80522322af85a4d05a
Author: Tathagata Das <tathagata.das1565@gmail.com> GitOrigin-RevId: 1cc8f84c0c5a3c04910feed934d3ad66b869ae77
Author: yaohua <yaohua.zhao@databricks.com> GitOrigin-RevId: bd103eeeb672424b8133aea732b9d66381f4d77f
Author: liwensun <liwen.sun@databricks.com> GitOrigin-RevId: d48e075d5b6cdc55bf2b139016b3399ecfccf10c
Add missing fields in the RemoveFile of the protocol specification. Closes #613 Signed-off-by: Rahul Mahadev <rahul.mahadev@databricks.com> Author: fvaleye <florian.valeye@gmail.com> #19691 is resolved by rahulsmahadev/yhhem4p5. GitOrigin-RevId: bf3b646a83d830be1b04d83e2cdb566f744dfd39
Before this PR, when we create a checkpoint, we use the snapshot established at the transaction start and create a checkpoint for that version. This can cause multiple transactions to checkpoint the same version and potentially lead to a corrupted checkpoint status. In this PR, we change to checkpointing the version committed by this transaction to avoid such a scenario. The downside of this approach is that we occasionally pay an extra cost to get the new snapshot. We expect this to be a rare case and can tolerate the cost. Existing tests. Author: Meng Tong <meng.tong@databricks.com> GitOrigin-RevId: 1858f58f69e0618924709631b67f81fe1d0b863d
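A minimal sketch of the new flow, assuming `txn.commit` returns the committed version and that `getSnapshotAt` and `checkpoint` behave as in Delta's `DeltaLog` and `Checkpoints` APIs:

```scala
// Before: checkpoint the snapshot captured at transaction start (possibly stale).
// After: load and checkpoint the snapshot at the version actually committed.
val committedVersion = txn.commit(actions, operation)                // new version
val snapshotToCheckpoint = deltaLog.getSnapshotAt(committedVersion)  // the occasional extra cost
deltaLog.checkpoint(snapshotToCheckpoint)
```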
Currently Delta allows some safe type changes, such as from SMALLINT to INT. However, this may break the generated column contract. For example, let's say we have a column c1 and a generated column c2 defined as `CAST(hash(c1 + 32767s) AS SMALLINT)`. When c1's type is SMALLINT and we insert `32767s`, the expression returns 31349, but if c1's type is INT and we insert `32767`, the expression returns 9876. This means changing the column type may require rewriting the existing data. Since that is too heavy, we simply disallow it. New unit test Author: Shixiong Zhu <zsxwing@gmail.com> GitOrigin-RevId: 93a0475fed83ec6a751e4b05902aec3fa71410a5
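A minimal sketch of the mechanism, assuming a local SparkSession and default (non-ANSI) overflow semantics; the exact hash outputs are whatever Spark's Murmur3 implementation produces:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// With two SMALLINT operands, 32767s + 32767s wraps around to -2, so hash() sees -2.
spark.sql("SELECT CAST(hash(CAST(32767 AS SMALLINT) + 32767S) AS SMALLINT)").show()

// With an INT operand, the addition widens to INT and yields 65534, so hash() sees 65534.
spark.sql("SELECT CAST(hash(32767 + 32767S) AS SMALLINT)").show()
```

The two queries feed different values into hash(), so the generated column produces different results for what the user considers the same input.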
Improve vacuum logging Add unit test Author: Rahul Mahadev <rahul.mahadev@databricks.com> GitOrigin-RevId: 44ffdb72030de6ac6aadb9590238effe43bbaf4d
Converting all the predicates into CNF may result in a very long predicate, and the generated code becomes unnecessarily large. We should follow the approach of apache/spark#29101, which gracefully extracts only the convertible predicates. Author: Gengliang Wang <gengliang.wang@databricks.com> GitOrigin-RevId: 9ecedbd83f85b8235262c3de0bd83b4540cbc560
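A toy sketch (not Spark's implementation) of why naive CNF conversion blows up: distributing OR over AND doubles the clause count at each step, so a disjunction of n conjunctive pairs expands to 2^n conjuncts:

```scala
sealed trait Expr
case class Leaf(name: String) extends Expr
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr

// Convert to CNF by recursively distributing OR over AND.
def toCnf(e: Expr): Expr = e match {
  case Leaf(_)   => e
  case And(a, b) => And(toCnf(a), toCnf(b))
  case Or(a, b) =>
    (toCnf(a), toCnf(b)) match {
      case (And(x, y), c) => And(toCnf(Or(x, c)), toCnf(Or(y, c)))
      case (c, And(x, y)) => And(toCnf(Or(c, x)), toCnf(Or(c, y)))
      case (x, y)         => Or(x, y)
    }
}

def countConjuncts(e: Expr): Int = e match {
  case And(a, b) => countConjuncts(a) + countConjuncts(b)
  case _         => 1
}

// (a1 AND a2) OR (b1 AND b2) OR (c1 AND c2): 3 disjuncts become 2^3 = 8 conjuncts.
val pred = Or(Or(And(Leaf("a1"), Leaf("a2")), And(Leaf("b1"), Leaf("b2"))),
              And(Leaf("c1"), Leaf("c2")))
println(countConjuncts(toCnf(pred)))  // 8
```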
Author: Gengliang Wang <ltnwgl@gmail.com> Author: Gengliang Wang <gengliang.wang@databricks.com> Author: Wenchen Fan <cloud0fan@gmail.com> Author: Wenchen Fan <wenchen@databricks.com> Author: Kris Mok <kris.mok@databricks.com> GitOrigin-RevId: 1d6c55bcd5232af4a0a486dfc3bb97afcc439f1f
…n DeltaFileOperations Added two methods, `recursiveListFrom` and `localListFrom`, into `DeltaFileOperations`. Currently, the only way to list files is to specify a directory path, after which all files under that directory are listed. This is wasteful if we only want _new_ files after a certain filename, since it re-lists the entire directory and filters afterwards. These two methods let you specify a directory and a path (in that directory) from which to list. By taking advantage of `LogStore.listFrom`, only files with filenames after the specified path are returned. - Added tests in `DeltaFileOperationsSuite`. Author: Howard Xiao <howard.xiao@databricks.com> GitOrigin-RevId: af5712e60224835cf00e247c6a9f740e04126968
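A rough sketch of the listing-from semantics using a hypothetical standalone helper (not the actual `DeltaFileOperations` code), assuming Hadoop's `FileSystem` API:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

// Hypothetical helper: return only files in `dir` whose names sort at or after
// `startName`, rather than handing back the entire directory listing.
def listFrom(dir: Path, startName: String): Iterator[FileStatus] = {
  val fs = dir.getFileSystem(new Configuration())
  fs.listStatus(dir).iterator.filter(_.getPath.getName >= startName)
}
```

A local filesystem still has to enumerate the directory, but against an object store a `LogStore.listFrom` implementation can push the start key to the server and avoid transferring the full listing.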
Authored-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit be888b27edfbb0d7ebb2265de1bf74acb8d3d09a) Author: Wenchen Fan <wenchen@databricks.com> GitOrigin-RevId: ff2bea17c03ac092693c1db4610d800889b8ea49
…mited clauses Schema evolution didn't originally work with unlimited clauses. We need a test to prevent this from regressing in the future. Right now there's just a Scala suite test: the unlimited-clauses test harness doesn't support evolution, and the evolution test harness doesn't support unlimited clauses, so it's complicated to write a test that the ACL and CDC extensions of the SQL merge suite will correctly be able to skip. new unit test Author: Jose Torres <joseph.torres@databricks.com> GitOrigin-RevId: 4e492459966fafb23d1d5b3d4fea95656d02cf55
Add Delta Lake cheat sheet to `/examples/cheat_sheet/`. Closes #628 Co-authored-by: Brenner Heintz <brenner.heintz@gmail.com> Signed-off-by: Meng Tong <meng.tong@databricks.com> Author: brennerh1 <65046554+brennerh1@users.noreply.github.com> GitOrigin-RevId: 1086666e5ccf56841f8c5e32d94af61ca14913ff
Author: Lars Kroll <lars.kroll@databricks.com> GitOrigin-RevId: 33a4fcdf50af72096e800bc5b2b4cc45476cb735
Author: Rahul Mahadev <rahul.mahadev@databricks.com> GitOrigin-RevId: e7e94e70ff2795a81ec73f12ba0a91ccf730056c
…for unlimited clauses" Author: Stefan Zeiger <stefan.zeiger@databricks.com> GitOrigin-RevId: 5609916eafa039d5c76f5931ed38dc612cb5c231
## What changes were proposed in this pull request? - Migrate to Spark 3.1.1, adding tests and refactoring ## How was this patch tested? - Existing tests Author: Pranav Anand <anandpranavv@gmail.com> #19552 is resolved by pranavanand/pa-delta311migration. GitOrigin-RevId: fd3b86468f07752fbd3d56f653aec683af05d0b4
Author: Meng Tong <meng.tong@databricks.com> Author: Meng Tong <77353730+mengtong-db@users.noreply.github.com> GitOrigin-RevId: 94ea828441cd90856841af4e8c76f75bd485f6b4
Author: Shixiong Zhu <zsxwing@gmail.com> GitOrigin-RevId: 4f7d0a28f25f3fcfe7097402f7398e9044139526
Rename Change Data Capture to Change Data Feed Ran existing tests Author: Rahul Mahadev <rahul.mahadev@databricks.com> GitOrigin-RevId: e74e8f0d15e9ccdd733522c53a53fcc79eea90a0
Signed-off-by: Jacek Laskowski <jacek@japila.pl> Closes #637 Signed-off-by: Meng Tong <meng.tong@databricks.com> Author: Jacek Laskowski <jacek@japila.pl> #20259 is resolved by mengtong-db/n91aik90. GitOrigin-RevId: 1f5defe5a0da23234a35af929274a03e282c5c33
I know it's a tiny thing, but someone has got to fix it someday :) Closes #596 Signed-off-by: Yijia Cui <yijia.cui@databricks.com> Author: Antonio <tomrubybarreto@gmail.com> #20377 is resolved by yijiacui-db/6zl3xm8p. GitOrigin-RevId: b3b80dc7215f0fc3fcc216cdb766fe66af816304
Author: Tathagata Das <tathagata.das1565@gmail.com> GitOrigin-RevId: d7f6bbc3168107eaba2942826bb945c9f6757737
That should lower memory requirements (as no extra objects are created) and improve readability Signed-off-by: Jacek Laskowski <jacek@japila.pl> Closes #638 Signed-off-by: Meng Tong <meng.tong@databricks.com> Author: Jacek Laskowski <jacek@japila.pl> #20440 is resolved by mengtong-db/gnggx483. GitOrigin-RevId: 1bc6619592a6bbb456112933f0181ed6e61078d2
Expect analysis exception for window functions in merge and update. Unit tests. Author: Yijia Cui <yijia.cui@databricks.com> GitOrigin-RevId: a407868f4d87e8d82119796aa3742edd5a8438ec
…ted Columns Currently we store the generation expressions in the column metadata of the table schema. However, when Spark reads the schema from a table, it will also propagate column metadata to downstream operations. For example, let's say table X is a table that contains generated columns. The following command will create a new table whose column metadata contains the generation expression. This happens in all DBR versions when reading generated column tables.
```
CREATE TABLE Y AS SELECT * FROM X
```
This is not expected. This PR removes the generation expressions from the column metadata before giving the schema to Spark, so that the generation expressions won't leak to downstream operations. However, for old DBR versions, especially the EOS versions, the metadata propagation behavior still exists. But since old DBR versions that don't support generated columns have an old writer version (they don't support writer version 4), we can change the definition of generated columns to: a table has generated columns only if its min writer version is >= 4 and some of its columns contain generation expressions in the metadata. With this definition, tables containing generation expressions but created by old DBR versions will be treated as normal tables, and none of the generated column code paths should be triggered when reading or writing such tables. New unit tests. Author: Shixiong Zhu <zsxwing@gmail.com> GitOrigin-RevId: e3116c0d16c9f868aba03326efd82b72f7971b2c
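A sketch of stripping the expressions before handing the schema to Spark (the metadata key name is an assumption for illustration):

```scala
import org.apache.spark.sql.types.{MetadataBuilder, StructType}

val GenerationExprKey = "delta.generationExpression"  // assumed key name

// Return a copy of the schema with the generation-expression entry removed
// from every column's metadata, so it cannot propagate downstream.
def removeGenerationExpressions(schema: StructType): StructType =
  StructType(schema.map { field =>
    if (field.metadata.contains(GenerationExprKey)) {
      val cleaned = new MetadataBuilder()
        .withMetadata(field.metadata)
        .remove(GenerationExprKey)
        .build()
      field.copy(metadata = cleaned)
    } else field
  })
```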
Author: Wenchen Fan <wenchen@databricks.com> GitOrigin-RevId: f274112da57ebd3397d481b8525584542686b6d0
SHOW CREATE TABLE isn't supported in Spark 3.1. We should catch the exception in the unit test instead of expecting a correct result. Unit test. Author: Yijia Cui <yijia.cui@databricks.com> GitOrigin-RevId: fc94772eb136d48f748fa37d5cca9879027e46cf
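A sketch of the test pattern, assuming a ScalaTest suite and a hypothetical table name (the exception type and message fragment are assumptions):

```scala
import org.apache.spark.sql.AnalysisException

// Assert that the command fails rather than checking its output.
val e = intercept[AnalysisException] {
  spark.sql("SHOW CREATE TABLE my_delta_table")
}
assert(e.getMessage.contains("SHOW CREATE TABLE"))  // assumed message fragment
```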
Author: Meng Tong <meng.tong@databricks.com> GitOrigin-RevId: 13ccf8974ae9df9a142dd7a87d1ee3c902327b0c
…ndefined) Since `None` is used for the end boundary of a Delta table's history, it could also easily be "transferred" up the call chain and be the default input value. That's the purpose of this PR. Signed-off-by: Jacek Laskowski <jacek@japila.pl> Closes #633 Signed-off-by: Yijia Cui <yijia.cui@databricks.com> Author: Yijia Cui <yijia.cui@databricks.com> Author: Jacek Laskowski <jacek@japila.pl> #20448 is resolved by yijiacui-db/1oyk8ltj. GitOrigin-RevId: dfda5f3ed910ba3b607111bed87009caf19ba8fe
…from being resolved twice If, in any way, a DeltaMergeInto generated from the Scala API with `updateAll()` and schema evolution enabled goes through reference resolution twice, it can throw errors because: - The first resolution will expand the `star` in the plan to `x = source.x` assignments for all columns in the source. This may include columns that are in the source but not yet in the target. - The second resolution can then throw an error because it does not recognize the source-only columns introduced by the first pass. The obvious way to solve this is to make the resolution idempotent: the plan undergoes reference resolution only if `plan.resolved` is false. However, this does not handle rare corner cases. The Scala API can generate a DeltaMergeInto in which all the expressions are already resolved. If we add the conditional check on plan.resolved, then in those cases with pre-resolved expressions, DeltaMergeInto may never go through the reference resolution phase and would skip many additional checks besides resolution. This can cause incorrect plans containing wrong target column names: since the target column names are stored as Seq[String] and not expressions, plans containing all-resolved expressions but incorrect column names would be considered already resolved. To get around this, the solution in this PR is to add a boolean field `targetColNameResolved` that explicitly represents whether the target columns have gone through resolution. This `targetColNameResolved` is considered in `expression.resolved` and is set to false by default when generated by the Scala API. This forces all plans to go through the resolution phase, as DeltaMergeInto.resolved will always be false even if all the expressions are resolved. In the resolution logic, the checks on the target columns are done only when `targetColNameResolved` is false, and after the checks it is set to true. This makes the checks robust across multiple passes: once the star has been expanded to columns and the boolean is set to true, future passes will skip the checks. Note: The ideal solution here is to unify the SQL and Scala code paths by having the Scala API generate a MergeIntoTable that in one shot gets converted into a fully resolved DeltaMergeInto, eliminating the possibility of another resolution attempt. We cannot do that now because the Assignment class in MergeIntoTable cannot differentially represent `Assignment(<star>)` and `Assignment(<no-columns-to-update>)`. This is important because the Scala API can generate the latter (the SQL API cannot). This needs to be fixed in Apache Spark, so it will not be available until Spark 3.2. I have added this contextual information as inline docs for future development. Added a test with a function that failed without this change. Author: Tathagata Das <tathagata.das1565@gmail.com> Author: Tathagata Das <tdas@databricks.com> GitOrigin-RevId: d72339b7b48671016b919f5e0f5bb268732fbc68
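A highly simplified sketch of the idea (a hypothetical class, not the actual DeltaMergeInto): `resolved` is forced to stay false until the one-time target-column checks have explicitly run:

```scala
// Hypothetical simplification: a node counts as resolved only once both its
// expressions resolve AND the target-column checks have run.
case class MergeNode(
    expressionsResolved: Boolean,
    targetColNameResolved: Boolean = false) {  // false when built by the Scala API
  def resolved: Boolean = expressionsResolved && targetColNameResolved
}

def resolveReferences(node: MergeNode): MergeNode =
  if (node.targetColNameResolved) node  // checks already ran once; skip on later passes
  else {
    // ... expand the star and validate target column names here ...
    node.copy(targetColNameResolved = true)
  }
```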
Author: Ali Afroozeh <ali.afroozeh@databricks.com> GitOrigin-RevId: 2805d80e6cc1953c45117c1a111bf66de805006c
Signed-off-by: Jacek Laskowski <jacek@japila.pl> Closes #641 Signed-off-by: Yijia Cui <yijia.cui@databricks.com> Author: Jacek Laskowski <jacek@japila.pl> GitOrigin-RevId: d03067231e3b2f73fc32d76f0feb19622d9968b4
…mited clauses Schema evolution didn't originally work with unlimited clauses. We need a test to prevent this from regressing in the future. n/a, test-only PR Author: Jose Torres <joseph.torres@databricks.com> GitOrigin-RevId: 6ba537a531fe591b8fbb8f2a1e03fc242c8f88ab
…s public in Scala Make the concurrent-modification-exception-related APIs public in Scala. Unit tests. Author: Yijia Cui <yijia.cui@databricks.com> GitOrigin-RevId: a47a79db4a5d5bb799dc3510a41d7a0777eb54de
Author: Rahul Mahadev <rahul.mahadev@databricks.com> GitOrigin-RevId: f4210034290eb7c9ab6cca69529c555ef37d9819
Use "change data feed" in the public-facing protocol incompatibility message. n/a Author: Jose Torres <joseph.torres@databricks.com> GitOrigin-RevId: 219e33773ce7f06b56d879aff521f544395f1ecb
Drop the cdc field from the checkpoint file. (Note that the actual CDC actions are already filtered out of the snapshot state in InMemoryLogReplay - right now this column is always null.) new unit test Author: Jose Torres <joseph.torres@databricks.com> GitOrigin-RevId: 8567d09a99b8fbba053930dff5695e0e67238961
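A minimal sketch of the change (hypothetical DataFrame and path names; the real logic lives in Delta's checkpoint writer):

```scala
// The cdc column is always null in the replayed snapshot state, so drop it
// before the state is written out as a checkpoint file.
val checkpointState = snapshotState.drop("cdc")
checkpointState.write.parquet(checkpointPath)  // assumed write step
```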
Author: Shixiong Zhu <zsxwing@gmail.com> GitOrigin-RevId: ef04d52ba1110134cb0eccf54d77342705c2c6f8
Fix MERGE INTO evolution for partial updates of nested columns. We need to pass the flag that permits struct evolution: even though the UPDATE operation doesn't actually reference the new columns, generateUpdateExpressions will implicitly generate them in order to produce one update action per target column. new unit test Instead of throwing an error, this use case will now work. Author: Jose Torres <joseph.torres@databricks.com> GitOrigin-RevId: 5a4b68082bb329822d4361b7e4a764ef061cf878
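A hedged repro sketch with hypothetical `target` and `source` Delta tables, where `target.nested` is a struct and the source carries new columns; the conf name is Delta's schema auto-merge setting:

```scala
// Enable automatic schema evolution for MERGE.
spark.sql("SET spark.databricks.delta.schema.autoMerge.enabled = true")

// The UPDATE touches only one nested field, yet update expressions are
// generated for every target column, so struct evolution must be permitted.
spark.sql("""
  MERGE INTO target t
  USING source s
  ON t.key = s.key
  WHEN MATCHED THEN UPDATE SET t.nested.a = s.nested.a
""")
```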
Making Delta a multi-module project will enable us to add other sub-modules. For example, we can then add a contribs sub-module for contributions from the community that need to be very closely tied to the delta-core project (hence in this repo, and not delta/connectors) but do not have the same level of maturity as delta-core. Changes made in the Delta repo: - Moved all files in root/src/ to root/core/src/ - Updated build.sbt to define multiple modules - Removed dependency on spark-packages. existing tests Closes #644 Author: Tathagata Das <tathagata.das1565@gmail.com> GitOrigin-RevId: 68038d27302e82f6e680fe717633109757e48ba0
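A sketch of the resulting multi-module layout in sbt's standard style (the contribs module is the future example named above, not something added by this PR):

```scala
// build.sbt (sketch): the former root sources now live under core/.
lazy val core = (project in file("core"))
  .settings(name := "delta-core")

// Possible future community sub-module, closely tied to core but less mature.
lazy val contribs = (project in file("contribs"))
  .settings(name := "delta-contribs")
  .dependsOn(core)
```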
…f bintray As the title says: manual publish to Sonatype staging. Author: Tathagata Das <tathagata.das1565@gmail.com> Author: Tathagata Das <tdas@databricks.com> GitOrigin-RevId: e4d76cf07334e20dd0ef4238430690944df01189
Make the concurrent-modification-exception-related APIs public in Python. Unit tests. Author: Yijia Cui <yijia.cui@databricks.com> GitOrigin-RevId: 0a22e06162b1ee6747d0b55da1871fa1f142b56d
Add unit test for aggregate expression. Unit test. Author: Yijia Cui <yijia.cui@databricks.com> GitOrigin-RevId: 77636d544abfb53dda95b9dffcc3bb15474e38cc
- Enable temp views with Spark 3.1.1 for Delete and Update - Tests in Update and Delete Author: Pranav Anand <anandpranavv@gmail.com> GitOrigin-RevId: e3d34029be093ae960e7c3de0abca83bfedcc9e6