update fork #9

Merged
merged 48 commits into from
May 1, 2021

Conversation

JassAbidi
Owner

No description provided.

linhongliu-db and others added 30 commits April 5, 2021 11:06
Author: Linhong Liu <linhong.liu@databricks.com>

(cherry picked from commit 3d79e78ee2fd05936ffb87b67ec1039e26257ba5)
Signed-off-by: Ubuntu <ubuntu@ip-10-110-16-101.us-west-2.compute.internal>
GitOrigin-RevId: 6524e412ea882e46e3d15db4a9ba22eee6eec125
This PR adds a test so that we can detect #618

Author: Tathagata Das <tathagata.das1565@gmail.com>

GitOrigin-RevId: 1af03b03f64c607c8f61b41eef678f3a72355ad7
Author: herman <herman@databricks.com>

GitOrigin-RevId: b03bd9be547625516b0cef80522322af85a4d05a
Author: Tathagata Das <tathagata.das1565@gmail.com>

GitOrigin-RevId: 1cc8f84c0c5a3c04910feed934d3ad66b869ae77
Author: yaohua <yaohua.zhao@databricks.com>

GitOrigin-RevId: bd103eeeb672424b8133aea732b9d66381f4d77f
Author: liwensun <liwen.sun@databricks.com>

GitOrigin-RevId: d48e075d5b6cdc55bf2b139016b3399ecfccf10c
…ocol specification

Add missing fields in the RemoveFile of the protocol specification

Closes #613

Signed-off-by: Rahul Mahadev <rahul.mahadev@databricks.com>

Author: fvaleye <florian.valeye@gmail.com>

#19691 is resolved by rahulsmahadev/yhhem4p5.

GitOrigin-RevId: bf3b646a83d830be1b04d83e2cdb566f744dfd39
Before this PR, when creating a checkpoint, we used the snapshot established at transaction start and created a checkpoint for that version. This could cause multiple transactions to checkpoint the same version and potentially lead to corrupted checkpoint status. This PR instead checkpoints the version committed by the transaction, which avoids that scenario. The downside is that we occasionally pay an extra cost to get the new snapshot. We expect this to be rare, so the cost is tolerable.
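
As a rough illustration of the difference (a minimal Python sketch, not the actual Delta code; `Snapshot` and both helper functions are hypothetical names invented here), consider two transactions that start from the same snapshot:

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical stand-in for a table snapshot at a given version."""
    version: int


def version_to_checkpoint_old(snapshot_at_start: Snapshot, committed_version: int) -> int:
    # Old behaviour: checkpoint the version the transaction started from.
    return snapshot_at_start.version


def version_to_checkpoint_new(snapshot_at_start: Snapshot, committed_version: int) -> int:
    # New behaviour: checkpoint the version this transaction committed.
    return committed_version


# Two transactions both start from version 9 and commit versions 10 and 11.
start = Snapshot(version=9)
old_targets = [version_to_checkpoint_old(start, v) for v in (10, 11)]
new_targets = [version_to_checkpoint_new(start, v) for v in (10, 11)]

print(old_targets)  # both transactions would checkpoint the same version
print(new_targets)  # each transaction checkpoints its own committed version
```

With the old rule both transactions target the same version, which is the collision described above; with the new rule the targets are distinct.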

Existing tests.

Author: Meng Tong <meng.tong@databricks.com>

GitOrigin-RevId: 1858f58f69e0618924709631b67f81fe1d0b863d
Currently Delta allows some safe type changes, such as from SMALLINT to INT.

However, this may break the generated column contract. For example, let's say we have a column c1, and a generated column c2 defined as `CAST(hash(c1 + 32767s) AS SMALLINT)`. When c1's type is SMALLINT and we insert `32767s`, the expression will return 31349, but if c1's type is INT and we insert `32767`, the expression will return 9876. This means changing the column type may require rewriting the existing data. Since that is too heavy, we simply disallow it.
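
One way to see why the results differ is to look at the value that reaches `hash`. The Python sketch below does not reproduce Spark's hash function; it only models the addition, under the assumption that integer overflow wraps around in two's complement (as in non-ANSI mode). `wrap_signed` is a helper name made up for this illustration.

```python
def wrap_signed(value: int, bits: int) -> int:
    """Wrap an integer into the signed two's-complement range of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    # Values with the top bit set represent negative numbers.
    return value - (1 << bits) if value >= 1 << (bits - 1) else value


# c1 + 32767 evaluated with 16-bit (SMALLINT) and 32-bit (INT) arithmetic.
smallint_sum = wrap_signed(32767 + 32767, 16)
int_sum = wrap_signed(32767 + 32767, 32)

print(smallint_sum, int_sum)
```

Under that assumption the two widths feed different values into the hash, so a generated column computed before the type change no longer matches one computed after it.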

New unit test

Author: Shixiong Zhu <zsxwing@gmail.com>

GitOrigin-RevId: 93a0475fed83ec6a751e4b05902aec3fa71410a5
Improve vacuum logging

Add unit test

Author: Rahul Mahadev <rahul.mahadev@databricks.com>

GitOrigin-RevId: 44ffdb72030de6ac6aadb9590238effe43bbaf4d
Converting all the predicates into CNF may result in a very long predicate, making the generated code unnecessarily large.
We should follow the approach of apache/spark#29101, which extracts all convertible predicates gracefully.
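
To give a feel for the blow-up (a small Python sketch, unrelated to the Spark implementation; `cnf_of_dnf` is a hypothetical helper), distributing OR over AND for a disjunction of `n` two-literal conjunctions yields `2**n` clauses:

```python
from itertools import product


def cnf_of_dnf(dnf):
    """Distribute OR over AND.

    `dnf` is a list of conjunctions, each a list of literal names. The result
    is a list of clauses, each a set of literals joined by OR.
    """
    return [frozenset(choice) for choice in product(*dnf)]


# (a0 AND b0) OR (a1 AND b1) OR (a2 AND b2) OR (a3 AND b3)
dnf = [[f"a{i}", f"b{i}"] for i in range(4)]
clauses = cnf_of_dnf(dnf)

print(len(dnf), "conjunctions ->", len(clauses), "CNF clauses")
```

Four conjunctions already produce sixteen clauses, and each additional conjunction doubles the count.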

Author: Gengliang Wang <gengliang.wang@databricks.com>

GitOrigin-RevId: 9ecedbd83f85b8235262c3de0bd83b4540cbc560
Author: Gengliang Wang <ltnwgl@gmail.com>
Author: Gengliang Wang <gengliang.wang@databricks.com>
Author: Wenchen Fan <cloud0fan@gmail.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Kris Mok <kris.mok@databricks.com>

GitOrigin-RevId: 1d6c55bcd5232af4a0a486dfc3bb97afcc439f1f
…n DeltaFileOperations

Added two methods, `recursiveListFrom` and `localListFrom`, into `DeltaFileOperations`.

Currently, the only way to list files is to specify a directory path, after which all files under that directory are listed. This is wasteful when we only want _new_ files after a certain filename, because we have to re-list the entire directory and filter afterwards.

These two methods let you specify a directory and a path in that directory from which to start listing. Then, taking advantage of `LogStore.listFrom`, only files with filenames after the specified path are returned.
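
The idea can be sketched in Python over an in-memory, lexicographically sorted listing. This is not the `DeltaFileOperations` code: `list_from` is a hypothetical helper, and treating the start name as inclusive is an assumption of the sketch.

```python
from bisect import bisect_left


def list_from(sorted_names, start_name):
    """Return the names that sort at or after `start_name`.

    `sorted_names` must already be in lexicographic order, so the start
    position is found by binary search instead of scanning from the top.
    """
    return sorted_names[bisect_left(sorted_names, start_name):]


# Zero-padded names sort lexicographically in version order.
names = [f"{version:020d}.json" for version in range(5)]
newer = list_from(names, f"{3:020d}.json")

print(newer)
```

Only the entries from version 3 onward are returned, and the earlier names are never filtered one by one.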

- Added tests in `DeltaFileOperationsSuite`.

Author: Howard Xiao <howard.xiao@databricks.com>

GitOrigin-RevId: af5712e60224835cf00e247c6a9f740e04126968
Authored-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit be888b27edfbb0d7ebb2265de1bf74acb8d3d09a)

Author: Wenchen Fan <wenchen@databricks.com>

GitOrigin-RevId: ff2bea17c03ac092693c1db4610d800889b8ea49
…mited clauses

Schema evolution didn't originally work with unlimited clauses. We need a test to prevent this from regressing in the future.

Right now there's just a Scala suite test, since the unlimited clauses test harness doesn't support evolution and the evolution test harness doesn't support unlimited clauses, so it's complicated to write a test that the ACL and CDC extensions of the SQL merge suite will correctly be able to skip.

new unit test

Author: Jose Torres <joseph.torres@databricks.com>

GitOrigin-RevId: 4e492459966fafb23d1d5b3d4fea95656d02cf55
Add Delta Lake cheat sheet to `/examples/cheat_sheet/`.

Closes #628

Co-authored-by: Brenner Heintz <brenner.heintz@gmail.com>
Signed-off-by: Meng Tong <meng.tong@databricks.com>

Author: brennerh1 <65046554+brennerh1@users.noreply.github.com>

GitOrigin-RevId: 1086666e5ccf56841f8c5e32d94af61ca14913ff
Author: Lars Kroll <lars.kroll@databricks.com>

GitOrigin-RevId: 33a4fcdf50af72096e800bc5b2b4cc45476cb735
Author: Rahul Mahadev <rahul.mahadev@databricks.com>

GitOrigin-RevId: e7e94e70ff2795a81ec73f12ba0a91ccf730056c
…for unlimited clauses"

Author: Stefan Zeiger <stefan.zeiger@databricks.com>

GitOrigin-RevId: 5609916eafa039d5c76f5931ed38dc612cb5c231
## What changes were proposed in this pull request?
 - Migrate to use Spark 3.1.1 adding tests and refactors

## How was this patch tested?
 - Existing tests

Author: Pranav Anand <anandpranavv@gmail.com>

#19552 is resolved by pranavanand/pa-delta311migration.

GitOrigin-RevId: fd3b86468f07752fbd3d56f653aec683af05d0b4
Author: Meng Tong <meng.tong@databricks.com>
Author: Meng Tong <77353730+mengtong-db@users.noreply.github.com>

GitOrigin-RevId: 94ea828441cd90856841af4e8c76f75bd485f6b4
Author: Shixiong Zhu <zsxwing@gmail.com>

GitOrigin-RevId: 4f7d0a28f25f3fcfe7097402f7398e9044139526
Rename Change Data Capture to Change Data Feed

Ran existing tests

Author: Rahul Mahadev <rahul.mahadev@databricks.com>

GitOrigin-RevId: e74e8f0d15e9ccdd733522c53a53fcc79eea90a0
Signed-off-by: Jacek Laskowski <jacek@japila.pl>

Closes #637

Signed-off-by: Meng Tong <meng.tong@databricks.com>

Author: Jacek Laskowski <jacek@japila.pl>

#20259 is resolved by mengtong-db/n91aik90.

GitOrigin-RevId: 1f5defe5a0da23234a35af929274a03e282c5c33
I know it's a tiny thing, but someone has got to fix it someday :)

Closes #596

Signed-off-by: Yijia Cui <yijia.cui@databricks.com>

Author: Antonio <tomrubybarreto@gmail.com>

#20377 is resolved by yijiacui-db/6zl3xm8p.

GitOrigin-RevId: b3b80dc7215f0fc3fcc216cdb766fe66af816304
Author: Tathagata Das <tathagata.das1565@gmail.com>

GitOrigin-RevId: d7f6bbc3168107eaba2942826bb945c9f6757737
That should lower memory requirements (as no extra objects are created) and improve readability

Signed-off-by: Jacek Laskowski <jacek@japila.pl>

Closes #638

Signed-off-by: Meng Tong <meng.tong@databricks.com>

Author: Jacek Laskowski <jacek@japila.pl>

#20440 is resolved by mengtong-db/gnggx483.

GitOrigin-RevId: 1bc6619592a6bbb456112933f0181ed6e61078d2
Expect analysis exception for window functions in merge and update.

Unit tests.

Author: Yijia Cui <yijia.cui@databricks.com>

GitOrigin-RevId: a407868f4d87e8d82119796aa3742edd5a8438ec
…ted Columns

Currently we store the generation expressions in the column metadata of the table schema. However, when Spark reads the schema from a table, it also propagates column metadata to downstream operations. For example, suppose table X contains generated columns. The following command will create a new table whose column metadata contains the generation expression. This happens in all DBR versions when reading generated column tables.

```
CREATE TABLE Y AS SELECT * FROM X
```
This is not expected.

This PR removes the generation expressions from the column metadata before giving the schema to Spark, so that the generation expressions won't be leaked to downstream operations.

However, for old DBR versions, especially the EOS versions, the metadata propagation behavior still exists. But since old DBR versions that don't support generated columns have an old writer version (they don't support writer version 4), we can change the definition of generated columns to:

A table has generated columns only if its min writer version >= 4 and some of its columns contain generation expressions in the metadata.

With this definition, tables containing generation expressions but created by old DBR versions will be treated as normal tables, and none of the generated column code paths should be triggered when reading or writing such tables.
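
The definition and the metadata stripping can be sketched as follows. This is a simplified Python model, not the Delta implementation: columns are plain dicts, the function names are hypothetical, and the metadata key `delta.generationExpression` is an assumption made for the example.

```python
# Assumed metadata key; treat it as illustrative only.
GENERATION_EXPRESSION_KEY = "delta.generationExpression"


def has_generated_columns(min_writer_version, columns):
    """Apply the definition above: writer version >= 4 AND some column carries an expression."""
    return min_writer_version >= 4 and any(
        GENERATION_EXPRESSION_KEY in col.get("metadata", {}) for col in columns
    )


def strip_generation_expressions(columns):
    """Return copies of the columns without generation expressions in their metadata."""
    return [
        {**col, "metadata": {k: v for k, v in col.get("metadata", {}).items()
                             if k != GENERATION_EXPRESSION_KEY}}
        for col in columns
    ]


columns = [
    {"name": "c1", "metadata": {}},
    {"name": "c2", "metadata": {GENERATION_EXPRESSION_KEY: "c1 + 1"}},
]

print(has_generated_columns(4, columns))  # new writer: treated as generated
print(has_generated_columns(3, columns))  # old writer: treated as a normal table
print(has_generated_columns(4, strip_generation_expressions(columns)))
```

A table written with an older writer version is treated as a normal table even if leaked expressions are present in its column metadata.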

New unit tests.

Author: Shixiong Zhu <zsxwing@gmail.com>

GitOrigin-RevId: e3116c0d16c9f868aba03326efd82b72f7971b2c
Author: Wenchen Fan <wenchen@databricks.com>

GitOrigin-RevId: f274112da57ebd3397d481b8525584542686b6d0
yijiacui-db and others added 18 commits April 19, 2021 11:08
SHOW CREATE TABLE isn't supported in Spark 3.1. The unit test should catch the exception instead of expecting a correct result.

Unit test.

Author: Yijia Cui <yijia.cui@databricks.com>

GitOrigin-RevId: fc94772eb136d48f748fa37d5cca9879027e46cf
Author: Meng Tong <meng.tong@databricks.com>

GitOrigin-RevId: 13ccf8974ae9df9a142dd7a87d1ee3c902327b0c
…ndefined)

Since `None` is used for the end boundary of a delta table history it could also easily be "transferred" up the call chain and be the default input value. That's the purpose of the PR.

Signed-off-by: Jacek Laskowski <jacek@japila.pl>

Closes #633

Signed-off-by: Yijia Cui <yijia.cui@databricks.com>

Author: Yijia Cui <yijia.cui@databricks.com>
Author: Jacek Laskowski <jacek@japila.pl>

#20448 is resolved by yijiacui-db/1oyk8ltj.

GitOrigin-RevId: dfda5f3ed910ba3b607111bed87009caf19ba8fe
…from being resolved twice

If, for any reason, a DeltaMergeInto generated from the Scala API with an `updateAll()` and schema evolution enabled goes through reference resolution twice, it can throw errors because
- The first resolution will expand `star` in the plan to `x = source.x` assignments for all columns in source. This may include columns that are in the source but not yet in the target.
- The second resolution, currently, can now throw an error because it does not know

The obvious way to solve this is to make the resolution idempotent: the plan undergoes reference resolution only if `plan.resolved` is false. However, that does not handle rare corner cases. The Scala API can generate a DeltaMergeInto where all the expressions are already resolved. If we add the conditional check for `plan.resolved`, then in those cases with pre-resolved expressions, DeltaMergeInto may never go through the reference resolution phase and would skip a lot of additional checks besides resolution. This can produce incorrect plans containing wrong target column names: since the target column names are stored as Seq[String] and not as expressions, plans that contain only resolved expressions but incorrect column names would be considered already resolved.

To get around this, the solution in this PR is to add a boolean field `targetColNameResolved` that explicitly represents whether the target column has gone through resolution or not. This `targetColNameResolved` is considered in `expression.resolved` and is set to false by default when generated by the Scala API. This forces all plans to go through the resolution phase, as DeltaMergeInto.resolved will always be false even if all the expressions are resolved. In the resolution logic, the checks on the target columns are done only when `targetColNameResolved` is false, and after the check it is set to true. This makes the checks robust across multiple passes: once the star has been expanded to columns and the boolean is set to true, future passes skip the checks.
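
The flag-based approach can be modelled with a heavily simplified Python sketch. It is not the Catalyst code: `MergePlan`, `resolve`, and their fields are hypothetical stand-ins, and only the target-column check is modelled.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class MergePlan:
    """Hypothetical, simplified stand-in for DeltaMergeInto."""
    target_columns: Tuple[str, ...]
    update_columns: Tuple[str, ...]  # ("*",) stands for updateAll()
    expressions_resolved: bool = True  # the Scala API may pre-resolve expressions
    target_col_name_resolved: bool = False  # false by default, forcing one pass

    @property
    def resolved(self) -> bool:
        return self.expressions_resolved and self.target_col_name_resolved


def resolve(plan, source_columns, schema_evolution):
    if plan.target_col_name_resolved:
        return plan  # later passes skip the target column checks
    # Expand the star into one assignment per source column.
    cols = source_columns if plan.update_columns == ("*",) else plan.update_columns
    if not schema_evolution:
        unknown = [c for c in cols if c not in plan.target_columns]
        if unknown:
            raise ValueError(f"columns not in target: {unknown}")
    return replace(plan, update_columns=tuple(cols), target_col_name_resolved=True)


plan = MergePlan(target_columns=("id",), update_columns=("*",))
once = resolve(plan, ("id", "new_col"), schema_evolution=True)
twice = resolve(once, ("id", "new_col"), schema_evolution=True)

print(plan.resolved, once.resolved, twice is once)
```

Because the flag starts as false, even a plan with pre-resolved expressions is checked once, and a second pass returns the plan unchanged instead of re-checking the expanded columns.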

Note: The ideal solution here is to unify the SQL and Scala code paths by having the Scala API generate a MergeIntoTable, which in one shot gets converted into a fully resolved DeltaMergeInto, thus eliminating the possibility of another resolution attempt. We cannot do that now because the Assignment class in MergeIntoTable cannot distinguish between `Assignment(<star>)` and `Assignment(<no-columns-to-update>)`. This is important because the Scala API can generate the latter (the SQL API cannot). This needs to be fixed in Apache Spark, so it will not be available until Spark 3.2. I have added this contextual information as inline docs for future development.

Added a test with a function that failed without this change.

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Tathagata Das <tdas@databricks.com>

GitOrigin-RevId: d72339b7b48671016b919f5e0f5bb268732fbc68
Author: Ali Afroozeh <ali.afroozeh@databricks.com>

GitOrigin-RevId: 2805d80e6cc1953c45117c1a111bf66de805006c
Signed-off-by: Jacek Laskowski <jacek@japila.pl>

Closes #641

Signed-off-by: Yijia Cui <yijia.cui@databricks.com>

Author: Jacek Laskowski <jacek@japila.pl>

GitOrigin-RevId: d03067231e3b2f73fc32d76f0feb19622d9968b4
…mited clauses

Schema evolution didn't originally work with unlimited clauses. We need a test to prevent this from regressing in the future.

n/a test only PR

Author: Jose Torres <joseph.torres@databricks.com>

GitOrigin-RevId: 6ba537a531fe591b8fbb8f2a1e03fc242c8f88ab
…s public in Scala

Make the APIs related to concurrent modification exceptions public in Scala.

Unit tests.

Author: Yijia Cui <yijia.cui@databricks.com>

GitOrigin-RevId: a47a79db4a5d5bb799dc3510a41d7a0777eb54de
Author: Rahul Mahadev <rahul.mahadev@databricks.com>

GitOrigin-RevId: f4210034290eb7c9ab6cca69529c555ef37d9819
…l incompatibility message.

Use "change data feed" in the public-facing protocol incompatibility message.

n/a

Author: Jose Torres <joseph.torres@databricks.com>

GitOrigin-RevId: 219e33773ce7f06b56d879aff521f544395f1ecb
Drop the cdc field from the checkpoint file. (Note that the actual CDC actions are already filtered out of the snapshot state in InMemoryLogReplay - right now this column is always null.)

new unit test

Author: Jose Torres <joseph.torres@databricks.com>

GitOrigin-RevId: 8567d09a99b8fbba053930dff5695e0e67238961
Author: Shixiong Zhu <zsxwing@gmail.com>

GitOrigin-RevId: ef04d52ba1110134cb0eccf54d77342705c2c6f8
…ted columns.

Fix MERGE INTO evolution for partial updates of nested columns. We need to pass the flag to permit struct evolution, even though the UPDATE operation doesn't actually reference the new columns, because generateUpdateExpressions will implicitly generate them in order to produce one update action per target column.

new unit test

With this change, this use case works instead of throwing an error.

Author: Jose Torres <joseph.torres@databricks.com>

GitOrigin-RevId: 5a4b68082bb329822d4361b7e4a764ef061cf878
Making Delta a multi-module project will enable us to add other sub-modules. For example, we can then add a contribs sub-module for community contributions that need to be very closely tied to the delta-core project (hence in this repo, and not in delta/connectors) but do not have the same level of maturity as delta-core.

Changes made in the Delta repo:
- Moved all files in root/src/ to root/core/src/
- Updated build.sbt to define multiple modules
  - Removed dependency on spark-packages.

existing tests

Closes #644

Author: Tathagata Das <tathagata.das1565@gmail.com>

GitOrigin-RevId: 68038d27302e82f6e680fe717633109757e48ba0
…f bintray

As the title says

Manually published to Sonatype staging.

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Tathagata Das <tdas@databricks.com>

GitOrigin-RevId: e4d76cf07334e20dd0ef4238430690944df01189
Make the APIs related to concurrent modification exceptions public in Python.

Unit tests.

Author: Yijia Cui <yijia.cui@databricks.com>

GitOrigin-RevId: 0a22e06162b1ee6747d0b55da1871fa1f142b56d
Add unit test for aggregate expression.

Unit test.

Author: Yijia Cui <yijia.cui@databricks.com>

GitOrigin-RevId: 77636d544abfb53dda95b9dffcc3bb15474e38cc
 - Enable temp views with Spark 3.1.1 for Delete and Update

 - Tests in Update and Delete

Author: Pranav Anand <anandpranavv@gmail.com>

GitOrigin-RevId: e3d34029be093ae960e7c3de0abca83bfedcc9e6
@JassAbidi JassAbidi merged commit eee16c7 into JassAbidi:master May 1, 2021