Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update fork #2

Merged
merged 56 commits into from
Apr 1, 2020
Merged

update fork #2

merged 56 commits into from
Apr 1, 2020

Conversation

JassAbidi
Copy link
Owner

No description provided.

rahulsmahadev and others added 30 commits January 3, 2020 10:54
Fix package names for DML commands Java suite to have package `test.org.apache.spark.sql.delta` instead of `test.com.databricks.sql.transaction.tahoe`

Ran test builders

Closes #295
Closes #290

Author: Rahul Mahadev <rahul.mahadev@databricks.com>

GitOrigin-RevId: 92f0ab08e238b182c76c24456fe276a79e4f82e1
Allow file names to include = when convert to delta

Signed-off-by: Yishuang Lu <luystu@gmail.com>

Closes #264

Author: lys0716 <luystu@gmail.com>

#7514 is resolved by zsxwing/2s7xlu95.

GitOrigin-RevId: cbbbc42443fa277738c98c794f6d4c41403a3a3e
When replaying a log for an uncheckpointed Delta table, we first compute an empty snapshot. While this is correct, it does nearly double the time spend on the executor for these tables.

This PR changes this by using the InitialSnapshot class for the initial replay. This class has been modified to make sure it avoids expensive computation (constructing dataframes, caching & executing jobs).

Added UT to `DeltaSuite`.

Author: herman <herman@databricks.com>
Author: Youngbin Kim <youngbin.kim@databricks.com>

GitOrigin-RevId: 5cb7fe0517f3a8f9c032657cc27851ea15da0c62
…nal catalog fails

Catch `AnalysisException` in `ConvertToDelta` when checking whether `databaseExists` to try to use path instead of throwing exception

Author: Pranav Anand <anandpranavv@gmail.com>

GitOrigin-RevId: b882d316f28f58cd9b41d95eb1f62a610b5d5f6a
Closes #211

Author: Anurag870 <f2005870@gmail.com>

#7544 is resolved by zsxwing/lfpuuecc.

GitOrigin-RevId: 76f108cedc7ba486c97a016ed34954a673e3e15b
Remove `reservoirScheme` which is never used.

Closes #237

Author: Andrew Fogarty <andrew.f.fogarty@gmail.com>

GitOrigin-RevId: 9115b0a6d20f5415367918b768c2ba26a2dae5bb
We have special logic to handle partition filters and time travel specifications being part of a path to load a Delta table. This logic exists in createRelation of the V1 relation of Delta. This PR refactors some of this logic to prepare for Spark 3.0.

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

GitOrigin-RevId: a299b927ded7a1fb904ca8af9ee4d825e79ffb65
Added Describe history metrics for Truncate Command

Author: Rahul Mahadev <rahul.mahadev@databricks.com>

GitOrigin-RevId: 3faea5cf832937a203fd995d8104ef8e415c7ce1
…e validation

Author: Shixiong Zhu <zsxwing@gmail.com>

GitOrigin-RevId: 0b7170ac0bf9d89565d1efb82d51134738d64932
… is a path or not

## What changes were proposed in this pull request?
 - New trait `ResolveTableIdentifier` which can be used by commands to determine if a `TableIdentifier` refers to a path based table or a table from a metastore
 - Make `ConvertToDelta` use `ResolveTableIdentifier`

## How was this patch tested?
 - This PR does not add functionality. Existing tests

Author: Pranav Anand <anandpranavv@gmail.com>

#7569 is resolved by pranavanand/pa-istableorpath.

GitOrigin-RevId: 76e52d04fe3a140983872f909081e573a916b466
What were proposed in this Pull Request?

1. Scala doesn't require `Return` keyword.
2. Unwanted parentheses has been removed.
3. Curly braces removed for variable, it's need for expression.
4. map - lamdba function doesn't need anonymous function. Example - `val replacedDb = database.map(quoteIdentifier(_))`.
5. `new` keyword is only required for class not for case class, it automatically executes `apply` method.
6. Collection type check, `!columns.isEmpty` has changed to `columns.nonEmpty` for better readability.
7. dataframe object declaration from `var` change to `val` for immutability.

How this pull request is tested?
* ran `sbt compile` to check compile time errors. PASSED
* Checked circle-ci log, could not see any exception.

Looking forward for your review comments. @brkyvz

Closes #286

Author: chet <ckhatrimanjal@gmail.com>

#7611 is resolved by brkyvz/flxve6u6.

GitOrigin-RevId: 531ce1f4b69ab31cf91c7c95d10fef362fd7f349
## What changes were proposed in this pull request?

This PR adds metrics for Streaming Updates via Describe History.

## How was this patch tested?
Added a test in `DescribeHistorySuite`

Author: Rahul Mahadev <rahul.mahadev@databricks.com>

#7493 is resolved by rahulsmahadev/historyStatsStreamingup.

GitOrigin-RevId: d108c4b90e854188d920e3075f4d100db326e288
Author: Xiao Li <gatorsmile@gmail.com>

GitOrigin-RevId: 90ed86bc78b461b9e96fe01511f125a47dc7b3d5
…reDeltaScan

This PR adds a SQLConf

Author: Ali Afroozeh <ali.afroozeh@databricks.com>

GitOrigin-RevId: 4ee5f1f5837e85d4344bbc96eadb87bec284591b
## What changes were proposed in this pull request?
This PR adds metrics for OPTIMIZE Command via Describe History.

## How was this patch tested?
Added unit tests in `DescribeHistorySuiteEdge`

Closes #7492 from rahulsmahadev/historyStatsOptimize.

Authored-by: Rahul Mahadev <rahul.mahadev@databricks.com>
Signed-off-by: Rahul Mahadev <rahul.mahadev@databricks.com>
GitOrigin-RevId: 1becff31f20533c133fb82cc503bc58f8d6f501b
## What changes were proposed in this pull request?
This PR adds Describe History metrics for Update and Delete

## How was this patch tested?
added tests in `DescribeHistorySuite`

Author: Rahul Mahadev <rahul.mahadev@databricks.com>

#7494 is resolved by rahulsmahadev/historyStatsUpdateDelete.

GitOrigin-RevId: 913dd271f0ecf6976582f74ac55c6981c54b2b22
- Made sphinx throw all warnings as errors. Sphinx tends to mark a build successful even if there are major issues (e.g., import not found) that cause the contents of built docs to be invalid. These issues shows up as warning, and converting them to errors allows them to be caught in the CI/CD loop early on.
- Fixed indentation issues in docs.
- Ignore pyspark not being present during python doc generation.

Closes #300

Author: Tathagata Das <tathagata.das1565@gmail.com>

#7698 is resolved by tdas/nnjju8mt.

GitOrigin-RevId: 86e3d5ac9a7f6e2698702138d3735f817177ecca
Here are the two bugs fixed.

1. Insert condition should be resolved only with source and not with source+target. This was because clause conditions were being resolved with the full `merge` plan (i.e., with both source and target output attribs) independent of the clause type. This PR fixes it by using the `planToResolveAction` to resolve the references of the condition, which is customized for each clause type.

2. Fix for bug #252 where incorrect columns were being printed as valid columns names. In the code, `plan.references` were being printed as valid column names. This is wrong because `references` includes invalid column names as well. This PR fixes it by using the output attributes of `plan.children` which are the only valid column names that can be referred to.

Updated unit tests to verify the presence/absence of valid/invalid column names.

Closes #252

Closes #303

Author: Tathagata Das <tathagata.das1565@gmail.com>

#7702 is resolved by tdas/u9qiqvwd.

GitOrigin-RevId: 26a1458dc3deb05f2398b0c5daec3d4ef5a9a1a7
## What changes were proposed in this pull request?
 - Change `findColumnPosition` for `MapType` to be able to tell if position needed is for a key or for a value and support `ArrayType`
 - Change `addColumn` to support `ArrayType` and `MapType`

## How was this patch tested?
 - Add tests for `MapType` and `ArrayType`

Author: Pranav Anand <anandpranavv@gmail.com>

#7638 is resolved by pranavanand/pa-30-altertable-addcolumn.

GitOrigin-RevId: 755b00b3d788ea24f179cef0220ee7d8d4eb3e94
…Apache Merge classes

## What changes were proposed in this pull request?

Renamed all Merge* classes to DeltaMerge* classes.

## How was this patch tested?

Existing tests

Author: Tathagata Das <tathagata.das1565@gmail.com>

#7763 is resolved by tdas/SC-28544.

GitOrigin-RevId: 5f19f332601804befb0f1f83c45c34f2601eaee6
Added operation metrics for `FSCK` command

Author: Rahul Mahadev <rahul.mahadev@databricks.com>

GitOrigin-RevId: 17853fe70a302c6fd499fa44e6432b167fe38b16
GitOrigin-RevId: d6f628d292243095c2e6a668d1df6aecf20c7a28
## What changes were proposed in this pull request?

added operation metrics for `CREATE TABLE`.

Note: our SQLMetrics framework will allow us to siphon the metrics directly from the underlying `WRITE`

## How was this patch tested?
Added unit test in `DescribeDeltaHistorySuite`

Author: Rahul Mahadev <rahul.mahadev@databricks.com>

#7771 is resolved by rahulsmahadev/historyCreateTable.

GitOrigin-RevId: 723425ad1a03d674bfb93222e6a3309088c59cc2
declare getBasePath in TahoeFileIndex

Author: Tom van Bussel <tom.vanbussel@databricks.com>

GitOrigin-RevId: 3c0e8be777db0ebe123cccc7baf8ca35e09a1481
## What changes were proposed in this pull request?
This PR re-enables one of the initial snapshot tests that was disabled during the last batch merge. It turns out that the payload of a `SparkListenerJobStart` event slightly changed, it now does not contain any stage info when the job is empty.

## How was this patch tested?
It is a test.

Author: herman <herman@databricks.com>

#7860 is resolved by hvanhovell/SC-28546.

GitOrigin-RevId: 2aa3174b1d1c13243c6133dde3f120827b9c7642
…data from unpartitioned tables

fixes #275

Closes #277

Closes #7931 from tdas/1u0hfort.

Lead-authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Co-authored-by: Tathagata Das <tdas@databricks.com>
Co-authored-by: lswyyy <228930204@qq.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
GitOrigin-RevId: fa35d0ea76e2973e84134e34e901be50d17e0f39
## What changes were proposed in this pull request?
Added row level metrics for Delete

## How was this patch tested?
added unit tests

Author: Rahul Mahadev <rahul.mahadev@databricks.com>

#7840 is resolved by rahulsmahadev/rowLevelDelete.

GitOrigin-RevId: 7f5629ebb92534f55458e1449e17713c4244c526
Author: Burak Yavuz <brkyvz@gmail.com>

GitOrigin-RevId: 954c109b114cf59b76fa5fb6b6d9adcff5b95e39
## What changes were proposed in this pull request?
Adding row level metrics for UpdateTableCommand in Delta. this PR will add instrumentation to track the number of updated rows and the number of copied rows during a Delta Update.

## How was this patch tested?
Added unit tests.

Author: Rahul Mahadev <rahul.mahadev@databricks.com>

#7821 is resolved by rahulsmahadev/rowLevelUD.

GitOrigin-RevId: b1f99cd05a215675c19c9804bc9751ef0ba487e4
## What changes were proposed in this pull request?
Added operation metrics for `CONVERT TO DELTA`

-> Additionally added a missing field in `STREAMING UPDATE`

## How was this patch tested?
added a test in `DescribeDeltaHistorySuite`

Author: Rahul Mahadev <rahul.mahadev@databricks.com>

#7767 is resolved by rahulsmahadev/historyConvert.

GitOrigin-RevId: ad1354fef0dff1f4f8d734318de606cb3e0450c3
dbaliafroozeh and others added 26 commits February 26, 2020 12:40
This PR introduces the `MetadataGetter` trait, which  provides an interface for getting the metadata information of a Delta log. Also, the `StateMetadataGetter` class is added that implements the `MetadataGetter` interface and extracts the metadata information from the state of a Delta log (`Dataset[SingleAction]`).

It's a refactoring, passes the existing tests.

Author: Ali Afroozeh <ali.afroozeh@databricks.com>

GitOrigin-RevId: dbe250c55567892f4a13d2ab911a5e53f43cad76
Introduces a `compositeId`, which should be used in cases where we compare two Delta tables.

Regression test

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
GitOrigin-RevId: e62a1d1a2ccd92d4b67d7edad1fb80c4ba7a5497
## What changes were proposed in this pull request?

Purpose of this PR is to enable Operation metrics by default and fix the tests that fail when they are enabled by default.
## How was this patch tested?

Author: Rahul Mahadev <rahul.mahadev@databricks.com>

#7841 is resolved by rahulsmahadev/historyEnableDefault.

GitOrigin-RevId: ee67cd36f6de63b86d46e46ad4f657d8741a877d
## What changes were proposed in this pull request?
 - WIP PR. Remaining changes are as follows:
   - In order to support adding comments on array/map columns and adding comments to primitive types on an array's element or a map's key/value, may need to use `transformColumn` or do something other than use `dropColumn` in its current form
 - Change `dropColumn` to support `ArrayType` and `MapType`

## How was this patch tested?
 - Tests added to `DeltaAlterTableTests`

Author: Pranav Anand <anandpranavv@gmail.com>

#7808 is resolved by pranavanand/pa-30-altertable-changecolumn.

GitOrigin-RevId: 4c0b02dc28b6119a77463327ecc36df89c42393a
## What changes were proposed in this pull request?

Moves the snapshot management piece in DeltaLog to a specific trait.  DeltaLog code is a bit leaner now.

## How was this patch tested?

This PR is a refactoring, passes existing tests.

Author: Ali Afroozeh <ali.afroozeh@databricks.com>

#7880 is resolved by dbaliafroozeh/IntroduceSnapshotEdge.

GitOrigin-RevId: a5b108c7f648c18b306ed64c5f26806f5ae914e6
GitOrigin-RevId: 8ac5d37ca3bfeb0509f5ca3c763eb4786f2f8df3
This commit adds the ID of the table being written to in schema mismatch error messages.

Closes #322

GitOrigin-RevId: 6523ca6eb81dd9ca12b9e97163fd520ff8dc839b
…verLogStore

Fixes #316

Closes #317

Author: Erik LaBianca <erik.labianca@gmail.com>

#7992 is resolved by zsxwing/8poe59z8.

GitOrigin-RevId: 4e2306940262b3f942a8c325f494f22693e874b1
## What changes were proposed in this pull request?

This PR refactors the newly introduced SnapshotManagement and MetadataGetter classes, as well as where checksums are verified. We were missing a case, where `getSnapshotAt` was not verifying checksums. The checksum verification has been moved into `MetadataGetter`, since this class is responsible for computing the statistics around a Delta table, and is the thin-waist where we can perform this check.

We also unify some code in SnapshotManagement, and also introduce a method for computing the initial state of a Delta table when it is first loaded. This avoids a two step:
 1. Create initial empty snapshot, because checkpoint doesn't exist
 2. Now do a list and update the table
initialization and consolidates them into one, reducing total file system operations

## How was this patch tested?

Existing tests as this is a pure refactor

Author: Burak Yavuz <brkyvz@gmail.com>

#8112 is resolved by brkyvz/checksumRefactor.

GitOrigin-RevId: 5ad11307c39c5444b2feacb0ad74a9cf89711b1d
Existing tests

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

GitOrigin-RevId: c8fd5fa6cd8625412a550a2d64a3a2e8878c22eb
Author: Burak Yavuz <brkyvz@gmail.com>

GitOrigin-RevId: 34d1a54965dc953b8b6518cb33501e574dcc0445
Author: herman <herman@databricks.com>
Author: Herman van Hovell <herman@databricks.com>

GitOrigin-RevId: 512d1d6930f8a45d4249a6ffb5c4a93fe0a6d1ae
…ation

## What changes were proposed in this pull request?
This PR makes the SNAPSHOT PARTITIONS configurations an optional configuration. This helps us understand when users override it.

## How was this patch tested?

Unit tests

[SC-24988]: https://databricks.atlassian.net/browse/SC-24988

Author: Ali Afroozeh <ali.afroozeh@databricks.com>

#8208 is resolved by dbaliafroozeh/AutotuneNumPartitions.

GitOrigin-RevId: 1cb47b4dbcf96a0b9ce900d2c18d9b2df5b61c7e
## What changes were proposed in this pull request?

Spark UI will track all normal accumulators along with Spark tasks to show them on Web UI. However, the accumulator used by `MergeIntoCommand` can store a very large value since it tracks all files that need to be rewritten. We should ask Spark UI to not remember it, otherwise, the UI data may consume lots of memory (It keeps 1000 stages by default. It means we will keep such accumulator until there are 1000 new stages completed).

Hence, we use the prefix `internal.metrics.` to make this accumulator become an internal accumulator, so that it will not be tracked by Spark UI.

Note: this doesn't fix the issue entirely. `org.apache.spark.status.LiveTask` is used to track all running or pending tasks. The internal accumulators are still stored in this class until the task completes (`LiveTask` will be converted to `TaskDataWrapper` and internal accumulators will be dropped in this step. See `org.apache.spark.status.LiveEntityHelpers#newAccumulatorInfos`). This is fine since the accumulator is needed when tasks are active. However, **if spark events get dropped**, we may leak task completion events and internal accumulators will stay with these `stale` tasks forever.

## How was this patch tested?

The new test to verify the accumulator is not tracked by Spark UI.

Author: Shixiong Zhu <zsxwing@gmail.com>

#8222 is resolved by zsxwing/SC-29560.

GitOrigin-RevId: f9cb1244dabaa1c3f6a2d5128dcc13a86a9a3e6d
…rtitionDir method

The main goal of this simple Pull Request is to provide a simple enhancement of the existing code by avoiding to use a mutable dataframe variable in the ```withRelativePartitionDir``` method.

Closes #330

Author: mahmoud mahdi <mahmoudmahdi24@gmail.com>

#8329 is resolved by mukulmurthy/df4pjj4s.

GitOrigin-RevId: 5fa6aa0612c5818ccf7ef30320aaddd028e5c948
…` if possible

There are several places we can use the public `Encoder` class rather than the private concrete class `ExpressionEncoder`. This PR just fixes them.

Author: IonutBoicuAms <ionut.boicu@databricks.com>

GitOrigin-RevId: c8f981ab8f1fbac8e5dc25a0bb62725841482ec7
Author: Burak Yavuz <brkyvz@gmail.com>

GitOrigin-RevId: 5ebd0dd61de7fabb1064a104e587dd1add15464d
… the DeltaSource

Adds support for the option maxBytesPerTrigger to the DeltaSource. This will help the Delta Streaming source to process a soft-max of maxBytesPerTrigger each micro-batch.

Unit tests

Author: Burak Yavuz <brkyvz@gmail.com>

GitOrigin-RevId: 2e6b2f09a444dd03ec968096ae29a2e3c4bb2186
…hether there is a sub query

Use Spark's SubqueryExpression.hasSubquery to check whether there is a sub query so that it's working even if Spark adds more types of sub queries in future.

Authored-by: maryannxue <maryann.xue@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
GitOrigin-RevId: 46a16f2eaad9bc8b31ee7c0013b397783c8c6b85
…ng schema

This PR adds a new error message for REPLACE table. We will use this for Spark 3.0. Adding this here to reduce conflicts when we merge the Spark 3.0 branch.

Author: Burak Yavuz <brkyvz@gmail.com>

GitOrigin-RevId: f1bdb229ea52c206563c60b449a54a83581a5809
## What changes were proposed in this pull request?
 - The issue with the existing tests which caused the ticket to be filed was that though the docs link was added correctly (not hard coded, used the generateDocsLink method), it wasn't explicitly added to the tests therefore was not tested
 - This PR aims to make that less likely to happen i.e. if the link was added correctly, it will be tested

## How was this patch tested?
 - Changed existing test to check all links

Author: Pranav Anand <anandpranavv@gmail.com>

#8217 is resolved by pranavanand/pa-fix-docs-mergelink.

GitOrigin-RevId: f0af7308ddd2214455f995fed0e505350b86a38a
…cters

## What changes were proposed in this pull request?

This PR adds the same invalid character checks made on data columns to partition columns as well. Having different rules for different physical types of columns violates data independence. Since there is a possibility that this may break existing workloads, we add a flag to disable the check.

## How was this patch tested?

Existing unit tests were retrofitted to perform this check

Author: Burak Yavuz <brkyvz@gmail.com>

#8599 is resolved by brkyvz/partColCheck.

GitOrigin-RevId: 4233f36e862056982c6245981567a7965033182c
…adoc

While digging into delta's code, I found a typing mistake in the javadoc of ```WriteIntoDelta``` which I fixed in this Pull Request.

Closes #334

Author: mahmoud mahdi <mahmoudmahdi24@gmail.com>

#8656 is resolved by tdas/rkqow8pb.

GitOrigin-RevId: 526bcf5c0b1238b7313161ad27cf41d4061d50c1
The main goal of this Pull Request is to make some code enhancements on the different Options used in this project.
Some pattern matching snippets were unnecessary since we could have treated options as collections.
for example :

- **foreach** can replace the following code:
```
option match {
  case None => {}
  case Some(x) => foo(x)
}
```
- **forall** can replace the following code:
```
option match {
  case None => true
  case Some(x) => foo(x)
}
```
- **exists** can replace the following code:
```
option match {
  case None => false
  case Some(x) => foo(x)
}
```

I also removed some unnecessary pattern matching conditions which could have been anticipated using a ```getOrElse``` on the Scala Option.

These modifications were tested by running the existant tests.

Closes #335

Author: mahmoud mahdi <mahmoudmahdi24@gmail.com>

#8661 is resolved by tdas/rvc8m6aq.

GitOrigin-RevId: 3cce52edf222df0a2f67813b8bd39b76f974c8fd
Shamelessly stolen from https://github.com/apache/spark/blob/master/dev/lint-python

Closes #214

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Wesley Hoffman <wesleyhoffman109@gmail.com>

GitOrigin-RevId: 95f460ba0a79f2271983cf55fcbb2d07da13c779
…atched clause #342

When Merge command doesn't have a whenNotMatched clause, the Full Outer Join in MergeIntoCommand.writeAllChanges can be changed to Right Outer Join. Since left/source side is usually small, this can enable Broadcast join - closes #342

Closes #343

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: ekoifman <ekoifman@apache.org>

GitOrigin-RevId: 0bf21ea6eaa30df303a9b1ac926833c38f294e0e
@JassAbidi JassAbidi merged commit 5fb6a77 into JassAbidi:master Apr 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.