
catch up to master #3

Merged
merged 26 commits into master on Apr 23, 2020

Conversation

JassAbidi
Owner

No description provided.

brkyvz and others added 26 commits April 7, 2020 13:12
## What changes were proposed in this pull request?

We fix the tableIds in all tests to break the assumption that all tableIds must be unique. Users may copy their tables to new locations, resulting in duplicate tableIds, so none of the logic within the Delta code should assume that tables have unique ids.

## How was this patch tested?

A test in DeltaSourceSuite that required a new tableId failed under this change, which confirms that the change takes effect.
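As an illustration of the testing pattern (a minimal sketch, not the helper actually used in this PR): pin one constant tableId for every test table, so that any code silently assuming id uniqueness fails loudly in tests.

```scala
// Hypothetical test fixture: every test table reuses one constant id,
// modeling a user copying a table (duplicate tableIds are legitimate).
object TestTableIds {
  val FixedId: String = "00000000-0000-0000-0000-000000000000"
}
```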

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

#8749 is resolved by brkyvz/tableId.

GitOrigin-RevId: 0ef93c1cb18829db315dec7746a7ed6a352e4d3f
…foundation guidelines

## What changes were proposed in this pull request?

Based on Linux Foundation guidelines, we are changing the Copyright in the license headers from "Databricks, Inc" to "The Delta Lake Project Authors"

## How was this patch tested?

Checked that no files have "Databricks, Inc" in them.

Author: Tathagata Das <tathagata.das1565@gmail.com>

#8792 is resolved by tdas/SC-30048.

GitOrigin-RevId: dc1cfe30c1780a0c3202bcd114fcc2c747cb72c3
…t of Default HDFSLogStore

## What changes were proposed in this pull request?
* Use FileSystem APIs for log reads instead of FileContext APIs, achieved by making `HDFSLogStore` extend `HadoopFileSystemLogStore` (see the sketch below)
* Throw a better error message for writes
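A minimal, self-contained sketch of the shape of this change (class names mirror the PR description; the bodies are illustrative, not the actual Delta implementation):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import scala.io.Source

// Stand-in for HadoopFileSystemLogStore: reads go through the FileSystem API,
// which every Hadoop-compatible filesystem implements.
abstract class HadoopFileSystemLogStoreSketch(hadoopConf: Configuration) {
  def read(path: Path): Seq[String] = {
    val fs = path.getFileSystem(hadoopConf) // FileSystem, not FileContext
    val stream = fs.open(path)
    try Source.fromInputStream(stream, "UTF-8").getLines().toList
    finally stream.close()
  }
}

// HDFSLogStore inherits its reads; it keeps FileContext only for atomic
// writes and can raise a clearer error when no AbstractFileSystem
// implementation is available.
class HDFSLogStoreSketch(hadoopConf: Configuration)
    extends HadoopFileSystemLogStoreSketch(hadoopConf)
```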

closes #358

## How was this patch tested?
Added a couple of unit tests in the `HDFSLogStoreSuite` and an end-to-end test which reads a DataFrame from a filesystem that doesn't implement AbstractFileSystem

Author: Rahul Mahadev <rahul.mahadev@databricks.com>

#8733 is resolved by rahulsmahadev/fileSystemForLogReads.

GitOrigin-RevId: f5fdf83f63db1d81f0adc481c5e72ced520f1525
Fixes support for CONVERT TO DELTA on tables stored in the HiveMetaStore. This feature will be open sourced after Spark 3.0 is released.

To ensure that the feature works, I had to refactor some of the test suites, so that the table based tests can run using Hive test utilities as well.

Unit tests

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

#8701 is resolved by brkyvz/hiveConvert.

GitOrigin-RevId: fe3057f52c2c85053a69585e7fd7e26b2c240642
This PR updates our docs to discuss the new process for accepting commits under the Linux foundation.

Closes #337

Co-authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

Author: Michael Armbrust <michael@databricks.com>

#8924 is resolved by tdas/2zm6hc96.

GitOrigin-RevId: a5f125b77c6d7d4cf790de4163e04e1bed93023c
…eltas

## What changes were proposed in this pull request?

Currently, a checkpoint must have delta files listed before it in the directory to be considered recoverable. This has a fencepost error: If there's only one recoverable checkpoint, and all deltas before it have been aged out, we'll report that the checkpoint isn't recoverable even if all deltas from that checkpoint onwards are available. This PR fixes the issue by advancing the check one iteration forwards.
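A self-contained sketch of the corrected check (names and signatures are illustrative, not the actual Delta internals):

```scala
// A checkpoint at version v is recoverable when delta files cover every
// version from v + 1 up to the latest version, even if all deltas at or
// before v have been aged out (the case the old check got wrong).
def isCheckpointRecoverable(
    checkpointVersion: Long,
    availableDeltaVersions: Set[Long],
    latestVersion: Long): Boolean = {
  ((checkpointVersion + 1) to latestVersion).forall(availableDeltaVersions.contains)
}
```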

## How was this patch tested?

new unit test

Author: Jose Torres <joseph.torres@databricks.com>

#8868 is resolved by jose-torres/fixthebug2.

GitOrigin-RevId: 044b81fd9973b1d3c90d87cfb7891b994c0655f7
Minor refactoring in the imports for GenerateSymlinkManifest.scala

GitOrigin-RevId: ef036ca9eae859797d828a604818b9ca24d966f2
… in getFilesForUpdate

## What changes were proposed in this pull request?

Fixes a bug where delta files older than the latest checkpoint are returned as part of `getFilesForUpdate`.
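A simplified sketch of the fix (types are illustrative):

```scala
// Delta files at or below the latest checkpoint's version are already covered
// by the checkpoint, so they must not be returned as part of the update.
final case class DeltaFile(version: Long, path: String)

def filesForUpdate(deltas: Seq[DeltaFile], latestCheckpointVersion: Long): Seq[DeltaFile] =
  deltas.filter(_.version > latestCheckpointVersion)
```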

## How was this patch tested?

Regression test

Author: Burak Yavuz <brkyvz@gmail.com>

#9060 is resolved by brkyvz/fixDeltas.

GitOrigin-RevId: e4823726dc9c4b78c1672cbdea16b2b1988df017
Minor refactoring

Author: Tathagata Das <tathagata.das1565@gmail.com>

GitOrigin-RevId: 91f0ae9c7e82a153f812e0e9c94b0dbf7532e1df
… to decide whether to use rename

Add tests to ensure checkpoint uses `isPartialWriteVisible` to decide whether to use rename.
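A sketch of the decision under test, with illustrative stand-in types (`isPartialWriteVisible` is part of Delta's LogStore interface):

```scala
import org.apache.hadoop.fs.Path

trait LogStoreSketch {
  // True when readers may observe half-written files on this filesystem.
  def isPartialWriteVisible(path: Path): Boolean
}

// When partial writes are visible, the checkpoint should be written to a
// temporary file and renamed into place; otherwise a direct write is safe.
def shouldUseRename(store: LogStoreSketch, checkpointPath: Path): Boolean =
  store.isPartialWriteVisible(checkpointPath)
```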

The new unit test.

Author: Shixiong Zhu <zsxwing@gmail.com>

GitOrigin-RevId: b72ebec9695aaabbc156ea27fe7d4077ce4d0fe1
…ge to reduce the number of files

## What changes were proposed in this pull request?
Added a DeltaSQLConf to allow repartitioning the merge output DataFrame by the table's partition columns to reduce the number of files.
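A sketch of the behavior behind the new conf (the flag name and plumbing here are assumptions; the real key lives in `DeltaSQLConf`):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// When enabled, repartition the merge output by the table's partition columns
// so each task writes to few partitions, yielding fewer, larger files.
def maybeRepartition(
    output: DataFrame,
    partitionColumns: Seq[String],
    repartitionEnabled: Boolean): DataFrame = {
  if (repartitionEnabled && partitionColumns.nonEmpty) {
    output.repartition(partitionColumns.map(col): _*)
  } else {
    output
  }
}
```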

closes #367

## How was this patch tested?
Added unit tests in MergeIntoCommandSuiteBase

Author: Rahul Mahadev <rahul.mahadev@databricks.com>

#8890 is resolved by rahulsmahadev/mergeLessFiles.

GitOrigin-RevId: 58ba990cb0bf05a93d78d18e0285b795e6014a24
Added a new type of message to throw on errors related to invalid options.

Author: Burak Yavuz <brkyvz@gmail.com>

GitOrigin-RevId: d4133f79d0a52636d47665c49f6105ee8c00d8cd
Implement MERGE INTO schema evolution

new unit tests
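A usage sketch, assuming an active SparkSession and existing Delta tables named `target` and `source`; the conf key follows Delta's documented name for automatic schema merging:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("merge-evolution").getOrCreate()

// With automatic schema merging enabled, a MERGE whose source carries extra
// columns evolves the target's schema instead of failing analysis.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

spark.sql("""
  MERGE INTO target t
  USING source s
  ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```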

Author: Jose Torres <joseph.torres@databricks.com>

GitOrigin-RevId: 633dbcdd90143c8551982d97e48fffbca1dbb073
- Metadata checks for detecting table id changes; these checks generate logs
- Extra log4j logs to help debug metadata issues

Closes #9221 from tdas/SC-31462.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
GitOrigin-RevId: c6e63057acb5c712ebefdb28d88df50bdbd875da
This PR removes the re-use of the previous Snapshot's state for the next Snapshot's computation. This logic adds some complexity around having to keep track of RDD lineage to avoid stack overflows, and requiring us to have 2 different Snapshot computation methods when we're trying to load a given version of the table versus just the latest version. Instead we will now perform file listing from the latest known checkpoint at all times and give the full set of files required to build a Snapshot.

The only advantage of reusing the state from the previous snapshot was that it would avoid a cost of hitting the storage system for the existing data. We ran some benchmarks to show that this cost is minimal and should be dwarfed by the time it takes to actually perform ETL.
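A simplified sketch of the new construction path (names are illustrative):

```scala
// Always list from the latest checkpoint at or before the requested version
// and hand the full file set to the Snapshot, instead of chaining state off
// the previous Snapshot (which required tracking RDD lineage).
final case class LogSegmentSketch(checkpoint: Option[Long], deltas: Seq[Long])

def segmentForVersion(
    checkpointVersions: Seq[Long],
    deltaVersions: Seq[Long],
    version: Long): LogSegmentSketch = {
  val cp = checkpointVersions.filter(_ <= version).sorted.lastOption
  val deltas = deltaVersions.filter(v => v > cp.getOrElse(-1L) && v <= version).sorted
  LogSegmentSketch(cp, deltas)
}
```

The same listing serves both loading the latest version and loading an older one, which is what removes the need for two separate Snapshot computation methods.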

 - Write a very small commit to a large table (500,000 actions in checkpoint, 106 MB in size) + 8 delta files (100 actions per commit) (This should be a pathological case, because we need to hit the storage system every commit)

Old code path: 1.7 seconds
New code path: 8.1 seconds

 - How long it takes to run a rateStream with 10 rows per second to a Delta table (effects on latency) for 100 batches:

Old code path: 5.3 minutes
New code path: 4.6 minutes
^^ I honestly don't know how it got better; I would've expected it to get slower, as above. It could be variability in cloud instance performance.

These numbers should be dwarfed by the time to actually write the full data, therefore it seems like a worthy compromise.

Author: Burak Yavuz <brkyvz@gmail.com>

GitOrigin-RevId: afe702686df982766794f498969761c320c42e42
…s don't get filtered out

## What changes were proposed in this pull request?
Fixed a bug which caused the stats from `BasicFileStatsTracker` not to be filtered out.

## How was this patch tested?
Added a unit test which basically tests if the captured metrics are the ones defined in the schema.
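A sketch of the assertion that test makes (types are illustrative):

```scala
// The captured metrics must be exactly the keys declared in the stats schema;
// anything extra should have been filtered out before being recorded.
def metricsMatchSchema(captured: Map[String, Long], schemaFields: Set[String]): Boolean =
  captured.keySet == schemaFields
```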

Author: Rahul Mahadev <rahul.mahadev@databricks.com>

#8847 is resolved by rahulsmahadev/historyFixExtra.

GitOrigin-RevId: 35719c253a035445dbbf2376ddf210e55e5f0adc
Avoid flakiness by cleaning up scope of view in test

Closes #8890 from rahulsmahadev/mergeLessFiles.

Authored-by: Rahul Mahadev <rahul.mahadev@databricks.com>
Signed-off-by: Tathagata Das <tdas@databricks.com>
GitOrigin-RevId: 740e7685d0570af007fae325e72f9cac0247740a
…alidation

## What changes were proposed in this pull request?

This PR reorganizes some of the information passed into Snapshot and MetadataGetter to reduce the complexity of the implementations. Checksum validation is now performed by the Snapshot instead of the MetadataGetter. We also get rid of the MetadataGetter interface, as it doesn't really add any value.

## How was this patch tested?

Existing unit tests

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

#9061 is resolved by brkyvz/refValidation.

GitOrigin-RevId: 437d0ffbcd62c6253422a902d5a6c6e886fbf147
After looking at failures, I found that the root cause is that `enableExpiredLogCleanup` is not set to `false` in some tests, so the automatic log cleanup gets triggered. The flag is not flipped because these tests don't commit any `Metadata` action, and `spark.databricks.delta.properties.defaults.enableExpiredLogCleanup` is picked up only if a `Metadata` action is committed.

This PR adds `startTxnWithManualLogCleanup` and uses it to do the first commit to make sure we disable `enableExpiredLogCleanup` correctly.
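A self-contained sketch of the helper's intent (the real `startTxnWithManualLogCleanup` lives in Delta's test utilities; the types here are stand-ins):

```scala
trait TxnSketch {
  // Commits actions along with table properties recorded in a Metadata action.
  def commitWithProperties(props: Map[String, String]): Long
}

trait DeltaLogSketch { def startTransaction(): TxnSketch }

// Make the first commit carry the property explicitly: the default conf
// spark.databricks.delta.properties.defaults.enableExpiredLogCleanup only
// takes effect when a Metadata action is committed, which these tests lacked.
def startTxnWithManualLogCleanup(log: DeltaLogSketch): Long = {
  val txn = log.startTransaction()
  txn.commitWithProperties(Map("delta.enableExpiredLogCleanup" -> "false"))
}
```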

Author: Shixiong Zhu <zsxwing@gmail.com>

GitOrigin-RevId: d136dd842b29e165bfba8dfe079094d00e8cb5db
## What changes were proposed in this pull request?

DeltaLogging is adding unnecessary methods to the API docs of DeltaMergeBuilder.

Author: Tathagata Das <tathagata.das1565@gmail.com>

#9345 is resolved by tdas/SC-32755.

GitOrigin-RevId: 15314baff52247bb8e71f0ef600176e186428be2
The Analyzer resolves timestamp expressions using the session-local time zone during analysis. Sometimes we were prematurely resolving TimeTravel nodes, which caused:
```scala
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: java.util.NoSuchElementException: None.get

at scala.None$.get(Option.scala:347)
```

This PR adds a resolution check on timestamp expressions. The main bug existed for tables that were accessed through paths, e.g. ```delta.`/some/path` ```. Queries that accessed tables directly through names worked fine.
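A sketch of the added guard, with Catalyst's Expression reduced to a stand-in trait:

```scala
trait ExpressionSketch { def resolved: Boolean }

// Only resolve a TimeTravel node once its timestamp expression is itself
// resolved, i.e. after the session-local time zone has been applied. This
// avoids the premature resolution that triggered None.get for path-based tables.
def readyToResolveTimeTravel(timestamp: Option[ExpressionSketch]): Boolean =
  timestamp.forall(_.resolved)
```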

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

GitOrigin-RevId: b311691fc6202f3fa1ed4a3e4b8e4197a28198f5
…tive session is not changed (master)

This PR adds DatasetRefCache, which caches a Dataset reference for as long as the active session is unchanged, avoiding the overhead of Dataset creation. When the active session changes, it automatically creates a new Dataset.
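A self-contained sketch of the idea (the real DatasetRefCache lives in the Delta codebase and may differ in detail):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

class DatasetRefCacheSketch[T](createDs: SparkSession => Dataset[T]) {
  @volatile private var cached: Option[(SparkSession, Dataset[T])] = None

  def get: Dataset[T] = {
    val session = SparkSession.active
    cached match {
      // Same active session: reuse the reference and skip Dataset creation.
      case Some((s, ds)) if s eq session => ds
      case _ =>
        val ds = createDs(session)
        cached = Some((session, ds))
        ds
    }
  }
}
```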

Jenkins

Closes #9403 from zsxwing/SC-31106-master.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
GitOrigin-RevId: 5d73ae000f0f4740b232685134d85cf5d74e17a6
…la API.

## What changes were proposed in this pull request?

In the Scala API, when we create a DeltaTable object, we reuse the same DataFrame forever. This isn't correct and can lead to errors if the Delta table's schema has changed between DeltaTable initialization and the execution of an action such as merge. We should instead recreate the DataFrame from the current state of the Delta table when executing actions.
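A sketch of the fix (names are illustrative; the real DeltaTable keeps more state):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

class DeltaTableSketch(spark: SparkSession, path: String) {
  // Re-derived for each action rather than captured at construction time, so
  // schema changes made after the DeltaTable was created are picked up.
  def toDF: DataFrame = spark.read.format("delta").load(path)
}
```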

## How was this patch tested?

new unit test

Author: Jose Torres <joseph.torres@databricks.com>

#9382 is resolved by jose-torres/fixmerge.

GitOrigin-RevId: 70c28441197328ad89398c936d0efd95bfa87fd9
…rtitioned

This PR fixes an issue with removing partitions in a Delta table.
It was discussed here: https://delta-users.slack.com/archives/CJ70UCSHM/p1587048581235000
I implemented the solution proposed here: https://delta-users.slack.com/archives/CJ70UCSHM/p1587068793244400

Closes #390

Co-authored-by: hleb.lizunkou <hleb.lizunkoui@coxautoinc.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
GitOrigin-RevId: eee5573959bc9827f1d381133cde45685f8dbee4
@JassAbidi JassAbidi merged commit 4ec9fec into JassAbidi:master Apr 23, 2020