[KED-1466] SparkHiveDataSet is slow to initialize in Databricks #279

MigQ2 · 2020-03-09T13:33:11Z

Description

I'm using SparkHiveDataSet to connect to some Databricks tables. The code in SparkHiveDataSet.__init__() loads the dataset to find its columns and checks whether the table already exists. This triggers a short Spark action (~5 seconds) for each dataset when the catalog is initialized. As a result, with hundreds of SparkHiveDataSets in my catalog, initializing a kedro ipython session takes several minutes, even if I only want to do a quick analysis with one dataset.

By reading the source code I have seen that the actions come from the following methods, which are invoked when a SparkHiveDataSet is initialized:

SparkHiveDataSet._exists()
SparkHiveDataSet._load()
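
To illustrate why this adds up, here is a minimal, self-contained simulation of the pattern described above. It is not the actual Kedro source: `FakeSpark` and `EagerDataSet` are hypothetical stand-ins, and the fake session only counts actions instead of running them. The exact order in which the real `__init__()` calls `_exists()` and `_load()` is an assumption.

```python
class FakeSpark:
    """Hypothetical stand-in for a Spark session; counts actions instead of running them."""

    def __init__(self):
        self.actions = 0

    def run_action(self, description):
        self.actions += 1
        return description


class EagerDataSet:
    """Simplified stand-in for a dataset that inspects its table in __init__."""

    def __init__(self, spark, table):
        self._spark = spark
        self._table = table
        # Both checks run at construction time, even if the dataset is never used.
        # (Assumed ordering; the real class may differ.)
        if self._exists():
            self._columns = self._load()

    def _exists(self):
        self._spark.run_action(f"check {self._table} exists")
        return True

    def _load(self):
        return self._spark.run_action(f"read {self._table}")


spark = FakeSpark()
catalog = {f"table_{i}": EagerDataSet(spark, f"table_{i}") for i in range(200)}
print(spark.actions)  # 400 actions before any dataset is actually used
```

With real Spark actions taking seconds each, this per-dataset cost is what turns catalog initialization into minutes.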

Steps to Reproduce

Create many SparkHiveDataSet entries in the catalog that point to tables declared in Databricks
Initialize the kedro catalog

Expected Result

Kedro context and the catalog get initialized quickly (<30 seconds)

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

Kedro 0.15.5
Python 3.7
Databricks connect 6.4
Windows 10

Possible solution

Don't trigger any Spark action to inspect the schema when initializing the dataset; do it only when the dataset is actually accessed with _load() or _save(). Caching the result could also help speed things up if the dataset is used many times in a single kedro run.
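
A rough sketch of what I mean, under stated assumptions: `FakeSpark` and `LazyDataSet` are hypothetical names used only for illustration, the fake session just counts actions, and this is not how the real SparkHiveDataSet is written. The schema lookup is deferred to the first access and then cached.

```python
class FakeSpark:
    """Hypothetical stand-in for a Spark session; counts actions instead of running them."""

    def __init__(self):
        self.actions = 0

    def run_action(self, description):
        self.actions += 1
        return description


class LazyDataSet:
    """Illustrative dataset that defers schema inspection until first use."""

    def __init__(self, spark, table):
        self._spark = spark
        self._table = table
        self._columns = None  # no Spark action here; filled in on first access

    def _get_columns(self):
        # Inspect the schema once, then reuse the cached result.
        if self._columns is None:
            self._columns = self._spark.run_action(f"read schema of {self._table}")
        return self._columns

    def load(self):
        self._get_columns()
        return self._spark.run_action(f"read {self._table}")


spark = FakeSpark()
catalog = {f"table_{i}": LazyDataSet(spark, f"table_{i}") for i in range(200)}
actions_after_init = spark.actions

catalog["table_0"].load()
catalog["table_0"].load()
actions_after_two_loads = spark.actions  # one schema lookup plus two reads
```

Building the catalog costs no Spark actions in this sketch, and only the dataset that is actually used pays for its schema lookup, once.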


andrii-ivaniuk · 2020-05-21T12:04:54Z

Fixed in 8dc70bc

MigQ2 added the "Issue: Bug Report 🐞" label (Bug that needs to be fixed) Mar 9, 2020

MigQ2 added a commit to MigQ2/kedro that referenced this issue Mar 10, 2020

Sped up initialization of SparkHiveDataSet (kedro-org#279)

a70faea

MigQ2 added a commit to MigQ2/kedro that referenced this issue Mar 10, 2020

Made changes consistent in both versions of SparkHiveDataSet (kedro-o…

e9e32f0

…rg#279)

MigQ2 mentioned this issue Mar 10, 2020

[KED-1466] Sped up initialization of SparkHiveDataSet #281

Closed

2 tasks

MigQ2 added a commit to MigQ2/kedro that referenced this issue Mar 10, 2020

Fixed linting (kedro-org#279)

f9e2c81

MigQ2 added a commit to MigQ2/kedro that referenced this issue Mar 10, 2020

Fixed linting (kedro-org#279)

e18f9ad

MigQ2 added a commit to MigQ2/kedro that referenced this issue Mar 11, 2020

Updated SparkHiveDataSet tests (kedro-org#279)

a530e8e

MigQ2 added a commit to MigQ2/kedro that referenced this issue Mar 11, 2020

Fixed unit tests (kedro-org#279)

4f0b8be

lorenabalan changed the title ~~SparkHiveDataSet is slow to initialize in Databricks~~ [KED-1466] SparkHiveDataSet is slow to initialize in Databricks Mar 13, 2020

andrii-ivaniuk closed this as completed May 21, 2020
