[KED-1466] SparkHiveDataSet is slow to initialize in Databricks #279

Closed
MigQ2 opened this issue Mar 9, 2020 · 1 comment
Labels
Issue: Bug Report 🐞 Bug that needs to be fixed

Comments

@MigQ2
Contributor

MigQ2 commented Mar 9, 2020

Description

I'm using SparkHiveDataSet to connect to some Databricks tables. The code in SparkHiveDataSet.__init__() loads the dataset to find its columns and checks whether the table already exists. This triggers a short Spark action (~5 seconds) for each dataset when the catalog is initialized. As a result, with hundreds of SparkHiveDataSets in my catalog, initializing a kedro ipython session takes several minutes, even if I only want to do a quick analysis with one dataset.

Reading the source code, I found that the actions come from the following methods, which are invoked when a SparkHiveDataSet is initialized:

  • SparkHiveDataSet._exists()
  • SparkHiveDataSet._load()
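
The pattern described above can be modeled without Spark. The following is a minimal sketch, not Kedro's actual code: `EagerDataSet` and `fake_spark_action` are hypothetical stand-ins, and a counter replaces the ~5 second Spark action so the cost shows up as a count instead of wall time. It assumes one action per dataset, matching the per-dataset figure reported above.

```python
# Hypothetical model of eager initialization; not Kedro's actual implementation.
ACTION_SECONDS = 5  # approximate cost per dataset reported in this issue

actions_run = 0


def fake_spark_action(table):
    """Stand-in for the Spark action (existence check / schema load)."""
    global actions_run
    actions_run += 1
    return ["col_a", "col_b"]  # made-up column names


class EagerDataSet:
    def __init__(self, table):
        self._table = table
        # The constructor itself runs the action, so every catalog entry
        # pays the cost even if it is never loaded or saved.
        self._columns = fake_spark_action(table)


# Building a catalog of 200 entries triggers 200 actions up front.
catalog = {f"table_{i}": EagerDataSet(f"table_{i}") for i in range(200)}
estimated_seconds = actions_run * ACTION_SECONDS
print(actions_run, estimated_seconds)  # 200 1000
```

With 200 entries, the estimate is 1000 seconds (about 17 minutes) before any dataset is used.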

Steps to Reproduce

  1. Create many SparkHiveDataSet entries in the catalog that point to tables declared in Databricks
  2. Initialize the kedro catalog

Expected Result

The Kedro context and the catalog initialize quickly (<30 seconds)

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro 0.15.5
  • Python 3.7
  • Databricks connect 6.4
  • Windows 10

Possible solution

Don't trigger any Spark action to inspect the schema when initializing the dataset; do it only when the dataset is actually accessed through _load() or _save(). Caching the result could also speed things up if the dataset is used many times in a single kedro run.
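
A minimal sketch of this idea, assuming a hypothetical `LazyDataSet` class and an injected `inspect_schema` callable in place of the real Spark call (neither is part of Kedro). It illustrates deferring the inspection to first access and caching the result; it is not the fix that was eventually merged.

```python
# Hypothetical sketch of lazy, cached schema inspection; not Kedro's actual code.
class LazyDataSet:
    def __init__(self, table, inspect_schema):
        self._table = table
        # Callable that would run the Spark action; injected here so the
        # sketch runs without a cluster.
        self._inspect_schema = inspect_schema
        self._columns = None  # nothing is fetched at construction time

    def _get_columns(self):
        # Run the inspection on first access only, then reuse the result.
        if self._columns is None:
            self._columns = self._inspect_schema(self._table)
        return self._columns

    def load(self):
        # Placeholder for a real load; returns the metadata it needed.
        return {"table": self._table, "columns": self._get_columns()}


calls = []


def counting_inspector(table):
    """Stand-in for the Spark action; records which tables were inspected."""
    calls.append(table)
    return ["col_a", "col_b"]  # made-up column names


catalog = {
    f"table_{i}": LazyDataSet(f"table_{i}", counting_inspector)
    for i in range(200)
}
calls_after_init = len(calls)   # no inspection during catalog construction

catalog["table_7"].load()
catalog["table_7"].load()       # second load reuses the cached columns
calls_after_loads = len(calls)  # only one inspection, for table_7
```

One trade-off of the cache in this sketch: if the table's schema changes during a run, the cached columns go stale, so a real implementation might need a way to invalidate them after a save.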

@MigQ2 MigQ2 added the Issue: Bug Report 🐞 Bug that needs to be fixed label Mar 9, 2020
MigQ2 added a commit to MigQ2/kedro that referenced this issue Mar 10, 2020
MigQ2 added a commit to MigQ2/kedro that referenced this issue Mar 10, 2020
MigQ2 added a commit to MigQ2/kedro that referenced this issue Mar 10, 2020
MigQ2 added a commit to MigQ2/kedro that referenced this issue Mar 11, 2020
MigQ2 added a commit to MigQ2/kedro that referenced this issue Mar 11, 2020
@lorenabalan lorenabalan changed the title SparkHiveDataSet is slow to initialize in Databricks [KED-1466] SparkHiveDataSet is slow to initialize in Databricks Mar 13, 2020
@andrii-ivaniuk
Contributor

Fixed in 8dc70bc
