[KED-1466] SparkHiveDataSet is slow to initialize in Databricks #279
Labels
Issue: Bug Report 🐞
Bug that needs to be fixed
Comments
MigQ2 added four commits to MigQ2/kedro that referenced this issue on Mar 10, 2020, and two more on Mar 11, 2020.
lorenabalan changed the title from "SparkHiveDataSet is slow to initialize in Databricks" to "[KED-1466] SparkHiveDataSet is slow to initialize in Databricks" on Mar 13, 2020.
Fixed in 8dc70bc.
Description
I'm using SparkHiveDataSet to connect to some Databricks tables, and I have seen that the code in SparkHiveDataSet.__init__() loads the dataset to find its columns and checks whether the table already exists. This triggers a short Spark action that takes about 5 seconds per dataset when the catalog is initialized. Therefore, if I have hundreds of SparkHiveDataSets in my catalog, initializing a kedro ipython session takes several minutes, even if I just want to do a quick analysis with one dataset.

By reading the source code I have seen that the actions come from the following methods, which are invoked when a SparkHiveDataSet is initialized:
SparkHiveDataSet._exists()
SparkHiveDataSet._load()
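To put the reported figures in perspective, here is a rough back-of-the-envelope estimate. The 5 seconds per dataset comes from the description above. The catalog sizes and the helper name `estimate_startup_minutes` are illustrative assumptions, not measurements.

```python
def estimate_startup_minutes(n_datasets: int, seconds_per_dataset: float = 5.0) -> float:
    """Estimate catalog start-up time if every dataset runs one Spark action on init."""
    return n_datasets * seconds_per_dataset / 60


# Illustrative catalog sizes (assumed, not taken from the report).
for n in (60, 120, 300):
    print(f"{n} datasets: about {estimate_startup_minutes(n):.1f} minutes")
```

Under these assumptions the cost grows linearly with catalog size, regardless of how many datasets are actually used in the session.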
Steps to Reproduce
Create a catalog with several SparkHiveDataSet entries that point to tables declared in Databricks.

Expected Result
Kedro context and the catalog get initialized quickly (<30 seconds)
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
Possible solution
Don't trigger any Spark action to inspect the schema when initializing the dataset; do so only when the dataset is actually accessed through _load() or _save(). Caching the result could also help speed things up if the dataset is used many times in a single kedro run.
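The idea of deferring the schema inspection and caching its result could look roughly like the sketch below. It is not Kedro's actual implementation and not the fix that was merged. `LazyTableDataSet` and `slow_schema_lookup` are hypothetical names, and the lookup function is a plain-Python placeholder for the Spark action so the example runs without a cluster.

```python
from typing import Callable, List, Optional


class LazyTableDataSet:
    """Hypothetical stand-in for a Hive-backed dataset; not Kedro's real class."""

    def __init__(self, table: str, fetch_columns: Callable[[str], List[str]]):
        # Only store configuration here: no expensive lookup is triggered.
        self._table = table
        self._fetch_columns = fetch_columns
        self._columns: Optional[List[str]] = None

    def _get_columns(self) -> List[str]:
        # Run the expensive schema lookup on first use, then reuse the result.
        if self._columns is None:
            self._columns = self._fetch_columns(self._table)
        return self._columns

    def load(self) -> List[str]:
        # A real dataset would return a DataFrame; the columns are enough here.
        return self._get_columns()


calls: List[str] = []


def slow_schema_lookup(table: str) -> List[str]:
    # Placeholder for the Spark action that inspects the table (assumed).
    calls.append(table)
    return ["id", "value"]


dataset = LazyTableDataSet("db.my_table", slow_schema_lookup)
calls_after_init = len(calls)        # constructing the dataset did no lookup
dataset.load()
dataset.load()
calls_after_two_loads = len(calls)   # the second load reused the cached schema
```

With this structure, building a catalog of many datasets costs nothing up front, and each dataset pays for the lookup at most once, the first time it is used.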