[KED-1458] Versioning extremely slow on DBFS #275
Comments
Line-by-line profiling of
Improving
Between #275 (comment) and #275 (comment), the time spent on line 124 has disappeared! The remaining bottleneck is line 92. However, since we've patched the code that creates
This is a great writeup, thanks for such a detailed issue 👌
Fixed in merge commit: 6bf1066
Description
Our pipeline running on Azure Databricks has gotten progressively slower. Somebody noticed that running on a fresh set of paths (without so many versions) was significantly faster. Further investigation showed that it wasn't because of the data itself; instead, it's finding the existing versions that's prohibitively slow. Specifically, underlying functions used by iglob (like os.scandir) are much slower than their DBFS-native counterparts (e.g. dbutils.fs.ls).
Context
Pipelines that should take less than 2 hours are taking 3-5 times that.
Steps to Reproduce
On a Databricks cluster:
SparkDataSet.
Expected Result
Time taken for Step 4 should remain quite similar to that for Step 2.
Actual Result
Time explodes. 🌋
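One rough way to check that the slowdown comes from version lookup rather than from the data is to time a glob-based lookup against a growing number of version directories. The sketch below is only an illustration and not Kedro's actual code: the helpers make_versions, latest_version, and time_lookup, the root/filename/version/filename layout, and the timestamp-like version names are all assumptions made for this example.

```python
import tempfile
import time
from glob import iglob
from pathlib import Path
from typing import Optional


def make_versions(root: Path, filename: str, count: int) -> None:
    """Create `count` hypothetical version directories, each holding one empty file."""
    for i in range(count):
        # Zero-padded names so that lexical order matches creation order.
        version_dir = root / filename / f"2019-01-01T00.00.{i:06d}Z"
        version_dir.mkdir(parents=True)
        (version_dir / filename).touch()


def latest_version(root: Path, filename: str) -> str:
    """Find the most recent version by globbing over every version directory."""
    pattern = str(root / filename / "*" / filename)
    paths = list(iglob(pattern))  # iglob lists directories via os.scandir
    return max(Path(p).parent.name for p in paths)


def time_lookup(count: int, base_dir: Optional[str] = None) -> float:
    """Return the seconds spent on one lookup over `count` versions.

    `base_dir` is an assumed knob for pointing the experiment at a
    particular mount; by default the system temp directory is used.
    """
    with tempfile.TemporaryDirectory(dir=base_dir) as tmp:
        root = Path(tmp)
        make_versions(root, "data.parquet", count)
        start = time.perf_counter()
        latest_version(root, "data.parquet")
        return time.perf_counter() - start
```

Comparing time_lookup for a small and a large count, once on local disk and once with base_dir pointing at a DBFS mount, should indicate whether the lookup itself scales poorly there.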
Possible Implementation
See DBFSDirEntry, _get_dbutils, _dbfs_scandir, and patches below:
I believe it's fine to have DBFS-specific code, as we already do. However, a few necessary enhancements:
DATABRICKS_RUNTIME_VERSION environment variable) and apply patches conditionally.
SparkDataSet, when the filesystem is DBFS (versioning should be the same). => Actually implement this code as part of the versioning mix-in?
Possible Alternatives
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
Kedro version used (pip show kedro or kedro -V): 0.15.5
Python version used (python -V): Python 3.7.3