safer use of "/dbfs" #1931

Closed · wants to merge 22 commits

Commits (22)
0941892  safer use of "/dbfs" (mle-els, Oct 13, 2022)
350d4ef  fix broken link (#1950) (noklam, Oct 18, 2022)
3aa25d2  Update dependabot.yml config (#1938) (SajidAlamQB, Oct 19, 2022)
2b10303  Update setup.py Jinja2 dependencies (#1954) (noklam, Oct 19, 2022)
7972d36  Update pip-tools requirement from ~=6.5 to ~=6.9 in /dependency (#1957) (dependabot[bot], Oct 19, 2022)
d3bdbbe  Update toposort requirement from ~=1.5 to ~=1.7 in /dependency (#1956) (dependabot[bot], Oct 19, 2022)
59895ea  Add deprecation warning to package_name argument in session create() … (merelcht, Oct 19, 2022)
b280494  Remove redundant `resolve_load_version` call (#1911) (noklam, Oct 20, 2022)
ba546f9  Make docstring in test starter match real starters (#1916) (deepyaman, Oct 20, 2022)
9fc83cd  Add show-docs command to Makefile (#1959) (stichbury, Oct 21, 2022)
1730e9c  Enable `TensorFlowModelDataset` to overwrite existing model, and add … (williamcaicedo, Oct 22, 2022)
723cb2d  make catching narrower (mle-els, Oct 22, 2022)
411b145  safer use of "/dbfs" (mle-els, Oct 13, 2022)
f4e6710  make catching narrower (mle-els, Oct 22, 2022)
00e90bb  safer use of "/dbfs" (mle-els, Oct 13, 2022)
5bad159  make catching narrower (mle-els, Oct 22, 2022)
6c744e7  Merge branch 'patch-1' of github.com:mle-els/kedro into patch-1 (merelcht, Nov 7, 2022)
9d2f8c1  Merge branch 'main' into patch-1 (merelcht, Nov 7, 2022)
53182a5  add test, release note (mle-els, Nov 9, 2022)
f4088d6  Merge branch 'patch-1' of github.com:mle-els/kedro into patch-1 (mle-els, Nov 9, 2022)
6e6da4b  Fix lint (merelcht, Nov 9, 2022)
c610485  Merge branch 'main' into patch-1 (merelcht, Nov 9, 2022)

Files changed
1 change: 1 addition & 0 deletions RELEASE.md
@@ -25,6 +25,7 @@
 * Updated MatplotlibWriter Dataset docs with working examples.
 * Modified implementation of the Kedro IPython extension to use `local_ns` rather than a global variable.
 * Refactored `ShelveStore` to its own module to ensure multiprocessing works with it.
+* Fixed `AttributeError` when using `/dbfs` paths on an unsupported environment

 ## Minor breaking changes to the API
8 changes: 7 additions & 1 deletion kedro/extras/datasets/spark/spark_dataset.py
@@ -309,7 +309,13 @@ def __init__(  # pylint: disable=too-many-arguments
         path = PurePosixPath(filepath)

         if filepath.startswith("/dbfs"):
-            dbutils = _get_dbutils(self._get_spark())
+            dbutils = None
+            try:
+                dbutils = _get_dbutils(self._get_spark())
+            except AttributeError:
+                # Databricks is known to raise AttributeError when called
+                # on an unsupported environment
+                pass
Comment on lines +312 to +318
Contributor:
I am not sure I understand the root cause entirely. Is this a bug in the Databricks pyspark.dbutils module, or is it because we check the filepath too eagerly in kedro?

The _get_dbutils function is supposed to try getting the dbutils aggressively and return None if it can't. This solution adds yet another try-except layer outside, which is a bit hacky but maybe necessary in this case? I want to make sure I understand the problem before I come to a conclusion.

If it is necessary, would it be better to have this try-except block inside the _get_dbutils function?

Author:
I believe it's a bug in Databricks code. It assumes that IPython.get_ipython() returns an object; when it happens to return None, we get an AttributeError.

/databricks/spark/python/pyspark/dbutils.py:50 in get_dbutils

   47             return SparkServiceClientDBUtils(spark.sparkContext)
   48         else:
   49             import IPython
 ❱ 50             return IPython.get_ipython().user_ns["dbutils"]
   51
   52
   53 class SparkServiceClientDBUtils(object):
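
In other words, a minimal repro (not taken from the PR) of the failure mode on line 50, assuming IPython is installed but no kernel is active:

```python
# Run this as a plain `python` script, outside any IPython/notebook session:
import IPython

shell = IPython.get_ipython()
print(shell)  # -> None when no IPython kernel is active

# The next line then fails the same way dbutils.py line 50 does:
shell.user_ns["dbutils"]  # AttributeError: 'NoneType' object has no attribute 'user_ns'
```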

I think having the try-except block inside _get_dbutils is a better solution indeed. Thanks for pointing that out!
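
For illustration, the guard moved inside the helper might look roughly like this. This is an assumed shape, not kedro's actual implementation; only the contract (return a dbutils handle or None) comes from this thread:

```python
# Sketch only (assumed shape): return a dbutils handle if one can be
# obtained, else None, with the None-returning get_ipython() case
# handled inside the helper instead of at the call site.
from typing import Any, Optional


def _get_dbutils(spark: Any) -> Optional[Any]:
    try:
        from pyspark.dbutils import DBUtils  # only present on Databricks

        return DBUtils(spark)
    except ImportError:
        pass
    except AttributeError:
        # Databricks' own get_dbutils assumes IPython.get_ipython()
        # returns an object; on non-notebook runs it can be None.
        return None

    try:
        import IPython
    except ImportError:
        return None

    ipython = IPython.get_ipython()
    # get_ipython() is None outside a notebook/IPython kernel; guarding
    # here keeps the AttributeError from leaking out to SparkDataSet.
    if ipython is None:
        return None
    return ipython.user_ns.get("dbutils")
```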

Contributor:
Is this pyspark.dbutils module only available on the Databricks runtime? If so, I think that's why it assumes you have IPython. Also, you mentioned you are running on Databricks but not in a managed way; I am not aware of an on-premise option, so what kind of environment are you running on?

Author:
I was trying to run on a normal Databricks cluster, just with MLflow instead of a notebook. I managed to run pipelines via a notebook too, but it would have been better to do it through the command line. When I run mlflow run, MLflow packages my project into a zip file, sends it to a new Databricks cluster, and runs it there. Apparently, because it's not in a notebook, there's no IPython.

If you think this use case is worth supporting, I can make the change that you proposed.
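
For context, the submission step described above corresponds roughly to this MLflow Projects call (a hypothetical sketch; the project URI and the cluster-spec filename are assumptions, not from the PR):

```python
# Submitting a kedro project to a fresh Databricks cluster via MLflow
# Projects, equivalent to `mlflow run . --backend databricks` on the
# command line. The URI and cluster-spec filename are assumptions.
import mlflow

submitted = mlflow.projects.run(
    uri=".",                             # directory containing an MLproject file
    backend="databricks",
    backend_config="cluster-spec.json",  # new-cluster spec; assumed filename
)
submitted.wait()  # block until the remote Databricks job finishes
```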

Contributor:
@mle-els I prefer moving the try-except block to _get_dbutils. As for IPython, I am unsure: even when running a .py file, IPython is normally available, but Databricks doesn't document this.

Contributor:
@mle-els I won't be able to test it myself since I don't have the environment configured. My guess is that you have a relatively old Databricks runtime.

We tested it recently with dbx, which packages up a project and runs it as a Databricks Job, and IPython was available in that case.

This suggests that Databricks runtime >11 always runs on IPython. The page below mentions notebooks only, but when we tested a couple of months ago it was the same with .py files:
https://docs.databricks.com/notebooks/ipython-kernel.html#how-to-use-the-ipython-kernel-with-databricks
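
A quick way to verify that claim on any given runtime is a throwaway snippet like the one below (not part of the PR); submit it as a .py job and check the printed output:

```python
# Check whether the current environment is running under IPython,
# the condition this thread hinges on.
try:
    import IPython
except ImportError:
    IPython = None

shell = IPython.get_ipython() if IPython else None
print("IPython importable:", IPython is not None)
print("Active IPython shell:", shell)  # None => the AttributeError path
```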

Author:
I'll try running the code on a newer runtime when I find some free time.

Contributor:
Hi @mle-els, we'd like to get all PRs related to datasets merged soon, now that we're moving our datasets code to a different package (see our Medium blog post for more details).

Do you think you can find time this week? Otherwise, we'll close this PR and ask you to re-open it on the new repo when it's ready.

Author:
This week I'm swamped, unfortunately :( Please feel free to close it.

Contributor:
@mle-els No worries! Feel free to re-open the PR in the kedro-plugins repository when you are free to work on it again. :)

             if dbutils:
                 glob_function = partial(_dbfs_glob, dbutils=dbutils)
                 exists_function = partial(_dbfs_exists, dbutils=dbutils)
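
Off Databricks, the new code path leaves dbutils as None, so the partials above are never bound and the dataset keeps its default globbing. A runnable sketch of that control flow (the iglob default and the stand-in helper are assumptions; the hunk shows only the dbfs branch):

```python
from functools import partial
from glob import iglob


def _dbfs_glob(pattern, dbutils):
    # stand-in for kedro's DBFS-aware glob helper
    return [f.path for f in dbutils.fs.ls(pattern)]


dbutils = None  # what the new try-except yields off Databricks

glob_function = iglob  # assumed default when no dbutils handle exists
if dbutils:
    glob_function = partial(_dbfs_glob, dbutils=dbutils)

print(glob_function.__name__)  # "iglob", matching the new test's assertion
```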
10 changes: 10 additions & 0 deletions tests/extras/datasets/spark/test_spark_dataset.py
@@ -614,6 +614,16 @@ def test_dbfs_exists(self, mocker):
         dbutils_mock.fs.ls.side_effect = Exception()
         assert not _dbfs_exists(test_path, dbutils_mock)

+    def test_ds_init_get_dbutils_raises_exception(self, mocker):
+        get_dbutils_mock = mocker.Mock()
+        get_dbutils_mock.side_effect = AttributeError
+        get_dbutils_mock = mocker.patch(
+            "kedro.extras.datasets.spark.spark_dataset._get_dbutils", get_dbutils_mock
+        )
+
+        data_set = SparkDataSet(filepath="/dbfs/tmp/data")
+        assert data_set._glob_function.__name__ == "iglob"
+
Comment on lines +617 to +626
Contributor:
The test and the assertion don't seem to match here. This would also become obsolete if the try-except is moved to _get_dbutils, so it would need some modification; compare the existing test in the trailing context below:

     def test_ds_init_no_dbutils(self, mocker):
         get_dbutils_mock = mocker.patch(
             "kedro.extras.datasets.spark.spark_dataset._get_dbutils", return_value=None
         )