safer use of "/dbfs" #1931
Status: Closed
Changes from all commits (22 commits):

- 0941892 safer use of "/dbfs" (mle-els)
- 350d4ef fix broken link (#1950) (noklam)
- 3aa25d2 Update dependabot.yml config (#1938) (SajidAlamQB)
- 2b10303 Update setup.py Jinja2 dependencies (#1954) (noklam)
- 7972d36 Update pip-tools requirement from ~=6.5 to ~=6.9 in /dependency (#1957) (dependabot[bot])
- d3bdbbe Update toposort requirement from ~=1.5 to ~=1.7 in /dependency (#1956) (dependabot[bot])
- 59895ea Add deprecation warning to package_name argument in session create() … (merelcht)
- b280494 Remove redundant `resolve_load_version` call (#1911) (noklam)
- ba546f9 Make docstring in test starter match real starters (#1916) (deepyaman)
- 9fc83cd Add show-docs command to Makefile (#1959) (stichbury)
- 1730e9c Enable `TensorFlowModelDataset` to overwrite existing model, and add … (williamcaicedo)
- 723cb2d make catching narrower (mle-els)
- 411b145 safer use of "/dbfs" (mle-els)
- f4e6710 make catching narrower (mle-els)
- 00e90bb safer use of "/dbfs" (mle-els)
- 5bad159 make catching narrower (mle-els)
- 6c744e7 Merge branch 'patch-1' of github.com:mle-els/kedro into patch-1 (merelcht)
- 9d2f8c1 Merge branch 'main' into patch-1 (merelcht)
- 53182a5 add test, release note (mle-els)
- f4088d6 Merge branch 'patch-1' of github.com:mle-els/kedro into patch-1 (mle-els)
- 6e6da4b Fix lint (merelcht)
- c610485 Merge branch 'main' into patch-1 (merelcht)
```diff
@@ -614,6 +614,16 @@ def test_dbfs_exists(self, mocker):
         dbutils_mock.fs.ls.side_effect = Exception()
         assert not _dbfs_exists(test_path, dbutils_mock)

+    def test_ds_init_get_dbutils_raises_exception(self, mocker):
+        get_dbutils_mock = mocker.Mock()
+        get_dbutils_mock.side_effect = AttributeError
+        get_dbutils_mock = mocker.patch(
+            "kedro.extras.datasets.spark.spark_dataset._get_dbutils", get_dbutils_mock
+        )
+
+        data_set = SparkDataSet(filepath="/dbfs/tmp/data")
+        assert data_set._glob_function.__name__ == "iglob"
+
     def test_ds_init_no_dbutils(self, mocker):
         get_dbutils_mock = mocker.patch(
             "kedro.extras.datasets.spark.spark_dataset._get_dbutils", return_value=None
```

Review comment on lines +617 to +626: The test and the assertion don't seem to match here. This would be obsoleted if the …
I am not sure I understand the root cause entirely. Is this a bug in the Databricks `pyspark.dbutils` module, or is it because we check the filepath too eagerly in kedro? The `_get_dbutils` function is supposed to try getting the dbutils aggressively; if it can't, it just returns `None`. This solution of adding yet another try-catch layer outside is a bit hacky, but maybe necessary in this case? I want to make sure I understand the problem before I come to a conclusion. Would it be better to have this try-except block inside the `_get_dbutils` function, if necessary?
I believe it's a bug in Databricks code. It assumes that `IPython.get_ipython()` returns an object; when it happens to return `None`, we get an `AttributeError`. I think having the try-except block inside `_get_dbutils` is a better solution indeed. Thanks for pointing that out!
Is this `pyspark.dbutils` module only available on the Databricks runtime? If so, I think that's why it assumes you have `IPython`. Also, you mentioned you are running on Databricks but not in a managed way; I am not aware of an on-premise option, so what kind of environment are you running on?
I was trying to run on a normal Databricks cluster, just with MLflow instead of a notebook. I managed to run pipelines via a notebook too, but it would have been better to do it through the command line. When I run `mlflow run`, MLflow packages my project into a zip file, sends it to a new Databricks cluster, and runs it there. Apparently, because it's not in a notebook, there's no IPython. If you think this use case is worth supporting, I can make the change you proposed.
@mle-els I prefer moving the try-except block to `_get_dbutils`. For `IPython` I am unsure: even when running a `.py` file it will normally have `IPython`, but Databricks doesn't document this.
@mle-els I won't be able to test it myself since I don't have the environment configured. My guess is that you have a relatively old Databricks runtime. We tested recently with `dbx`, which packages up a project and runs it as a Databricks Job, and IPython was available in that case. The page below suggests Databricks runtime 11 and above always runs on IPython; it mentions notebooks only, but when we tested a couple of months ago the same held for `.py` files.
https://docs.databricks.com/notebooks/ipython-kernel.html#how-to-use-the-ipython-kernel-with-databricks
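The runtime question above (whether IPython is present when running a plain `.py` file) can be probed directly. `IPython.get_ipython` is a real API; under a non-interactive interpreter it returns `None`:

```python
def running_under_ipython():
    """True only when an IPython interactive shell is currently active."""
    try:
        from IPython import get_ipython
    except ImportError:
        return False  # IPython is not installed at all
    # get_ipython() returns the active shell object, or None when the
    # process is a plain Python interpreter rather than an IPython one.
    return get_ipython() is not None


print(running_under_ipython())
```

Running this inside a Databricks job would settle whether the runtime provides an active IPython shell for `.py` entry points.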
I'll try running the code on a newer runtime when I find some free time.
Hi @mle-els, we'd like to get all PRs related to datasets merged soon, now that we're moving our datasets code to a different package (see our Medium blog post for more details). Do you think you can find time this week? Otherwise, we'll close this PR and ask you to re-open it on the new repo when it's ready.
This week I'm swamped, unfortunately :( Please feel free to close it.
@mle-els No worries! Feel free to re-open the PR in the `kedro-plugins` repository when you are free to work on it again. :)