
Add warning when SparkDataSet is used on Databricks without a valid file path #114

Merged · 27 commits · Mar 6, 2023

Changes from 19 commits

Commits:
6e5373e  Add databricks deployment check and automatic DBFS path addition (jmholzer, Feb 3, 2023)
bdb6958  Add newline at end of file (jmholzer, Feb 3, 2023)
50c72d3  Remove spurious 'not' (jmholzer, Feb 6, 2023)
6fc8ca6  Move dbfs utility functions from SparkDataSet (jmholzer, Feb 6, 2023)
868913c  Add edge case logic to _build_dbfs_path (jmholzer, Feb 6, 2023)
85c5862  Add test for dbfs path construction (jmholzer, Feb 6, 2023)
7f5bcda  Linting (jmholzer, Feb 6, 2023)
f6f60c1  Remove spurious print statement :) (jmholzer, Feb 6, 2023)
da9bb60  Add pylint disable too-many-public-methods (jmholzer, Feb 6, 2023)
8d4a7f2  Merge branch 'main' into fix/versioned-spark-dataset (jmholzer, Feb 7, 2023)
3d5116c  Move tests into single method to appease linter (jmholzer, Feb 15, 2023)
d43a53f  Modify prefix check to /dbfs/ (jmholzer, Feb 15, 2023)
ee43277  Modify prefix check to /dbfs/ (jmholzer, Feb 15, 2023)
8390ef7  Merge branch 'fix/versioned-spark-dataset' of github.com:kedro-org/ke… (jmholzer, Feb 15, 2023)
271c76c  Make warning message clearer (jmholzer, Feb 15, 2023)
c1cce42  Add release note (jmholzer, Feb 15, 2023)
40b90f7  Fix linting (jmholzer, Feb 15, 2023)
086fa08  Update warning message (jmholzer, Feb 15, 2023)
307f8c4  Modify log warning level to error (jmholzer, Feb 15, 2023)
d3a0a80  Modify message back to warning, refer to undefined behaviour (jmholzer, Feb 16, 2023)
c349fa8  Modify required prefix to /dbfs/ (jmholzer, Feb 16, 2023)
d882b7b  Modify doc string (jmholzer, Feb 16, 2023)
44b16f5  Modify warning message (jmholzer, Feb 16, 2023)
017d3fa  Split tests and add filepath to warning (jmholzer, Feb 20, 2023)
b448297  Modify f string in logging call (jmholzer, Feb 20, 2023)
37d1bdd  Fix tests (jmholzer, Feb 20, 2023)
8c5f92c  Lint (jmholzer, Feb 20, 2023)
3 changes: 3 additions & 0 deletions kedro-datasets/RELEASE.md
@@ -1,6 +1,9 @@

# Upcoming Release:

## Bug fixes and other changes
* Added a warning when the user tries to use `SparkDataSet` on Databricks without specifying a file path with the `/dbfs/` prefix.

# Release 1.0.2:

## Bug fixes and other changes
24 changes: 23 additions & 1 deletion kedro-datasets/kedro_datasets/spark/spark_dataset.py
@@ -2,6 +2,8 @@
``pyspark``
"""
import json
import logging
import os
from copy import deepcopy
from fnmatch import fnmatch
from functools import partial
@@ -23,6 +25,8 @@
from pyspark.sql.utils import AnalysisException
from s3fs import S3FileSystem

logger = logging.getLogger(__name__)


def _parse_glob_pattern(pattern: str) -> str:
    special = ("*", "?", "[")
@@ -114,6 +118,20 @@ def _dbfs_exists(pattern: str, dbutils: Any) -> bool:
    return False


def _deployed_on_databricks() -> bool:
"""Check if running on Databricks."""

Contributor Author (jmholzer):

Do you mean we should add a function to Kedro in a separate PR and import that here?

Contributor:

Yes, since they are doing the exact same thing, we should just keep the logic consistent. If this is a more pressing bug fix, I think it's fine to keep this in the PR, but we should still refactor it in Kedro later.

Contributor:

While I am generally very much pro-DRY, in this case I would prefer we just kept this duplicated, since:

  • it's not clear where such a function would live, and presumably it would be a private function, in which case kedro-datasets would become coupled to a private function in kedro core;
  • it's a very small and simple piece of logic which is unlikely to change any time soon;
  • the "check if we are on databricks" logic in kedro core is almost certainly going to move anyway as part of "Introduce our own rich console logging handler" (kedro#2277).

Overall, I'd go for the principle that "duplication is cheaper than getting it wrong" here and so just leave it duplicated.

return "DATABRICKS_RUNTIME_VERSION" in os.environ


def _path_has_dbfs_prefix(path: str) -> bool:
"""Check if a file path has a valid dbfs prefix.

Args:
path: File path to check.
"""
return path.startswith("/dbfs/")
antonymilne marked this conversation as resolved.
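As a quick illustration of the check's behaviour (a sketch only; the example paths are made up and the helper is the one defined above), note that only the full /dbfs/ prefix counts:

    # Lookalike prefixes and Spark-style URIs are rejected by design.
    assert _path_has_dbfs_prefix("/dbfs/my_project/data")
    assert not _path_has_dbfs_prefix("/dbfs-mount/my_project/data")  # "/dbfs" alone is not enough
    assert not _path_has_dbfs_prefix("dbfs:/my_project/data")  # Spark URI, not a POSIX path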


class KedroHdfsInsecureClient(InsecureClient):
"""Subclasses ``hdfs.InsecureClient`` and implements ``hdfs_exists``
and ``hdfs_glob`` methods required by ``SparkDataSet``"""
@@ -304,7 +322,11 @@ def __init__(  # pylint: disable=too-many-arguments

        else:
            path = PurePosixPath(filepath)

        if _deployed_on_databricks() and not _path_has_dbfs_prefix(filepath):
            logger.error(
                "Using SparkDataSet on Databricks without the `/dbfs` prefix in the "
                "filepath raises an error. Add this prefix to fix the error."
            )
Contributor:

Should this be a DataSetError, since this is considered an invalid config? What kind of error are we getting now? Is it some FileNotFound error due to an unexpected path?

Contributor Author (jmholzer):

A DataSetError gets raised on trying to save with versioning enabled, but it is not clear from the message why this is. This message is supposed to inform the user. Do you think we should raise an error with a clearer message instead?

Contributor:

Yes, I think if this is an invalid code path and we know that it's going to fail, we should raise the error earlier. What's the point of delaying until later?

Contributor Author (jmholzer):

I changed this back to a warning in d3a0a80 and refer to 'undefined behaviour' instead.

I also modified the docstring to make it clearer to users that they need to specify the /dbfs/ prefix.

Contributor Author (jmholzer):

> Yes, I think if this is an invalid code path and we know that it's going to fail, we should raise the error earlier. What's the point of delaying until later?

Ah sorry, I didn't see this reply before I wrote my last message.

I would agree with you, but there is one subtlety: if you specify a version in SparkDataSet, file path resolution will work. I wouldn't like to raise an error here, as this would be a breaking change for the users who do specify a version. Currently, it is only the automatic resolution of the latest version that is broken.

The alternative would be to add another condition to the if statement, checking whether a version has been specified and raising an error only in the case that it wasn't. However, I don't like this, as it leads the user into thinking it is OK to leave out the /dbfs/ prefix when a version is specified. As far as I understand, they should always be using this prefix. Therefore, I would rather display a warning whenever a SparkDataSet is constructed incorrectly, in every case.

Contributor:

I see, in that case I agree we should not break the existing behaviors and I think the current solution with warning is fine.
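For illustration, a minimal standalone sketch of the version-conditional alternative discussed (and rejected) above. The function name check_dbfs_filepath is hypothetical, the helpers are repeated so the snippet runs on its own, and ValueError stands in for the DataSetError the real code base would use:

    import os
    from typing import Optional

    def _deployed_on_databricks() -> bool:
        return "DATABRICKS_RUNTIME_VERSION" in os.environ

    def _path_has_dbfs_prefix(path: str) -> bool:
        return path.startswith("/dbfs/")

    def check_dbfs_filepath(filepath: str, version: Optional[str]) -> None:
        # Raise only when automatic latest-version resolution would break:
        # on Databricks, without the /dbfs/ prefix, and with no explicit
        # version given.
        if (
            _deployed_on_databricks()
            and not _path_has_dbfs_prefix(filepath)
            and version is None
        ):
            raise ValueError(
                f"Cannot resolve the latest version of '{filepath}' on "
                "Databricks without the `/dbfs/` prefix."
            )

The PR deliberately avoids this shape: warning unconditionally means users are never led to believe the prefix is optional just because a version is given.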

        if filepath.startswith("/dbfs"):
            dbutils = _get_dbutils(self._get_spark())
            if dbutils:
21 changes: 21 additions & 0 deletions kedro-datasets/tests/spark/test_spark_dataset.py
@@ -1,3 +1,4 @@
# pylint: disable=too-many-lines
import re
import sys
import tempfile
@@ -161,6 +162,7 @@ def isDir(self):
        return "." not in self.path.split("/")[-1]


# pylint: disable=too-many-public-methods
class TestSparkDataSet:
    def test_load_parquet(self, tmp_path, sample_pandas_df):
        temp_path = (tmp_path / "data").as_posix()
@@ -440,6 +442,25 @@ def test_copy(self):
        assert spark_dataset_copy._file_format == "csv"
        assert spark_dataset_copy._save_args == {"mode": "overwrite"}

    def test_dbfs_prefix_warning(self, monkeypatch, caplog):
Contributor:

A bit fussy, but I think this should be 3 separate tests or 3 separate cases of a parameterised test. Even though they're small, each assert tests a different case here.

Contributor Author (jmholzer):

You are right about this, though the linter complains that the file is long when these verbose tests are added (I originally had three separate tests). I would turn the warning off, but honestly I am a bit tired of ignoring the linter. There's not much point to using it if we ignore it every time it tells us off. My solution would be to add this slimmer method for now as this is quite urgent, then open an issue to split the test file. What do you think?

Contributor Author (jmholzer):

Hmm, a better solution would be to split this back into three tests and add an ignore directive, then split the file in a new issue and remove the ignore directive 🤔. This is what I will do.

Contributor:

@jmholzer agree with this. Maybe we should just add more global ignores to the linter configuration for tests to make it less strict if you're finding that you continually need to add inline ignores.
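For context, such a global ignore might look like the following in a .pylintrc (a sketch only; the two rules are the ones this PR suppresses inline, and whether to relax them repo-wide is exactly the judgement call above):

    [MESSAGES CONTROL]
    # Rules the test suite currently suppresses with inline directives:
    disable=too-many-lines,
            too-many-public-methods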

        expected_message = (
            "Using SparkDataSet on Databricks without the `/dbfs` prefix in the "
            "filepath raises an error. Add this prefix to fix the error."
        )

        # test that warning is not raised when not on Databricks
        SparkDataSet(filepath="my_project/data/02_intermediate/processed_data")
        assert expected_message not in caplog.text

        # test that warning is not raised when on Databricks and filepath has /dbfs prefix
        monkeypatch.setenv("DATABRICKS_RUNTIME_VERSION", "7.3")
        SparkDataSet(filepath="/dbfs/my_project/data/02_intermediate/processed_data")
        assert expected_message not in caplog.text

        # test that warning is raised when on Databricks and filepath does not have /dbfs prefix
        SparkDataSet(filepath="my_project/data/02_intermediate/processed_data")
        assert expected_message in caplog.text
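For reference, a minimal sketch of the parameterised variant suggested in the review thread above. It reuses the SparkDataSet import and message wording from this test file; the parameter names are made up, and inside the test class the function would also take self:

    import pytest

    @pytest.mark.parametrize(
        "on_databricks,filepath,warning_expected",
        [
            # not on Databricks: no warning
            (False, "my_project/data/02_intermediate/processed_data", False),
            # on Databricks, filepath has the /dbfs/ prefix: no warning
            (True, "/dbfs/my_project/data/02_intermediate/processed_data", False),
            # on Databricks, filepath lacks the /dbfs/ prefix: warning
            (True, "my_project/data/02_intermediate/processed_data", True),
        ],
    )
    def test_dbfs_prefix_warning(monkeypatch, caplog, on_databricks, filepath, warning_expected):
        expected_message = (
            "Using SparkDataSet on Databricks without the `/dbfs` prefix in the "
            "filepath raises an error. Add this prefix to fix the error."
        )
        if on_databricks:
            monkeypatch.setenv("DATABRICKS_RUNTIME_VERSION", "7.3")
        SparkDataSet(filepath=filepath)
        assert (expected_message in caplog.text) == warning_expected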


class TestSparkDataSetVersionedLocal:
    def test_no_version(self, versioned_dataset_local):