Add warning when `SparkDataSet` is used on Databricks without a valid file path #114

jmholzer · 2023-02-03T17:21:25Z

Signed-off-by: Jannic Holzer jannic.holzer@quantumblack.com

Description

Resolves #121.

~~Adds a guard that adds the prefix '/dbfs' to the file path passed into SparkDataSet in the case that it is running on DataBricks and has been omitted by the user.~~

Without this, the _exists_function attribute of SparkDataSet will default to _local_exists (defined in kedro.io.core.py), which will fail to resolve any data on the node that is running the code. This is because data is not stored locally on these nodes, it is stored in DBFS instead.

Add a warning that versioning will not work when a user attempts to use a SparkDataSet with a filepath not prefixed with /dbfs.

Development notes

Added two simple checks too see if:

SparkDataSet is running on Databricks.
filepath is prefixed with /dbfs

If both are true, log a warning to the user.

Checklist

Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the relevant RELEASE.md file
Added tests to cover my changes

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

antonymilne

This is an excellent fix ⭐ A couple of things I want to ask about in our chat, but if this works with our pandas-iris starter then I'm very happy with it 👍

other spark datasets
SKIP_IF_NO_SPARK
local_exists
other dbfs stuff

antonymilne · 2023-02-15T07:34:52Z

kedro-datasets/tests/spark/test_spark_dataset.py

+    def test_dbfs_filepath_prefix(self, filepath, mocker):
+        mocker.patch.dict(os.environ, {"DATABRICKS_RUNTIME_VERSION": "7.3"})


Suggested change

def test_dbfs_filepath_prefix(self, filepath, mocker):

mocker.patch.dict(os.environ, {"DATABRICKS_RUNTIME_VERSION": "7.3"})

def test_dbfs_filepath_prefix(self, filepath, monkeypatch):

monkeypatch.setenv( "DATABRICKS_RUNTIME_VERSION", "7.3")

Just for consistency with other tests (see e.g. test_rich_traceback_disabled_on_databricks and several others).

Thanks, made this change.

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

…dro-plugins into fix/versioned-spark-dataset Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

noklam · 2023-02-16T08:02:27Z

kedro-datasets/kedro_datasets/spark/spark_dataset.py

@@ -114,6 +118,20 @@ def _dbfs_exists(pattern: str, dbutils: Any) -> bool:
        return False


+def _deployed_on_databricks() -> bool:
+    """Check if running on Databricks."""


https://github.com/kedro-org/kedro/blob/1c6d108b56e405364be3cf1d9cfca65fe6606096/kedro/framework/project/__init__.py#LL217-L218C87

This should be a function now to determine whether the project is run in Databricks.

Do you mean we should add a function to Kedro in a separate PR and import that here?

Yes since they are doing the exact same thing, we should just keep the logic consistent. If this is a more pressing bug fix I think it's fine to keep this in the PR but we should still refactor it in kedro later.

While I am generally very much pro-DRY, in this case I would prefer we just kept this duplicated since...

it's not clear where such a function would live, and presumably it would be a private function, in which case kedro-datasets would become coupled to a private function in kedro core

it's a very small and simple piece of logic which is unlikely to change any time soon

the "check if we are on databricks" logic in kedro core is almost definitely going to move anyway as part of Introduce our own rich console logging handler kedro#2277

Overall, I'd go for the principle that "duplication is cheaper than getting it wrong" here and so just leave it duplicated.

noklam · 2023-02-16T08:06:03Z

kedro-datasets/kedro_datasets/spark/spark_dataset.py

+                logger.error(
+                    "Using SparkDataSet on Databricks without the `/dbfs` prefix in the "
+                    "filepath raises an error. Add this prefix to fix the error."
+                )


Should this be a DataSetError since this is considered an invalid config? What kind of error are we getting now, is it some FileNotFound error due to unexpected path?

A DataSetError gets raised on trying to save with versioning enabled, but it is not clear from the message why this is. This message is supposed to inform the user. Do you think we should raise an error with a clearer message instead?

Yes, I think if this is an invalid code path and we know that it's going to fail we should raise the error earlier, what's the point of delaying until later?

I changed this back to a warning in d3a0a80 and refer to 'undefined behaviour' instead.

I also modified the doc string to make it clearer to the users that they need to specify this /dbfs prefix.

Yes, I think if this is an invalid code path and we know that it's going to fail we should raise the error earlier, what's the point of delaying until later?

Ah sorry, I didn't see this reply before I wrote my last message.

I would agree with you but there is one subtlety, which is that if you specify a version in SparkDataSet, file path resolution will work. I wouldn't like to raise an error here as this would be a breaking change for the users who do specify a version. Currently, it is only the automatic resolution of the latest version that is broken.

The alternative would be to add another condition to the if statement, checking whether a version has been specified and only raising an error in the case that it wasn't. However, I don't like this, as it leads the user into thinking it is OK to leave out the /dbfs/ prefix when a version is specified. As far as I understand, they should always be using this prefix. Therefore, I would rather display a warning whenever a SparkDataSet is invoked incorrectly in every case.

I see, in that case I agree we should not break the existing behaviors and I think the current solution with warning is fine.

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

noklam

Thanks for explaining the problem. I am fine with the changes but got a bit confused what was the problem.

Is kedro-org/kedro#1801 an accurate description of what's happening? Why does it works when version is provided, is this also a permission issue about iterating the directory?

antonymilne

This is great, thanks for getting to the bottom of it. I definitely feel more comfortable about the solution than actually changing the filepath 👍

antonymilne · 2023-02-20T09:55:21Z

kedro-datasets/kedro_datasets/spark/spark_dataset.py

@@ -114,6 +118,20 @@ def _dbfs_exists(pattern: str, dbutils: Any) -> bool:
        return False


+def _deployed_on_databricks() -> bool:
+    """Check if running on Databricks."""


While I am generally very much pro-DRY, in this case I would prefer we just kept this duplicated since...

it's not clear where such a function would live, and presumably it would be a private function, in which case kedro-datasets would become coupled to a private function in kedro core

it's a very small and simple piece of logic which is unlikely to change any time soon

the "check if we are on databricks" logic in kedro core is almost definitely going to move anyway as part of Introduce our own rich console logging handler kedro#2277

Overall, I'd go for the principle that "duplication is cheaper than getting it wrong" here and so just leave it duplicated.

kedro-datasets/kedro_datasets/spark/spark_dataset.py

antonymilne · 2023-02-20T09:57:49Z

kedro-datasets/kedro_datasets/spark/spark_dataset.py

+            if _deployed_on_databricks() and not _path_has_dbfs_prefix(filepath):
+                logger.warning(
+                    "Using SparkDataSet on Databricks without the `/dbfs/` prefix in the "
+                    "filepath is a known source of error. You must add this prefix."


Suggested change

"filepath is a known source of error. You must add this prefix."

"filepath is a known source of error. You must add this prefix to the relevant entries in your data catalog."

And maybe even log the filepath as part of the warning here also just to make it super easy for someone to identify and fix the right datasets? Since we don't have access to the names of the problematic datasets.

Good idea, I will add this.

antonymilne · 2023-02-20T09:59:37Z

kedro-datasets/tests/spark/test_spark_dataset.py

@@ -440,6 +442,25 @@ def test_copy(self):
        assert spark_dataset_copy._file_format == "csv"
        assert spark_dataset_copy._save_args == {"mode": "overwrite"}

+    def test_dbfs_prefix_warning(self, monkeypatch, caplog):


A bit fussy, but I think this should be 3 separate tests or 3 separate cases of a parameterised test. Even though they're small, each assert tests a different case here.

You are right about this, though the linter complains that the file is long when these verbose tests are added (I originally had three separate tests). I would turn the warning off, but honestly I am a bit tired of ignoring the linter. There's not much point to using it if we ignore it every time it tells us off. My solution would be to add this slimmer method for now as this is quite urgent, then open an issue to split the test file. What do you think?

Hmm, a better solution would be to split this back into three tests, add an ignore directive. Then split the file in a new issue and remove the ignore directive 🤔. This is what I will do.

@jmholzer agree with this. Maybe we should just add more global ignores to the linter configuration for tests to make it less strict if you're finding that you continually need to add inline ignores.

noklam

Approved since I don't want to block this, the changes looks good to me, just have some questions for clarifications.

jmholzer · 2023-02-20T15:21:34Z

Is kedro-org/kedro#1801 an accurate description of what's happening?

This issue documents the error that users receive when the exists_function cannot resolve the given file path. There are evidently a few ways to trigger this, one of which is by leaving out the /dbfs prefix on Databricks. The user who posted this issue originally experienced the issue on Windows, though we don't have enough information to tell what caused it in that case.

The long-term solution to the general problem looks to be to refactor SparkDataSet. Specifically, we should bring the way that exists_function is determined in line with other datasets (#117). Doing so would resolve many cases in which this error is seen.

I created a new issue to link this PR to. I will leave the issue you mentioned open for now and see where we are after refactoring SparkDataSet.

Why does it works when version is provided?

In the case that a version is provided, the exists_function is not used to resolve a save location. It is this that causes the problem.

is this also a permission issue about iterating the directory?

This could cause the same error in some cases, though in the case of deployment on Databricks this is unlikely to be the problem.

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

… file path (kedro-org#114) * Add databricks deployment check and automatic DBFS path addition Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add newline at end of file Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Remove spurious 'not' Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Move dbfs utility functions from SparkDataSet Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add edge case logic to _build_dbfs_path Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add test for dbfs path construction Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Linting Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Remove spurious print statement :) Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add pylint disable too-many-public-methods Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Move tests into single method to appease linter Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify prefix check to /dbfs/ Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify prefix check to /dbfs/ Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Make warning message clearer Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add release note Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Fix linting Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Update warning message Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify log warning level to error Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify message back to warning, refer to undefined behaviour Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify required prefix to /dbfs/ Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify doc string Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify warning message Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Split tests and add filepath to warning Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify f string in logging call Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Fix tests Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Lint Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> --------- Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> Signed-off-by: Danny Farah <danny_farah@mckinsey.com>

jmholzer added 10 commits February 3, 2023 17:19

Add databricks deployment check and automatic DBFS path addition

6e5373e

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Add newline at end of file

bdb6958

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Remove spurious 'not'

50c72d3

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Move dbfs utility functions from SparkDataSet

6fc8ca6

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Add edge case logic to _build_dbfs_path

868913c

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Add test for dbfs path construction

85c5862

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Linting

7f5bcda

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Remove spurious print statement :)

f6f60c1

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Add pylint disable too-many-public-methods

da9bb60

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Merge branch 'main' into fix/versioned-spark-dataset

8d4a7f2

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

jmholzer requested review from noklam and antonymilne February 10, 2023 15:03

jmholzer marked this pull request as ready for review February 10, 2023 15:29

antonymilne approved these changes Feb 15, 2023

View reviewed changes

jmholzer added 9 commits February 15, 2023 19:04

Move tests into single method to appease linter

3d5116c

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Modify prefix check to /dbfs/

d43a53f

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Modify prefix check to /dbfs/

ee43277

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Merge branch 'fix/versioned-spark-dataset' of github.com:kedro-org/ke…

8390ef7

…dro-plugins into fix/versioned-spark-dataset Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Make warning message clearer

271c76c

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Add release note

c1cce42

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Fix linting

40b90f7

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Update warning message

086fa08

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Modify log warning level to error

307f8c4

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

noklam reviewed Feb 16, 2023

View reviewed changes

jmholzer added 4 commits February 16, 2023 10:17

Modify message back to warning, refer to undefined behaviour

d3a0a80

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Modify required prefix to /dbfs/

c349fa8

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Modify doc string

d882b7b

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Modify warning message

44b16f5

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

jmholzer changed the title ~~Fix SparkDataSet not resolving file paths when deployed on Databricks~~ Add warning when SparkDataSet is used on Databricks without a valid file path Feb 16, 2023

antonymilne requested a review from noklam February 17, 2023 09:28

antonymilne self-requested a review February 17, 2023 09:28

noklam reviewed Feb 17, 2023

View reviewed changes

antonymilne approved these changes Feb 20, 2023

View reviewed changes

noklam approved these changes Feb 20, 2023

View reviewed changes

jmholzer added 4 commits February 20, 2023 16:33

Split tests and add filepath to warning

017d3fa

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Modify f string in logging call

b448297

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Fix tests

37d1bdd

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Lint

8c5f92c

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

jmholzer merged commit 1aae45c into main Mar 6, 2023

jmholzer deleted the fix/versioned-spark-dataset branch March 6, 2023 12:58

This was referenced Mar 6, 2023

DBFS documentation kedro-org/kedro#2385

Merged

Add databricks environment to pyspark-iris starter kedro-org/kedro-starters#119

Merged

noklam mentioned this pull request Apr 3, 2023

Versioned does not work for spark.SparkDataSet kedro-org/kedro#1801

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add warning when `SparkDataSet` is used on Databricks without a valid file path #114

Add warning when `SparkDataSet` is used on Databricks without a valid file path #114

jmholzer commented Feb 3, 2023 •

edited

Loading

antonymilne left a comment

antonymilne Feb 15, 2023

jmholzer Feb 15, 2023

noklam Feb 16, 2023

jmholzer Feb 16, 2023

noklam Feb 16, 2023

antonymilne Feb 20, 2023

noklam Feb 16, 2023

jmholzer Feb 16, 2023

noklam Feb 16, 2023

jmholzer Feb 16, 2023

jmholzer Feb 16, 2023

noklam Feb 16, 2023

noklam left a comment

antonymilne left a comment

antonymilne Feb 20, 2023

antonymilne Feb 20, 2023

jmholzer Feb 20, 2023

antonymilne Feb 20, 2023

jmholzer Feb 20, 2023

jmholzer Feb 20, 2023

antonymilne Feb 22, 2023

noklam left a comment

jmholzer commented Feb 20, 2023 •

edited

Loading

		def test_dbfs_filepath_prefix(self, filepath, mocker):
		mocker.patch.dict(os.environ, {"DATABRICKS_RUNTIME_VERSION": "7.3"})

	"filepath is a known source of error. You must add this prefix."
	"filepath is a known source of error. You must add this prefix to the relevant entries in your data catalog."

Add warning when SparkDataSet is used on Databricks without a valid file path #114

Add warning when SparkDataSet is used on Databricks without a valid file path #114

Conversation

jmholzer commented Feb 3, 2023 • edited Loading

Description

Development notes

Checklist

antonymilne left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

noklam left a comment

Choose a reason for hiding this comment

antonymilne left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

noklam left a comment

Choose a reason for hiding this comment

jmholzer commented Feb 20, 2023 • edited Loading

Add warning when `SparkDataSet` is used on Databricks without a valid file path #114

Add warning when `SparkDataSet` is used on Databricks without a valid file path #114

jmholzer commented Feb 3, 2023 •

edited

Loading

jmholzer commented Feb 20, 2023 •

edited

Loading