Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add warning when SparkDataSet is used on Databricks without a valid file path #114

Merged
merged 27 commits into from
Mar 6, 2023

Conversation

jmholzer
Copy link
Contributor

@jmholzer jmholzer commented Feb 3, 2023

Signed-off-by: Jannic Holzer jannic.holzer@quantumblack.com

Description

Resolves #121.

Adds a guard that adds the prefix '/dbfs' to the file path passed into SparkDataSet in the case that it is running on DataBricks and has been omitted by the user.

Without this, the _exists_function attribute of SparkDataSet will default to _local_exists (defined in kedro.io.core.py), which will fail to resolve any data on the node that is running the code. This is because data is not stored locally on these nodes, it is stored in DBFS instead.

Add a warning that versioning will not work when a user attempts to use a SparkDataSet with a filepath not prefixed with /dbfs.

Development notes

Added two simple checks too see if:

  1. SparkDataSet is running on Databricks.
  2. filepath is prefixed with /dbfs

If both are true, log a warning to the user.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
@jmholzer jmholzer marked this pull request as ready for review February 10, 2023 15:29
Copy link
Contributor

@antonymilne antonymilne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an excellent fix ⭐ A couple of things I want to ask about in our chat, but if this works with our pandas-iris starter then I'm very happy with it 👍

  • other spark datasets
  • SKIP_IF_NO_SPARK
  • local_exists
  • other dbfs stuff

Comment on lines 446 to 447
def test_dbfs_filepath_prefix(self, filepath, mocker):
mocker.patch.dict(os.environ, {"DATABRICKS_RUNTIME_VERSION": "7.3"})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_dbfs_filepath_prefix(self, filepath, mocker):
mocker.patch.dict(os.environ, {"DATABRICKS_RUNTIME_VERSION": "7.3"})
def test_dbfs_filepath_prefix(self, filepath, monkeypatch):
monkeypatch.setenv( "DATABRICKS_RUNTIME_VERSION", "7.3")

Just for consistency with other tests (see e.g. test_rich_traceback_disabled_on_databricks and several others).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, made this change.

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
…dro-plugins into fix/versioned-spark-dataset

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
@@ -114,6 +118,20 @@ def _dbfs_exists(pattern: str, dbutils: Any) -> bool:
return False


def _deployed_on_databricks() -> bool:
"""Check if running on Databricks."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we should add a function to Kedro in a separate PR and import that here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes since they are doing the exact same thing, we should just keep the logic consistent. If this is a more pressing bug fix I think it's fine to keep this in the PR but we should still refactor it in kedro later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I am generally very much pro-DRY, in this case I would prefer we just kept this duplicated since...

  • it's not clear where such a function would live, and presumably it would be a private function, in which case kedro-datasets would become coupled to a private function in kedro core
  • it's a very small and simple piece of logic which is unlikely to change any time soon
  • the "check if we are on databricks" logic in kedro core is almost definitely going to move anyway as part of Introduce our own rich console logging handler kedro#2277

Overall, I'd go for the principle that "duplication is cheaper than getting it wrong" here and so just leave it duplicated.

Comment on lines 326 to 329
logger.error(
"Using SparkDataSet on Databricks without the `/dbfs` prefix in the "
"filepath raises an error. Add this prefix to fix the error."
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a DataSetError since this is considered an invalid config? What kind of error are we getting now, is it some FileNotFound error due to unexpected path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A DataSetError gets raised on trying to save with versioning enabled, but it is not clear from the message why this is. This message is supposed to inform the user. Do you think we should raise an error with a clearer message instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think if this is an invalid code path and we know that it's going to fail we should raise the error earlier, what's the point of delaying until later?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this back to a warning in d3a0a80 and refer to 'undefined behaviour' instead.

I also modified the doc string to make it clearer to the users that they need to specify this /dbfs prefix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think if this is an invalid code path and we know that it's going to fail we should raise the error earlier, what's the point of delaying until later?

Ah sorry, I didn't see this reply before I wrote my last message.

I would agree with you but there is one subtlety, which is that if you specify a version in SparkDataSet, file path resolution will work. I wouldn't like to raise an error here as this would be a breaking change for the users who do specify a version. Currently, it is only the automatic resolution of the latest version that is broken.

The alternative would be to add another condition to the if statement, checking whether a version has been specified and only raising an error in the case that it wasn't. However, I don't like this, as it leads the user into thinking it is OK to leave out the /dbfs/ prefix when a version is specified. As far as I understand, they should always be using this prefix. Therefore, I would rather display a warning whenever a SparkDataSet is invoked incorrectly in every case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, in that case I agree we should not break the existing behaviors and I think the current solution with warning is fine.

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
@jmholzer jmholzer changed the title Fix SparkDataSet not resolving file paths when deployed on Databricks Add warning when SparkDataSet is used on Databricks without a valid file path Feb 16, 2023
@antonymilne antonymilne self-requested a review February 17, 2023 09:28
Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining the problem. I am fine with the changes but got a bit confused what was the problem.

Is kedro-org/kedro#1801 an accurate description of what's happening? Why does it works when version is provided, is this also a permission issue about iterating the directory?

Copy link
Contributor

@antonymilne antonymilne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks for getting to the bottom of it. I definitely feel more comfortable about the solution than actually changing the filepath 👍

@@ -114,6 +118,20 @@ def _dbfs_exists(pattern: str, dbutils: Any) -> bool:
return False


def _deployed_on_databricks() -> bool:
"""Check if running on Databricks."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I am generally very much pro-DRY, in this case I would prefer we just kept this duplicated since...

  • it's not clear where such a function would live, and presumably it would be a private function, in which case kedro-datasets would become coupled to a private function in kedro core
  • it's a very small and simple piece of logic which is unlikely to change any time soon
  • the "check if we are on databricks" logic in kedro core is almost definitely going to move anyway as part of Introduce our own rich console logging handler kedro#2277

Overall, I'd go for the principle that "duplication is cheaper than getting it wrong" here and so just leave it duplicated.

if _deployed_on_databricks() and not _path_has_dbfs_prefix(filepath):
logger.warning(
"Using SparkDataSet on Databricks without the `/dbfs/` prefix in the "
"filepath is a known source of error. You must add this prefix."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"filepath is a known source of error. You must add this prefix."
"filepath is a known source of error. You must add this prefix to the relevant entries in your data catalog."

And maybe even log the filepath as part of the warning here also just to make it super easy for someone to identify and fix the right datasets? Since we don't have access to the names of the problematic datasets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I will add this.

@@ -440,6 +442,25 @@ def test_copy(self):
assert spark_dataset_copy._file_format == "csv"
assert spark_dataset_copy._save_args == {"mode": "overwrite"}

def test_dbfs_prefix_warning(self, monkeypatch, caplog):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit fussy, but I think this should be 3 separate tests or 3 separate cases of a parameterised test. Even though they're small, each assert tests a different case here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right about this, though the linter complains that the file is long when these verbose tests are added (I originally had three separate tests). I would turn the warning off, but honestly I am a bit tired of ignoring the linter. There's not much point to using it if we ignore it every time it tells us off. My solution would be to add this slimmer method for now as this is quite urgent, then open an issue to split the test file. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, a better solution would be to split this back into three tests, add an ignore directive. Then split the file in a new issue and remove the ignore directive 🤔. This is what I will do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmholzer agree with this. Maybe we should just add more global ignores to the linter configuration for tests to make it less strict if you're finding that you continually need to add inline ignores.

Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved since I don't want to block this, the changes looks good to me, just have some questions for clarifications.

@jmholzer
Copy link
Contributor Author

jmholzer commented Feb 20, 2023

Is kedro-org/kedro#1801 an accurate description of what's happening?

This issue documents the error that users receive when the exists_function cannot resolve the given file path. There are evidently a few ways to trigger this, one of which is by leaving out the /dbfs prefix on Databricks. The user who posted this issue originally experienced the issue on Windows, though we don't have enough information to tell what caused it in that case.

The long-term solution to the general problem looks to be to refactor SparkDataSet. Specifically, we should bring the way that exists_function is determined in line with other datasets (#117). Doing so would resolve many cases in which this error is seen.

I created a new issue to link this PR to. I will leave the issue you mentioned open for now and see where we are after refactoring SparkDataSet.

Why does it works when version is provided?

In the case that a version is provided, the exists_function is not used to resolve a save location. It is this that causes the problem.

is this also a permission issue about iterating the directory?

This could cause the same error in some cases, though in the case of deployment on Databricks this is unlikely to be the problem.

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
@jmholzer jmholzer merged commit 1aae45c into main Mar 6, 2023
@jmholzer jmholzer deleted the fix/versioned-spark-dataset branch March 6, 2023 12:58
dannyrfar pushed a commit to dannyrfar/kedro-plugins that referenced this pull request Mar 13, 2023
… file path (kedro-org#114)

* Add databricks deployment check and automatic DBFS path addition

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add newline at end of file

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove spurious 'not'

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Move dbfs utility functions from SparkDataSet

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add edge case logic to _build_dbfs_path

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add test for dbfs path construction

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Linting

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove spurious print statement :)

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add pylint disable too-many-public-methods

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Move tests into single method to appease linter

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify prefix check to /dbfs/

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify prefix check to /dbfs/

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Make warning message clearer

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add release note

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fix linting

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update warning message

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify log warning level to error

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify message back to warning, refer to undefined behaviour

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify required prefix to /dbfs/

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify doc string

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify warning message

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Split tests and add filepath to warning

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify f string in logging call

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fix tests

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Lint

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

---------

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
dannyrfar pushed a commit to dannyrfar/kedro-plugins that referenced this pull request Mar 21, 2023
… file path (kedro-org#114)

* Add databricks deployment check and automatic DBFS path addition

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add newline at end of file

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove spurious 'not'

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Move dbfs utility functions from SparkDataSet

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add edge case logic to _build_dbfs_path

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add test for dbfs path construction

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Linting

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove spurious print statement :)

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add pylint disable too-many-public-methods

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Move tests into single method to appease linter

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify prefix check to /dbfs/

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify prefix check to /dbfs/

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Make warning message clearer

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add release note

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fix linting

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update warning message

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify log warning level to error

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify message back to warning, refer to undefined behaviour

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify required prefix to /dbfs/

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify doc string

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify warning message

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Split tests and add filepath to warning

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify f string in logging call

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fix tests

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Lint

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

---------

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
dannyrfar pushed a commit to dannyrfar/kedro-plugins that referenced this pull request Mar 21, 2023
… file path (kedro-org#114)

* Add databricks deployment check and automatic DBFS path addition

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add newline at end of file

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove spurious 'not'

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Move dbfs utility functions from SparkDataSet

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add edge case logic to _build_dbfs_path

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add test for dbfs path construction

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Linting

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove spurious print statement :)

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add pylint disable too-many-public-methods

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Move tests into single method to appease linter

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify prefix check to /dbfs/

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify prefix check to /dbfs/

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Make warning message clearer

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add release note

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fix linting

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update warning message

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify log warning level to error

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify message back to warning, refer to undefined behaviour

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify required prefix to /dbfs/

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify doc string

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify warning message

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Split tests and add filepath to warning

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify f string in logging call

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fix tests

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Lint

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

---------

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
dannyrfar pushed a commit to dannyrfar/kedro-plugins that referenced this pull request Mar 21, 2023
… file path (kedro-org#114)

* Add databricks deployment check and automatic DBFS path addition

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add newline at end of file

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove spurious 'not'

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Move dbfs utility functions from SparkDataSet

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add edge case logic to _build_dbfs_path

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add test for dbfs path construction

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Linting

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove spurious print statement :)

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add pylint disable too-many-public-methods

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Move tests into single method to appease linter

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify prefix check to /dbfs/

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify prefix check to /dbfs/

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Make warning message clearer

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add release note

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fix linting

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update warning message

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify log warning level to error

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify message back to warning, refer to undefined behaviour

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify required prefix to /dbfs/

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify doc string

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify warning message

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Split tests and add filepath to warning

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify f string in logging call

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fix tests

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Lint

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

---------

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
dannyrfar pushed a commit to dannyrfar/kedro-plugins that referenced this pull request Mar 21, 2023
… file path (kedro-org#114)

* Add databricks deployment check and automatic DBFS path addition

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add newline at end of file

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove spurious 'not'

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Move dbfs utility functions from SparkDataSet

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add edge case logic to _build_dbfs_path

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add test for dbfs path construction

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Linting

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove spurious print statement :)

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add pylint disable too-many-public-methods

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Move tests into single method to appease linter

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify prefix check to /dbfs/

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify prefix check to /dbfs/

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Make warning message clearer

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add release note

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fix linting

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update warning message

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify log warning level to error

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify message back to warning, refer to undefined behaviour

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify required prefix to /dbfs/

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify doc string

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify warning message

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Split tests and add filepath to warning

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify f string in logging call

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fix tests

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Lint

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

---------

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add warning when SparkDataSet is used on Databricks without a valid file path
3 participants