-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add warning when SparkDataSet
is used on Databricks without a valid file path
#114
Conversation
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an excellent fix ⭐ A couple of things I want to ask about in our chat, but if this works with our pandas-iris starter then I'm very happy with it 👍
- other spark datasets
SKIP_IF_NO_SPARK
local_exists
- other
dbfs
stuff
def test_dbfs_filepath_prefix(self, filepath, mocker): | ||
mocker.patch.dict(os.environ, {"DATABRICKS_RUNTIME_VERSION": "7.3"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def test_dbfs_filepath_prefix(self, filepath, mocker): | |
mocker.patch.dict(os.environ, {"DATABRICKS_RUNTIME_VERSION": "7.3"}) | |
def test_dbfs_filepath_prefix(self, filepath, monkeypatch): | |
monkeypatch.setenv( "DATABRICKS_RUNTIME_VERSION", "7.3") |
Just for consistency with other tests (see e.g. test_rich_traceback_disabled_on_databricks
and several others).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, made this change.
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
…dro-plugins into fix/versioned-spark-dataset Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
@@ -114,6 +118,20 @@ def _dbfs_exists(pattern: str, dbutils: Any) -> bool: | |||
return False | |||
|
|||
|
|||
def _deployed_on_databricks() -> bool: | |||
"""Check if running on Databricks.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a function now to determine whether the project is run in Databricks
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean we should add a function to Kedro in a separate PR and import that here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes since they are doing the exact same thing, we should just keep the logic consistent. If this is a more pressing bug fix I think it's fine to keep this in the PR but we should still refactor it in kedro
later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I am generally very much pro-DRY, in this case I would prefer we just kept this duplicated since...
- it's not clear where such a function would live, and presumably it would be a private function, in which case kedro-datasets would become coupled to a private function in kedro core
- it's a very small and simple piece of logic which is unlikely to change any time soon
- the "check if we are on databricks" logic in kedro core is almost definitely going to move anyway as part of Introduce our own rich console logging handler kedro#2277
Overall, I'd go for the principle that "duplication is cheaper than getting it wrong" here and so just leave it duplicated.
logger.error( | ||
"Using SparkDataSet on Databricks without the `/dbfs` prefix in the " | ||
"filepath raises an error. Add this prefix to fix the error." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a DataSetError
since this is considered an invalid config? What kind of error are we getting now, is it some FileNotFound error due to unexpected path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A DataSetError gets raised on trying to save with versioning enabled, but it is not clear from the message why this is. This message is supposed to inform the user. Do you think we should raise an error with a clearer message instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think if this is an invalid code path and we know that it's going to fail we should raise the error earlier, what's the point of delaying until later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this back to a warning in d3a0a80 and refer to 'undefined behaviour' instead.
I also modified the doc string to make it clearer to the users that they need to specify this /dbfs
prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think if this is an invalid code path and we know that it's going to fail we should raise the error earlier, what's the point of delaying until later?
Ah sorry, I didn't see this reply before I wrote my last message.
I would agree with you but there is one subtlety, which is that if you specify a version in SparkDataSet
, file path resolution will work. I wouldn't like to raise an error here as this would be a breaking change for the users who do specify a version. Currently, it is only the automatic resolution of the latest version that is broken.
The alternative would be to add another condition to the if
statement, checking whether a version has been specified and only raising an error in the case that it wasn't. However, I don't like this, as it leads the user into thinking it is OK to leave out the /dbfs/
prefix when a version is specified. As far as I understand, they should always be using this prefix. Therefore, I would rather display a warning whenever a SparkDataSet
is invoked incorrectly in every case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, in that case I agree we should not break the existing behaviors and I think the current solution with warning is fine.
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
SparkDataSet
is used on Databricks without a valid file path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining the problem. I am fine with the changes but got a bit confused what was the problem.
Is kedro-org/kedro#1801 an accurate description of what's happening? Why does it works when version is provided, is this also a permission issue about iterating the directory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, thanks for getting to the bottom of it. I definitely feel more comfortable about the solution than actually changing the filepath 👍
@@ -114,6 +118,20 @@ def _dbfs_exists(pattern: str, dbutils: Any) -> bool: | |||
return False | |||
|
|||
|
|||
def _deployed_on_databricks() -> bool: | |||
"""Check if running on Databricks.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I am generally very much pro-DRY, in this case I would prefer we just kept this duplicated since...
- it's not clear where such a function would live, and presumably it would be a private function, in which case kedro-datasets would become coupled to a private function in kedro core
- it's a very small and simple piece of logic which is unlikely to change any time soon
- the "check if we are on databricks" logic in kedro core is almost definitely going to move anyway as part of Introduce our own rich console logging handler kedro#2277
Overall, I'd go for the principle that "duplication is cheaper than getting it wrong" here and so just leave it duplicated.
if _deployed_on_databricks() and not _path_has_dbfs_prefix(filepath): | ||
logger.warning( | ||
"Using SparkDataSet on Databricks without the `/dbfs/` prefix in the " | ||
"filepath is a known source of error. You must add this prefix." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"filepath is a known source of error. You must add this prefix." | |
"filepath is a known source of error. You must add this prefix to the relevant entries in your data catalog." |
And maybe even log the filepath
as part of the warning here also just to make it super easy for someone to identify and fix the right datasets? Since we don't have access to the names of the problematic datasets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I will add this.
@@ -440,6 +442,25 @@ def test_copy(self): | |||
assert spark_dataset_copy._file_format == "csv" | |||
assert spark_dataset_copy._save_args == {"mode": "overwrite"} | |||
|
|||
def test_dbfs_prefix_warning(self, monkeypatch, caplog): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit fussy, but I think this should be 3 separate tests or 3 separate cases of a parameterised test. Even though they're small, each assert
tests a different case here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right about this, though the linter complains that the file is long when these verbose tests are added (I originally had three separate tests). I would turn the warning off, but honestly I am a bit tired of ignoring the linter. There's not much point to using it if we ignore it every time it tells us off. My solution would be to add this slimmer method for now as this is quite urgent, then open an issue to split the test file. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, a better solution would be to split this back into three tests, add an ignore directive. Then split the file in a new issue and remove the ignore directive 🤔. This is what I will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmholzer agree with this. Maybe we should just add more global ignores to the linter configuration for tests to make it less strict if you're finding that you continually need to add inline ignores.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved since I don't want to block this, the changes looks good to me, just have some questions for clarifications.
This issue documents the error that users receive when the The long-term solution to the general problem looks to be to refactor I created a new issue to link this PR to. I will leave the issue you mentioned open for now and see where we are after refactoring
In the case that a version is provided, the
This could cause the same error in some cases, though in the case of deployment on Databricks this is unlikely to be the problem. |
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
… file path (kedro-org#114) * Add databricks deployment check and automatic DBFS path addition Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add newline at end of file Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Remove spurious 'not' Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Move dbfs utility functions from SparkDataSet Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add edge case logic to _build_dbfs_path Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add test for dbfs path construction Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Linting Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Remove spurious print statement :) Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add pylint disable too-many-public-methods Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Move tests into single method to appease linter Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify prefix check to /dbfs/ Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify prefix check to /dbfs/ Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Make warning message clearer Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add release note Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Fix linting Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Update warning message Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify log warning level to error Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify message back to warning, refer to undefined behaviour Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify required prefix to /dbfs/ Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify doc string Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify warning message Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Split tests and add filepath to warning Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify f string in logging call Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Fix tests Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Lint Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> --------- Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
… file path (kedro-org#114) * Add databricks deployment check and automatic DBFS path addition Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add newline at end of file Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Remove spurious 'not' Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Move dbfs utility functions from SparkDataSet Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add edge case logic to _build_dbfs_path Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add test for dbfs path construction Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Linting Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Remove spurious print statement :) Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add pylint disable too-many-public-methods Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Move tests into single method to appease linter Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify prefix check to /dbfs/ Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify prefix check to /dbfs/ Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Make warning message clearer Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add release note Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Fix linting Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Update warning message Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify log warning level to error Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify message back to warning, refer to undefined behaviour Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify required prefix to /dbfs/ Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify doc string Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify warning message Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Split tests and add filepath to warning Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify f string in logging call Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Fix tests Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Lint Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> --------- Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
… file path (kedro-org#114) * Add databricks deployment check and automatic DBFS path addition Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add newline at end of file Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Remove spurious 'not' Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Move dbfs utility functions from SparkDataSet Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add edge case logic to _build_dbfs_path Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add test for dbfs path construction Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Linting Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Remove spurious print statement :) Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add pylint disable too-many-public-methods Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Move tests into single method to appease linter Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify prefix check to /dbfs/ Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify prefix check to /dbfs/ Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Make warning message clearer Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add release note Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Fix linting Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Update warning message Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify log warning level to error Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify message back to warning, refer to undefined behaviour Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify required prefix to /dbfs/ Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify doc string Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify warning message Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Split tests and add filepath to warning Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify f string in logging call Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Fix tests Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Lint Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> --------- Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
… file path (kedro-org#114) * Add databricks deployment check and automatic DBFS path addition Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add newline at end of file Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Remove spurious 'not' Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Move dbfs utility functions from SparkDataSet Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add edge case logic to _build_dbfs_path Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add test for dbfs path construction Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Linting Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Remove spurious print statement :) Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add pylint disable too-many-public-methods Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Move tests into single method to appease linter Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify prefix check to /dbfs/ Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify prefix check to /dbfs/ Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Make warning message clearer Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add release note Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Fix linting Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Update warning message Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify log warning level to error Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify message back to warning, refer to undefined behaviour Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify required prefix to /dbfs/ Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify doc string Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify warning message Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Split tests and add filepath to warning Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify f string in logging call Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Fix tests Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Lint Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> --------- Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
… file path (kedro-org#114) * Add databricks deployment check and automatic DBFS path addition Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add newline at end of file Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Remove spurious 'not' Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Move dbfs utility functions from SparkDataSet Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add edge case logic to _build_dbfs_path Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add test for dbfs path construction Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Linting Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Remove spurious print statement :) Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add pylint disable too-many-public-methods Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Move tests into single method to appease linter Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify prefix check to /dbfs/ Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify prefix check to /dbfs/ Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Make warning message clearer Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Add release note Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Fix linting Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Update warning message Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify log warning level to error Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify message back to warning, refer to undefined behaviour Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify required prefix to /dbfs/ Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify doc string Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify warning message Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Split tests and add filepath to warning Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Modify f string in logging call Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Fix tests Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> * Lint Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> --------- Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer jannic.holzer@quantumblack.com
Description
Resolves #121.
Adds a guard that adds the prefix '/dbfs' to the file path passed intoSparkDataSet
in the case that it is running on DataBricks and has been omitted by the user.Without this, the_exists_function
attribute ofSparkDataSet
will default to_local_exists
(defined inkedro.io.core.py
), which will fail to resolve any data on the node that is running the code. This is because data is not stored locally on these nodes, it is stored in DBFS instead.Add a warning that versioning will not work when a user attempts to use a
SparkDataSet
with a filepath not prefixed with/dbfs
.Development notes
Added two simple checks too see if:
SparkDataSet
is running on Databricks.filepath
is prefixed with/dbfs
If both are true, log a warning to the user.
Checklist
RELEASE.md
file