Loading a SparkDataSet stored in DBFS from a Windows machine through Databricks Connect raises an exception.
Context
The SparkDataSet implementation creates the _filepath attribute using Path() in __init__(), except when the filepath prefix is for S3 or HDFS, in which case special logic applies.
On Windows this produces a WindowsPath by default. If we want to reference a DBFS filepath, loading the dataset fails because the WindowsPath uses backslashes (\) as separators when converted to a string.
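The separator problem can be reproduced on any operating system with pathlib alone. This is only a minimal illustration, not Kedro's code: PureWindowsPath stands in for what Path() returns on Windows, and the path is the one from the error message below.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Example DBFS location taken from the error message in this issue.
dbfs_path = "/mnt/this/is/my/dbfs/path"

# PureWindowsPath mimics what Path() yields on Windows, without needing Windows.
as_windows = str(PureWindowsPath(dbfs_path))
# PurePosixPath keeps forward slashes regardless of the host OS.
as_posix = str(PurePosixPath(dbfs_path))

print(as_windows)  # \mnt\this\is\my\dbfs\path
print(as_posix)    # /mnt/this/is/my/dbfs/path
```

The backslash form is what ends up in the filepath that gets passed on to Spark, which DBFS cannot resolve.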
Steps to Reproduce
On a Windows machine using databricks-connect, load a SparkDataSet whose filepath references DBFS.
Expected Result
The dataset is loaded successfully.
Actual Result
An exception is raised:
kedro.io.core.DataSetError: Failed while loading data from data set SparkDataSet(file_format=parquet, filepath=dbfs://mnt\this\is\my\dbfs\path). 'Can not create a Path from an empty string'
Your Environment
Kedro 0.15.5
Python 3.7.6
Windows 10
Databricks-connect 6.1.0
Possible implementation to fix
I can think of two ways of fixing this, but I'd like to hear your thoughts:
1. Add another special case for the dbfs:// prefix in SparkDataSet.__init__ and create a PurePosixPath instead of a Path when that prefix is specified. This would still fail when the dbfs:// prefix is not specified, which is inconsistent with the behavior on Unix systems.
2. Inspect the pyspark module at runtime to guess whether local Spark or databricks-connect is being used, and decide between PurePosixPath and Path based on that. I don't know if there is an elegant way of doing this.
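To make the first option concrete, here is a rough sketch of what the prefix handling could look like. The helper name split_filepath and the DBFS_PREFIX constant are hypothetical and are not part of Kedro; the real change would live inside SparkDataSet.__init__.

```python
from pathlib import Path, PurePath, PurePosixPath
from typing import Tuple

# Hypothetical constant for illustration; not defined by Kedro.
DBFS_PREFIX = "dbfs://"


def split_filepath(filepath: str) -> Tuple[str, PurePath]:
    """Hypothetical helper: return (prefix, path) for a dataset filepath.

    DBFS locations get a PurePosixPath so that separators stay as forward
    slashes on Windows; everything else falls back to the OS-native Path.
    """
    if filepath.startswith(DBFS_PREFIX):
        return DBFS_PREFIX, PurePosixPath(filepath[len(DBFS_PREFIX):])
    return "", Path(filepath)


prefix, path = split_filepath("dbfs://mnt/this/is/my/dbfs/path")
# Rebuild the string that would be handed to Spark when loading.
load_path = prefix + str(path)
print(load_path)  # dbfs://mnt/this/is/my/dbfs/path
```

As noted in option 1, a filepath without the dbfs:// prefix still goes through Path() here, so it would keep failing on Windows.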
I noticed that the SparkDataSet documentation suggests specifying filepaths for (versioned) SparkDataSets starting with /dbfs/mnt, but I tried that and got the same error.
yetudada changed the title from "SparkDataSet fails in Windows using Databricks connect when reading from DBFS" to "[KED-1475] SparkDataSet fails in Windows using Databricks connect when reading from DBFS" on Mar 16, 2020