
[KED-1475] SparkDataSet fails in Windows using Databricks connect when reading from DBFS #277

Closed
MigQ2 opened this issue Mar 6, 2020 · 2 comments
Labels
Issue: Bug Report 🐞 Bug that needs to be fixed

Comments

@MigQ2
Contributor

MigQ2 commented Mar 6, 2020

Description

Loading a SparkDataSet from DBFS on a Windows machine with Databricks connect throws an exception.

Context

On Windows, the SparkDataSet implementation creates its _filepath attribute with Path() in __init__(), except when the filepath prefix is S3 or HDFS, which have their own special handling.
By default this produces a WindowsPath, so a DBFS filepath fails when the dataset is loaded: converting the WindowsPath back to a string uses backslashes (\) as separators, which DBFS cannot resolve.
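
A quick way to see the separator problem outside of kedro (a minimal sketch using plain pathlib, not the actual SparkDataSet code; the path is the one from the error message below):

    from pathlib import PurePosixPath, PureWindowsPath

    # On Windows, Path() resolves to a WindowsPath, so converting it back
    # to a string produces backslash separators that DBFS cannot resolve.
    print(str(PureWindowsPath("mnt/this/is/my/dbfs/path")))
    # mnt\this\is\my\dbfs\path

    # A PurePosixPath keeps the forward slashes that DBFS expects.
    print(str(PurePosixPath("mnt/this/is/my/dbfs/path")))
    # mnt/this/is/my/dbfs/path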

Steps to Reproduce

  1. On a Windows machine using Databricks-connect, load a SparkDataSet whose filepath references DBFS (see the reproduction sketch below)
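
For illustration, a minimal reproduction along these lines (assuming the kedro 0.15.x import location for SparkDataSet; the DBFS path is just the one from the error message):

    # Run on a Windows machine with databricks-connect configured
    from kedro.contrib.io.pyspark import SparkDataSet  # 0.15.x location

    data_set = SparkDataSet(
        filepath="dbfs://mnt/this/is/my/dbfs/path",
        file_format="parquet",
    )
    df = data_set.load()  # raises kedro.io.core.DataSetError on Windows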

Expected Result

The DataSet is loaded successfully.

Actual Result

An exception is raised:

kedro.io.core.DataSetError: Failed while loading data from data set SparkDataSet(file_format=parquet, filepath=dbfs://mnt\this\is\my\dbfs\path). 'Can not create a Path from an empty string' 

Your Environment

  • Kedro 0.15.5
  • Python 3.7.6
  • Windows 10
  • Databricks-connect 6.1.0

Possible implementation to fix

I can think of two ways of fixing this, but I'd like to hear your thoughts:

  1. Add another special case for the dbfs:// prefix in SparkDataSet.__init__ and create a PurePosixPath instead of a Path when that prefix is present. This would still fail when the dbfs:// prefix is not specified, which creates behaviour inconsistent with Unix systems (a sketch of this option follows the list).
  2. Inspect the pyspark module at runtime to guess whether local Spark or databricks-connect is being used and, based on that, decide whether to use PurePosixPath or Path. I don't know whether there is an elegant way of doing this.
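
A rough sketch of what option 1 could look like (illustrative only, not the actual kedro source; the helper name and return shape are made up):

    from pathlib import Path, PurePosixPath

    DBFS_PREFIX = "dbfs://"

    def _split_filepath(filepath: str):
        """Keep the dbfs:// prefix aside and store the remainder as a POSIX
        path, so that str() never introduces backslashes on Windows."""
        if filepath.startswith(DBFS_PREFIX):
            return DBFS_PREFIX, PurePosixPath(filepath[len(DBFS_PREFIX):])
        # the existing s3a://, s3n:// and hdfs:// special cases would stay here
        return "", Path(filepath)
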
MigQ2 added the Issue: Bug Report 🐞 Bug that needs to be fixed label on Mar 6, 2020
@MigQ2
Contributor Author

MigQ2 commented Mar 6, 2020

I realized that the SparkDataSet documentation hints at specifying filepaths for (versioned) SparkDataSets starting with /dbfs/mnt, but I tried that and got the same error.
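
For reference, the documented form that also fails (same hypothetical reproduction as above):

    from kedro.contrib.io.pyspark import SparkDataSet  # 0.15.x location

    data_set = SparkDataSet(
        filepath="/dbfs/mnt/this/is/my/dbfs/path",  # /dbfs/mnt style from the docs
        file_format="parquet",
    )
    data_set.load()  # still raises DataSetError on Windows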

yetudada changed the title from "SparkDataSet fails in Windows using Databricks connect when reading from DBFS" to "[KED-1475] SparkDataSet fails in Windows using Databricks connect when reading from DBFS" on Mar 16, 2020
@andrii-ivaniuk
Contributor

andrii-ivaniuk commented Apr 6, 2020

This was resolved and will be available in a future release.
