
[KED-1475] SparkDataSet fails in Windows using Databricks connect when reading from DBFS #277

Closed
MigQ2 opened this issue Mar 6, 2020 · 2 comments
Labels
Issue: Bug Report 🐞 Bug that needs to be fixed

Comments

@MigQ2
Contributor

MigQ2 commented Mar 6, 2020

Description

Loading a SparkDataSet from DBFS on a Windows machine with Databricks connect throws an exception.

Context

On Windows, the SparkDataSet implementation creates its _filepath attribute with Path() in __init__(), except when the filepath prefix is S3 or HDFS, which have their own special handling.
By default this produces a WindowsPath, so a DBFS filepath fails when the dataset is loaded: converting the WindowsPath back to a string uses backslashes (\) as separators, which DBFS cannot resolve.
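
A quick way to see the separator problem outside of kedro (a minimal sketch using plain pathlib, not the actual SparkDataSet code; the path is the one from the error message below):

    from pathlib import PurePosixPath, PureWindowsPath

    # On Windows, Path() resolves to a WindowsPath, so converting it back
    # to a string produces backslash separators that DBFS cannot resolve.
    print(str(PureWindowsPath("mnt/this/is/my/dbfs/path")))
    # mnt\this\is\my\dbfs\path

    # A PurePosixPath keeps the forward slashes that DBFS expects.
    print(str(PurePosixPath("mnt/this/is/my/dbfs/path")))
    # mnt/this/is/my/dbfs/path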

Steps to Reproduce

  1. On a Windows machine using Databricks-connect, load a SparkDataSet whose filepath references DBFS (see the reproduction sketch below)
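
For illustration, a minimal reproduction along these lines (assuming the kedro 0.15.x import location for SparkDataSet; the DBFS path is just the one from the error message):

    # Run on a Windows machine with databricks-connect configured
    from kedro.contrib.io.pyspark import SparkDataSet  # 0.15.x location

    data_set = SparkDataSet(
        filepath="dbfs://mnt/this/is/my/dbfs/path",
        file_format="parquet",
    )
    df = data_set.load()  # raises kedro.io.core.DataSetError on Windows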

Expected Result

The DataSet is loaded successfully.

Actual Result

An exception is raised:

kedro.io.core.DataSetError: Failed while loading data from data set SparkDataSet(file_format=parquet, filepath=dbfs://mnt\this\is\my\dbfs\path). 'Can not create a Path from an empty string' 

Your Environment

  • Kedro 0.15.5
  • Python 3.7.6
  • Windows 10
  • Databricks-connect 6.1.0

Possible implementation to fix

I can think of two ways of fixing this, but I'd like to hear your thoughts:

  1. Add another special case for the dbfs:// prefix in SparkDataSet.__init__ and create a PurePosixPath instead of a Path when that prefix is present. This would still fail when the dbfs:// prefix is not specified, which creates behaviour inconsistent with Unix systems (a sketch of this option follows the list).
  2. Inspect the pyspark module at runtime to guess whether local Spark or databricks-connect is being used and, based on that, decide whether to use PurePosixPath or Path. I don't know whether there is an elegant way of doing this.
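
A rough sketch of what option 1 could look like (illustrative only, not the actual kedro source; the helper name and return shape are made up):

    from pathlib import Path, PurePosixPath

    DBFS_PREFIX = "dbfs://"

    def _split_filepath(filepath: str):
        """Keep the dbfs:// prefix aside and store the remainder as a POSIX
        path, so that str() never introduces backslashes on Windows."""
        if filepath.startswith(DBFS_PREFIX):
            return DBFS_PREFIX, PurePosixPath(filepath[len(DBFS_PREFIX):])
        # the existing s3a://, s3n:// and hdfs:// special cases would stay here
        return "", Path(filepath)
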
MigQ2 added the Issue: Bug Report 🐞 Bug that needs to be fixed label on Mar 6, 2020
@MigQ2
Contributor Author

MigQ2 commented Mar 6, 2020

I realized that the SparkDataSet documentation hints at specifying filepaths for (versioned) SparkDataSets starting with /dbfs/mnt, but I tried that and got the same error.
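
For reference, the documented form that also fails (same hypothetical reproduction as above):

    from kedro.contrib.io.pyspark import SparkDataSet  # 0.15.x location

    data_set = SparkDataSet(
        filepath="/dbfs/mnt/this/is/my/dbfs/path",  # /dbfs/mnt style from the docs
        file_format="parquet",
    )
    data_set.load()  # still raises DataSetError on Windows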

yetudada changed the title from "SparkDataSet fails in Windows using Databricks connect when reading from DBFS" to "[KED-1475] SparkDataSet fails in Windows using Databricks connect when reading from DBFS" on Mar 16, 2020
@andrii-ivaniuk
Contributor

andrii-ivaniuk commented Apr 6, 2020

This was resolved and will be available in a future release.
