SparkHiveDataset Does not work on Databricks #1001

Closed
vihag opened this issue Oct 28, 2021 · 2 comments
Labels
Issue: Bug Report 🐞 Bug that needs to be fixed

Comments

@vihag

vihag commented Oct 28, 2021

Description

When referring to a database/table in Databricks through SparkHiveDataSet, loading fails with:
kedro.io.core.DataSetError: Failed while loading data from data set SparkHiveDataSet
cannot resolve 'namespace' given input columns: [databaseName];;

Context

We want to reference data in Databricks by database/table pairs rather than DBFS file paths, for better governance. SparkHiveDataSet is the available way to do this until the Delta dataset implementation arrives.

Steps to Reproduce

  1. Create a SparkHiveDataSet that points at a Databricks database and table
  2. Run a Kedro pipeline through the dbconnect flow and observe the error

Expected Result

The dataset should load the data from the table so the pipeline can process it further.

Actual Result

Loading fails with kedro.io.core.DataSetError because the column 'namespace' cannot be resolved.

My Debug

The issue comes from this line in the codebase, where the column name "namespace" is hardcoded. In Databricks, the command show databases returns a column called databaseName instead, hence the error above about not being able to resolve the column.

kedro.io.core.DataSetError: Failed while loading data from data set SparkHiveDataSet

cannot resolve '`namespace`' given input columns: [databaseName];;
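
One way to avoid depending on a single hardcoded column name is to inspect the columns that show databases actually returns and pick whichever known name is present. The following is a minimal sketch of that idea, not Kedro's implementation: `resolve_database_column` and `database_exists` are hypothetical helper names, and the only two column names considered are the ones mentioned in this issue ("namespace" and "databaseName"). The `spark` argument is assumed to be a `pyspark.sql.SparkSession`.

```python
from typing import Iterable

# Column names seen in this issue: "namespace" is what Kedro 0.17.5 looks for,
# "databaseName" is what the Databricks error message reports as available.
KNOWN_DATABASE_COLUMNS = ("namespace", "databaseName")


def resolve_database_column(columns: Iterable[str]) -> str:
    """Return whichever known database-name column is present in `columns`."""
    available = list(columns)
    for candidate in KNOWN_DATABASE_COLUMNS:
        if candidate in available:
            return candidate
    raise ValueError(f"No database-name column found in {available}")


def database_exists(spark, database: str) -> bool:
    """Check for `database` without hardcoding the SHOW DATABASES column name.

    `spark` is assumed to be a pyspark.sql.SparkSession (hypothetical helper).
    """
    databases = spark.sql("SHOW DATABASES")
    column = resolve_database_column(databases.columns)
    # Rows are indexed by the resolved column name, whichever it turned out to be.
    return any(row[column] == database for row in databases.collect())
```

With a live session this would be called as `database_exists(spark, "my_database")`, where `my_database` is a placeholder name.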

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used: 0.17.5
  • Python version used: 3.7
  • Operating system and version: macOS Big Sur
  • Databricks: Azure, DBR 7.1 LTS
@vihag vihag added the Issue: Bug Report 🐞 Bug that needs to be fixed label Oct 28, 2021
@datajoely
Contributor

Hi @vihag, our Delta dataset implementation will be in a release we want to get out in the next month. Until then, it's probably easiest for you to subclass the SparkHiveDataSet you want to use and override or extend it to work for your purposes.
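
The subclassing suggestion might look roughly like the sketch below. It is an illustration under assumptions, not a tested fix: it assumes that SparkHiveDataSet in Kedro 0.17.x exposes `_get_spark()`, `_database` and `_table`, and that the failing column lookup runs through `_exists()` during load. If the check lives elsewhere in your version, the override would need to move accordingly. `CatalogExistsMixin` and `DatabricksHiveDataSet` are hypothetical names. The existence check uses PySpark's `catalog.listDatabases()` and `catalog.listTables()`, which return objects with a `name` attribute, so no show databases column name is involved.

```python
class CatalogExistsMixin:
    """Replace the SHOW DATABASES column lookup with Spark catalog calls.

    Assumes the host class provides `_get_spark()`, `_database` and `_table`,
    as SparkHiveDataSet appears to in Kedro 0.17.x (an assumption, see above).
    """

    def _exists(self) -> bool:
        catalog = self._get_spark().catalog
        # Compare case-insensitively, since metastore names are usually lowercased.
        database = self._database.lower()
        if database not in {db.name.lower() for db in catalog.listDatabases()}:
            return False
        table = self._table.lower()
        return table in {t.name.lower() for t in catalog.listTables(self._database)}


# Hypothetical usage, requires kedro and pyspark to be installed:
#
# from kedro.extras.datasets.spark import SparkHiveDataSet
#
# class DatabricksHiveDataSet(CatalogExistsMixin, SparkHiveDataSet):
#     pass
```

The mixin is listed first in the base classes so that its `_exists` takes precedence over the one inherited from the dataset class.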

@stale

stale bot commented Dec 27, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 27, 2021
@stale stale bot closed this as completed Jan 3, 2022