Unable to use with delta-sharing.io reference data #6

Open
robkroll opened this issue Jan 9, 2023 · 3 comments

Comments

robkroll commented Jan 9, 2023

Hello. I'm new to R, so forgive the beginner question, but I'm unable to use the connector against the reference data provided by delta-sharing.io.

I've built the package and I'm running it this way:

install.packages("C:\\temp\\delta.sharing_0.1.1.zip", repos = NULL, type="source")
library("arrow")

profile_path <- "C:\\temp\\open-datasets.share"
download_path <- "C:\\temp\\share-download"
share <- "delta_sharing"
schema <- "default"
table <- "lending_club"

library(delta.sharing)
client <- sharing_client(profile_path)

# table class
ds_tbl <- client$table(share, schema, table)

# (optional) specify a limit (best effort to enforce)
ds_tbl$set_limit(limit = 1000)
ds_tbl$limit

# (optional) where to download files (before arrow kicks in)
ds_tbl$set_download_path(download_path)

# just want a tibble? (alias for collect on arrow)
ds_tbl_tibble <- ds_tbl$load_as_tibble()

# write the tibble out to CSV
write.csv(ds_tbl_tibble, file = file.path(download_path, "tibble.csv"))

However, I get the following output:

Error in `arrow::open_dataset()`:
! IOError: Error creating dataset. Could not read schema from 'C:/temp/share-download/fe9ef647-d848-476b-afbe-9083b532ec24/0df8a546325957122d72659e2ca8edc1.parquet': Could not open Parquet input source 'C:/temp/share-download/fe9ef647-d848-476b-afbe-9083b532ec24/0df8a546325957122d72659e2ca8edc1.parquet': Couldn't deserialize thrift: don't know what type: �
. Is this a 'parquet' file?

Any help would be appreciated. Thanks!

@zacdav-db
Owner

Hey Robert,

Thanks for filing the issue. I can confirm that I can reproduce the same error on both macOS and Windows machines.

Rest assured that you are doing the correct thing. I am currently investigating whether silent changes were pushed to the Delta Sharing server or protocol.

Is your share created with Databricks or open source delta sharing?

Owner

zacdav-db commented Jan 9, 2023

@robkroll can you please try:

arrow::read_parquet("C:/temp/share-download/fe9ef647-d848-476b-afbe-9083b532ec24/0df8a546325957122d72659e2ca8edc1.parquet")

or if you can install {duckdb}:

install.packages("duckdb")
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbGetQuery(
  con,
  "select * from read_parquet('C:/temp/share-download/fe9ef647-d848-476b-afbe-9083b532ec24/0df8a546325957122d72659e2ca8edc1.parquet')"
)

I'm expecting you'll see a similar error for both.

Author

robkroll commented Jan 9, 2023

Hi @zacdav-db, thanks for the reply.

> Is your share created with Databricks or open source delta sharing?

For now I'm just trying to use the reference server described in the Delta Sharing README, so I've downloaded the open-datasets.share file. I had to edit it for use with the R connector, which requires an expirationTime property set in the future, so my open-datasets.share file looks like this:

{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.delta.io/delta-sharing/",
  "bearerToken": "faaie590d541265bcab1f2de9813274bf233",
  "expirationTime": "2023-12-29T13:31:21"
}
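
To double-check that an edited expirationTime really is in the future, a minimal base R sketch follows. It assumes the timestamp is meant as UTC and that the connector's check amounts to comparing it with the current time; the thread only says the value must be in the future. The helper name `expiration_in_future` is hypothetical and is not part of delta.sharing.

```r
# Hypothetical helper, not part of the delta.sharing package.
# Assumption: expirationTime is an ISO-8601-style timestamp interpreted as UTC.
expiration_in_future <- function(expiration_time, now = Sys.time()) {
  parsed <- as.POSIXct(
    expiration_time,
    format = "%Y-%m-%dT%H:%M:%S",
    tz = "UTC"
  )
  if (is.na(parsed)) {
    stop("Could not parse expirationTime: ", expiration_time)
  }
  # TRUE when the profile has not yet expired relative to `now`
  parsed > now
}

# Value taken from the profile above; the result depends on the current date.
expiration_in_future("2023-12-29T13:31:21")
```

Passing `now` explicitly makes the check reproducible, for example when testing a profile against a known date.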

> I'm expecting you'll see a similar error for both.

Yes, reading the file directly with either arrow or duckdb gives me the same error: don't know what type: �.
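
One more check that might help narrow this down, sketched in base R: a Parquet file begins and ends with the 4-byte magic string `PAR1`, so inspecting those bytes can show whether the downloaded file is a Parquet container at all (as opposed to, say, an error response saved to disk). The helper name `has_parquet_magic` is hypothetical and is not part of arrow or delta.sharing. The sketch reads the whole file into memory, which is an assumption that the file is reasonably small, and it does not validate the footer metadata that the thrift error refers to.

```r
# Hypothetical helper, not part of arrow or delta.sharing.
# Returns TRUE when the file starts and ends with the Parquet magic "PAR1".
has_parquet_magic <- function(path) {
  size <- file.size(path)
  # Missing files give NA; anything under 8 bytes cannot hold both markers
  if (is.na(size) || size < 8) {
    return(FALSE)
  }
  # Reads the whole file as raw bytes (assumes it fits comfortably in memory)
  bytes <- readBin(path, what = "raw", n = size)
  magic <- charToRaw("PAR1")
  identical(bytes[1:4], magic) &&
    identical(bytes[(length(bytes) - 3):length(bytes)], magic)
}

# Example call, using the path from the error message above:
# has_parquet_magic("C:/temp/share-download/fe9ef647-d848-476b-afbe-9083b532ec24/0df8a546325957122d72659e2ca8edc1.parquet")
```

A `FALSE` result would suggest the download itself did not produce a Parquet file; a `TRUE` result would point at the file's contents rather than its container.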
