write_deltalake fails writing a simple dataset with categorical columns #1326

Closed
torshind opened this issue May 2, 2023 · 8 comments
Labels: binding/rust, bug, help wanted

Comments

torshind commented May 2, 2023

Environment

Delta-rs version: 0.8.1

Binding: python

Environment:

  • OS: Ubuntu 22.04.2 WSL2
  • Other: Python 3.10.6

Bug

What happened:
write_deltalake fails writing a simple dataset with categorical columns

Traceback (most recent call last):
  File "/home/mvm/gitlab/lakehouse/bug.py", line 5, in <module>
    write_deltalake("./table_test", data=df, mode="overwrite")
  File "/home/mvm/.pyenv/versions/studio-py3.10/lib/python3.10/site-packages/deltalake/writer.py", line 312, in write_deltalake
    _write_new_deltalake(
deltalake.PyDeltaTableError: Schema error: Invalid data type for Delta Lake: Dictionary(Int8, Utf8)

What you expected to happen:
write_deltalake not to fail

How to reproduce it:
Minimal test case to reproduce it:

from deltalake.writer import write_deltalake
from sklearn.datasets import fetch_openml

df = fetch_openml(name="arrhythmia", version=1, as_frame=True)["data"]
write_deltalake("./table_test", data=df, mode="overwrite")
@torshind added the bug label May 2, 2023
@wjones127
Collaborator

Related to: #686

@wjones127
Collaborator

I think for dictionary types (and other future encoded types, such as REE), we should develop some facility for mapping to a canonical "logical type". So Dictionary(Int8, Utf8) should map to Utf8 (and so should LargeUtf8), which we know is supported.

@kangshung

When can we expect this to be implemented?

@rtyler added the help wanted and binding/rust labels Jan 3, 2024
@itamarst

itamarst commented May 21, 2024

I tried reproducing the given script, and it no longer errors. It does in fact write out table_test. The resulting columns have different types though. E.g. the chV1_DD_RTwaveExists column starts out as a categorical with values "0" and "1". Once you write it out using the reproducer script above and then read the resulting table, you get a column type of string.
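If the categorical dtype matters downstream, one possible workaround is to restore it on the pandas side after reading. The sketch below uses only pandas and stands in for the write/read round trip with an explicit cast to `object`; the variable names are illustrative and the sample values are made up, with only the column name taken from the comment above:

```python
import pandas as pd

# A categorical column like the one described above.
original = pd.DataFrame(
    {"chV1_DD_RTwaveExists": pd.Categorical(["0", "1", "0"])}
)

# Remember the categorical dtypes before writing.
categorical_dtypes = {
    name: dtype
    for name, dtype in original.dtypes.items()
    if isinstance(dtype, pd.CategoricalDtype)
}

# Stand-in for write_deltalake followed by reading the table back:
# the column comes back as plain strings.
roundtripped = original.astype({name: "object" for name in categorical_dtypes})

# Re-apply the saved dtypes to get categoricals again.
restored = roundtripped.astype(categorical_dtypes)
```

The saved dtypes live only in the Python process here; persisting them alongside the table is left to the caller.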

@itamarst

itamarst commented May 21, 2024

So, in some sense this is fixed. However, the ideal would be to preserve categoricals as categoricals, rather than strings. Is there a technical reason this can't be done, or is it just a matter of someone having the time? If the latter, what specifically needs doing?

@ion-elgreco
Collaborator

ion-elgreco commented May 21, 2024

> So, in some sense this is fixed. However, the ideal would be to preserve categoricals as categoricals, rather than strings. Is there a technical reason this can't be done, or is it just a matter of someone having the time? If the latter, what specifically needs doing?

This is not possible. Categorical is not a supported primitive type in the delta protocol.

If you would like it to be a supported type, you need to post an RFC in the main delta repo. Only once it has been introduced in the protocol there can we add support.

@itamarst

Thank you, that's good to know.

@itamarst

Given that the reproducer no longer fails, can this issue be closed if that matches other people's results?

@ion-elgreco closed this as not planned Aug 4, 2024