ValueError: Partition value cannot be parsed from string. #2380

Closed
thomasfrederikhoeck opened this issue Apr 4, 2024 · 9 comments · Fixed by #2383
Labels
bug Something isn't working

Comments

@thomasfrederikhoeck
Contributor

Environment

Delta-rs version:
Main including 6f81b80
Binding:
python

Environment:

  • Cloud provider: local
  • OS: Windows
  • Other:

Bug

What happened:
When I try to create a checkpoint on a table partitioned by timestamp, I get a ValueError. Note that I have built from master, including #2357:

from datetime import datetime

import pandas as pd
import deltalake as dl
from deltalake import DeltaTable, write_deltalake  # used below but not imported originally

dates = pd.date_range(datetime(2021,1,1,3,4,6,3),datetime(2021,1,3,3,4,6))

df = pd.DataFrame({"time":dates, "a":[i for i in range(len(dates))]})

schema = dl.schema.Schema(fields=[
    dl.schema.Field("time",dl._internal.PrimitiveType.from_json('"timestamp"')),
    dl.schema.Field("a",dl._internal.PrimitiveType.from_json('"integer"'))
    ]
) 

write_deltalake("mytable",df, schema=schema,partition_by="time")
dt = DeltaTable("mytable")
dt.create_checkpoint()

which gives:

ValueError: Partition value 2021-01-02 03:04:06.000003Z cannot be parsed from string.

What you expected to happen:
That the checkpoint was created.
How to reproduce it:
Run code above

More details:

@thomasfrederikhoeck
Contributor Author

It appears that the trailing Z is not handled in the parsing:

PrimitiveType::Timestamp => {
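
To illustrate the suspicion above, here is a minimal Python sketch using the standard library's `datetime.strptime`. It is not the delta-rs Rust code, and both format strings are assumptions chosen for the illustration. The point is only that a format without a trailing `Z` rejects the value from the error message, while a format that expects the literal `Z` accepts it.

```python
from datetime import datetime

# Value taken from the error message in this issue.
value = "2021-01-02 03:04:06.000003Z"

# Hypothetical formats for illustration; delta-rs uses its own Rust format strings.
FORMAT_WITHOUT_Z = "%Y-%m-%d %H:%M:%S.%f"
FORMAT_WITH_Z = "%Y-%m-%d %H:%M:%S.%fZ"


def try_parse(text, fmt):
    """Return the parsed datetime, or None if the text does not match fmt."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None


without_z = try_parse(value, FORMAT_WITHOUT_Z)  # the trailing "Z" is left unconverted
with_z = try_parse(value, FORMAT_WITH_Z)  # the literal "Z" in the format consumes it
print(without_z, with_z)
```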

@ion-elgreco
Collaborator

@thomasfrederikhoeck it shouldn't write the partition values with a Z. Also my PR didn't touch the partition value serialization.

@ion-elgreco
Collaborator

@thomasfrederikhoeck This issue seems to be related to how the pyarrow engine serializes the partition values.

@thomasfrederikhoeck
Contributor Author

Yes, it appears that pyarrow serializes timestamp with a Z and timestampNtz without.

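from datetime import datetime

import pandas as pd
import pyarrow as pa
import pytz
from deltalake import DeltaTable, write_deltalake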

tz = "UTC"

def get_data(with_tz):
    tzinfo = pytz.timezone(tz) if  with_tz else None
    dates = pd.date_range(
        datetime(2021,1,1,3,4,6,3, tzinfo=tzinfo),
        datetime(2021,1,3,3,4,6, tzinfo=tzinfo)
        )
    return pd.DataFrame({"time":dates, "a":[i for i in range(len(dates))]})

schema = pa.schema(
        [
            ("time", pa.timestamp("us")),
            ("a", pa.int64()),
        ]
    )
dt = DeltaTable.create(
        "mytable_timestampNtz", schema=schema, partition_by=["time"]
    )

write_deltalake("mytable_timestampNtz",get_data(with_tz=False), partition_by="time", mode="append")
print(dt.schema())
schema = pa.schema(
        [
            ("time", pa.timestamp("us",tz)),
            ("a", pa.int64()),
        ]
    )
dt = DeltaTable.create(
        "mytable_timestamp", schema=schema, partition_by=["time"]
    )

write_deltalake("mytable_timestamp",get_data(with_tz=True), partition_by="time", mode="append")
print(dt.schema())

>Schema([Field(time, PrimitiveType("timestampNtz"), nullable=True), Field(a, PrimitiveType("long"), nullable=True)])
>Schema([Field(time, PrimitiveType("timestamp"), nullable=True), Field(a, PrimitiveType("long"), nullable=True)])


@thomasfrederikhoeck
Contributor Author

thomasfrederikhoeck commented Apr 4, 2024

@ion-elgreco I wanted to try the rust engine, but the problem is that it serializes the value like this: 2021-01-01 03:04:06.000003. That is invalid on Windows, where a folder or file name can't contain a colon (:).

OSError: Generic LocalFileSystem error: Unable to open file C:\projects\delta-rs\mytable_timestamp\time(=2021-01-02 03:04:06.000003\part-00001-a361470e-2514-4309-ae6f-153e877e3f51-c000.snappy.parquet#1: The filename, directory name, or volume label syntax is incorrect. (os error 123)
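
The colon problem can be shown without touching the filesystem. The sketch below is an illustration under stated assumptions, not what either engine does: `WINDOWS_RESERVED` and `encode_partition_value` are hypothetical names, and percent-encoding with `urllib.parse.quote` is just one possible way to make the value safe as a directory name.

```python
from urllib.parse import quote, unquote

# Characters that Windows does not allow in file or directory names.
WINDOWS_RESERVED = set('<>:"/\\|?*')


def encode_partition_value(value: str) -> str:
    """Percent-encode everything except letters, digits and "_.-~" (hypothetical helper)."""
    return quote(value, safe="")


raw = "2021-01-01 03:04:06.000003"
encoded = encode_partition_value(raw)

print(WINDOWS_RESERVED & set(raw))  # the offending character(s) in the raw value
print(encoded)
print(unquote(encoded) == raw)  # the encoding is reversible
```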

@ion-elgreco
Collaborator

@thomasfrederikhoeck can you make a separate issue for that?

@thomasfrederikhoeck
Contributor Author

thomasfrederikhoeck commented Apr 4, 2024

Yes, #2382 :-) @ion-elgreco

@ion-elgreco
Collaborator

Quoting @thomasfrederikhoeck: "Yes, it appears that pyarrow serializes timestamp with a Z and timestampNtz without."

@thomasfrederikhoeck for this one, can you also create a separate issue? : P

@thomasfrederikhoeck
Contributor Author

@ion-elgreco Done #2384 :-)

ion-elgreco added a commit that referenced this issue Apr 15, 2024
…missing timestampNtz deserialization (#2383)

# Description
Our timestamp deserialization format didn't include the %6f needed to decode
a value such as 1970-01-01 00:00:00.123456. Additionally, when adding timestampNtz,
I didn't add deserialization for that primitive type :)


- fixes #2380
- fixes #2381
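
As a rough Python analogue of the fractional-seconds gap described in the commit message above, the sketch below tries a short list of candidate formats so that values with and without fractional seconds both parse. The actual change is in the Rust code, whose format specifiers differ from Python's. `CANDIDATE_FORMATS` and `parse_partition_timestamp` are hypothetical names, and the error text simply mirrors the message reported in this issue.

```python
from datetime import datetime

# Hypothetical candidate formats, tried in order.
CANDIDATE_FORMATS = (
    "%Y-%m-%d %H:%M:%S.%f",  # with fractional seconds
    "%Y-%m-%d %H:%M:%S",  # whole seconds only
)


def parse_partition_timestamp(text: str) -> datetime:
    """Parse text with the first candidate format that matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"Partition value {text} cannot be parsed from string.")


print(parse_partition_timestamp("1970-01-01 00:00:00.123456"))
print(parse_partition_timestamp("1970-01-01 00:00:00"))
```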