support PyArrow timestamptz with Etc/UTC #910
Changes from 5 commits
@@ -174,6 +174,7 @@
 MAP_KEY_NAME = "key"
 MAP_VALUE_NAME = "value"
 DOC = "doc"
+UTC_ALIASES = {"UTC", "+00:00", "Etc/UTC", "Z"}

 T = TypeVar("T")

@@ -937,7 +938,7 @@ def primitive(self, primitive: pa.DataType) -> PrimitiveType:
         else:
             raise TypeError(f"Unsupported precision for timestamp type: {primitive.unit}")

-        if primitive.tz == "UTC" or primitive.tz == "+00:00":
+        if primitive.tz in UTC_ALIASES:
             return TimestamptzType()
         elif primitive.tz is None:
             return TimestampType()
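To make the effect of the new alias set concrete, here is a minimal standalone sketch (not part of the diff) of how an Arrow timestamp type now resolves to an Iceberg type. The helper name arrow_timestamp_to_iceberg is hypothetical and only mirrors the tz check in primitive() above:

```python
import pyarrow as pa
from pyiceberg.types import TimestampType, TimestamptzType

# Same alias set as added in this PR.
UTC_ALIASES = {"UTC", "+00:00", "Etc/UTC", "Z"}

def arrow_timestamp_to_iceberg(dtype: pa.DataType):
    """Hypothetical helper mirroring the tz handling in primitive()."""
    if not pa.types.is_timestamp(dtype):
        raise TypeError(f"Expected a timestamp type, got: {dtype}")
    if dtype.tz in UTC_ALIASES:
        return TimestamptzType()  # timezone-aware, normalized to UTC
    elif dtype.tz is None:
        return TimestampType()    # timezone-naive
    raise TypeError(f"Unsupported timezone for timestamptz: {dtype.tz}")

assert isinstance(arrow_timestamp_to_iceberg(pa.timestamp("us", tz="Etc/UTC")), TimestamptzType)
assert isinstance(arrow_timestamp_to_iceberg(pa.timestamp("us")), TimestampType)
```

Before this change only the literal strings "UTC" and "+00:00" mapped to timestamptz, as shown in the removed line of the hunk above.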
@@ -1320,7 +1321,16 @@ def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array:
                and pa.types.is_timestamp(values.type)
                and values.type.unit == "ns"
            ):
-               return values.cast(target_type, safe=False)
+               if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz:
+                   return values.cast(target_type, safe=False)
+           if (
+               pa.types.is_timestamp(target_type)
+               and target_type.unit == "us"
+               and pa.types.is_timestamp(values.type)
+               and values.type.unit in {"s", "ms", "us"}
+           ):
+               if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz:
+                   return values.cast(target_type)
        return values

    def _construct_field(self, field: NestedField, arrow_type: pa.DataType) -> pa.Field:

Review comment (on the new timezone condition): Can we add parentheses?
Reply: Added.

Review comment (on the unit cast): Does this look correct, that we are casting the types in order to follow the Iceberg Spec for Parquet Physical and Logical Types? https://iceberg.apache.org/spec/#parquet
Reply: Yes, it looks good. Arrow writes …
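Two notes on the new condition, offered as a hedged illustration rather than as part of the change. First, Python's `and` binds tighter than `or`, so the unparenthesized check already groups as (UTC target and aliased-UTC source) or (both naive); the parentheses requested above are for readability. Second, the branch allows casting a column written with an aliased UTC zone to the requested UTC type, for example:

```python
from datetime import datetime, timezone
import pyarrow as pa

UTC_ALIASES = {"UTC", "+00:00", "Etc/UTC", "Z"}

# A column as it might appear in a data file: nanosecond precision, Etc/UTC zone.
values = pa.array(
    [datetime(2024, 1, 1, tzinfo=timezone.utc)],
    type=pa.timestamp("ns", tz="Etc/UTC"),
)
# The type requested by the Iceberg schema: microseconds, UTC.
target_type = pa.timestamp("us", tz="UTC")

# Same grouping as the diff, with the parentheses made explicit.
if (target_type.tz == "UTC" and values.type.tz in UTC_ALIASES) or (
    not target_type.tz and not values.type.tz
):
    values = values.cast(target_type, safe=False)  # safe=False permits the ns -> us truncation

print(values.type)  # timestamp[us, tz=UTC]
```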
@@ -528,10 +528,6 @@ def append(self, df: pa.Table, snapshot_properties: Dict[str, str] = EMPTY_DICT)
        )

        _check_schema_compatible(self._table.schema(), other_schema=df.schema)
-       # cast if the two schemas are compatible but not equal
-       table_arrow_schema = self._table.schema().as_arrow()
-       if table_arrow_schema != df.schema:
-           df = df.cast(table_arrow_schema)

        manifest_merge_enabled = PropertyUtil.property_as_bool(
            self.table_metadata.properties,

Review comment: Removed this cast function …
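For context, a hedged end-to-end sketch of what the PR enables from a user's point of view: appending a PyArrow table whose timestamp column carries the Etc/UTC zone. The catalog name and table identifier below are hypothetical, and an existing catalog configuration is assumed.

```python
from datetime import datetime, timezone
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # hypothetical catalog configuration
table = catalog.load_table("db.events")  # hypothetical table with a timestamptz column

df = pa.table({
    "event_ts": pa.array(
        [datetime(2024, 1, 1, tzinfo=timezone.utc)],
        # Etc/UTC is an alias of UTC; with this PR it is accepted as timestamptz.
        type=pa.timestamp("us", tz="Etc/UTC"),
    ),
})

# The schema compatibility check now accepts the aliased UTC zone.
table.append(df)
```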
@@ -587,10 +583,6 @@ def overwrite(
        )

        _check_schema_compatible(self._table.schema(), other_schema=df.schema)
-       # cast if the two schemas are compatible but not equal
-       table_arrow_schema = self._table.schema().as_arrow()
-       if table_arrow_schema != df.schema:
-           df = df.cast(table_arrow_schema)

        self.delete(delete_filter=overwrite_filter, snapshot_properties=snapshot_properties)

Review comment: When we're requesting 'us' and the file provides 'us', I don't think we need to cast?

Reply: Sorry, this is a bit confusing, I agree 😓. I included it here because I wanted to support casting pa.timestamp('us', tz='Etc/UTC') to pa.timestamp('us', tz='UTC') within the same condition. I think we won't hit this condition if both the input and requested types are pa.timestamp('us'), because we only enter this block when target_type and values.type are not equal: https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L1315
Review comment: I think we might need to do some more work here. In Iceberg we're rather strict on the distinction between Timestamp and TimestampTZ. A good way of showing this can be found in pyiceberg/expressions/literals.py, lines 566 to 572 at aceed2a. This is where we parse a string from a literal, which often comes from an expression such as dt >= '1925-05-22T00:00:00'. If the value has a timezone, even UTC, we reject it as a Timestamp. When it has a timezone, we normalize it to UTC and then store the integer.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good @Fokko- thank you for the review.
I've made these checks stricter and also clearer for users to follow.
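As an illustration of the strictness described in the comment above, here is a rough standalone sketch of that behaviour. It is not the actual literals.py code; the function name and signature are made up, and only the stated rule is reproduced: a value with any offset (even UTC) is rejected for plain timestamp, while timestamptz values are normalized to UTC and stored as integer microseconds.

```python
from datetime import datetime, timezone, timedelta

def parse_timestamp_literal(value: str, timezone_aware: bool) -> int:
    """Hypothetical sketch: parse an ISO-8601 string to microseconds since epoch."""
    dt = datetime.fromisoformat(value)
    if not timezone_aware:
        if dt.tzinfo is not None:
            # Even an explicit UTC offset is rejected for a plain timestamp.
            raise ValueError(f"Invalid timestamp (has a timezone): {value}")
        epoch = datetime(1970, 1, 1)
    else:
        if dt.tzinfo is None:
            raise ValueError(f"Invalid timestamptz (missing a timezone): {value}")
        dt = dt.astimezone(timezone.utc)  # normalize to UTC before storing
        epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return int((dt - epoch) / timedelta(microseconds=1))

print(parse_timestamp_literal("1925-05-22T00:00:00", timezone_aware=False))
print(parse_timestamp_literal("1925-05-22T00:00:00+00:00", timezone_aware=True))
```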