Unable to interface with data written from Spark Databricks #1651

Closed
ikstewa opened this issue Sep 20, 2023 · 4 comments · Fixed by #1661
Labels
bug Something isn't working

Comments

@ikstewa

ikstewa commented Sep 20, 2023

Environment

Delta-rs version: 0.10.2

Binding: python / rust

Environment:

  • Azure
  • Locally
  • Spark Databricks + Azure

Bug

What happened:
When attempting to interoperate with Databricks, we're getting inconsistent partition-path encodings, which makes it impossible to use the same table across clients.

After the fix from #1613 we're getting closer, but the encoding is still not consistent with Databricks.

Python

When writing from Python, the partition paths are formatted as:
partition_date=2023-09-15%2000%3A00%3A00.000000

Rust

When writing from Rust, the path is unencoded:
partition_date=2023-09-15 00:00:00

Databricks

When writing from Spark Databricks to Delta Lake, we see partial encoding (the colons are escaped but the space is not):
partition_date=2023-09-15 00%3A00%3A00
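For comparison, the fully encoded Python path looks like standard percent-encoding applied to the whole value. A minimal sketch reproducing the first and third encodings with urllib.parse (the safe sets here are my guesses at each writer's behavior, not taken from either codebase):

from urllib.parse import quote

value = "2023-09-15 00:00:00.000000"

# Escaping everything except unreserved characters matches the Python writer.
print(quote(value, safe=""))    # 2023-09-15%2000%3A00%3A00.000000

# Keeping the space in the safe set escapes only the colons,
# which resembles the Databricks path.
print(quote(value, safe=" "))   # 2023-09-15 00%3A00%3A00.000000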

What you expected to happen:

Would expect to have consistent encoding across platforms.

How to reproduce it:

Write to Azure using Databricks, see partition layout.

Sample Python run locally:

import datetime
from deltalake import write_deltalake
import pyarrow as pa

data = pa.table({"data": pa.array(["mydata"]),
                 "inserted_at": pa.array([datetime.datetime.now()]),
                 "partition_column": pa.array([datetime.datetime(2023, 9, 15)])})



write_deltalake(table_or_uri="./unqueryable_table2", \
  mode="append", \
  data=data, \
  partition_by=["partition_column"]
)
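After running this locally, the encoded partition directory can be confirmed by listing the table path (a minimal check; the expected name is taken from the paths above):

import os

# Expect something like: ['partition_column=2023-09-15%2000%3A00%3A00.000000']
print([p for p in os.listdir("./unqueryable_table2")
       if p.startswith("partition_column=")])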
@wjones127
Collaborator

> When attempting to interoperate with Databricks, we're getting inconsistent partition-path encodings, which makes it impossible to use the same table across clients.

When you say the encoding of the partition, are you referring to the encoding in the log, or just the file paths?

FWIW, the file paths shouldn't be consequential as long as they can be read and recognized. The partition values are taken from the log, not the directory structure.
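For anyone who wants to check, a minimal sketch for dumping the partitionValues recorded in the add actions (this reads the Delta log JSON directly; the table path and single-commit file name are assumptions based on the repro above):

import json

# Each line of a commit file is one action; "add" actions carry the
# partitionValues that readers actually use, independent of the file path.
with open("./unqueryable_table2/_delta_log/00000000000000000000.json") as f:
    for line in f:
        action = json.loads(line)
        if "add" in action:
            print(action["add"]["path"], action["add"]["partitionValues"])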

@ikstewa
Author

ikstewa commented Sep 20, 2023

I was referring to the file paths.

I'll do more digging to see how the partition values in the logs compare, and I'll update here with more findings.

@wjones127
Collaborator

To put it another way: are you actually getting an error or failure, or are you just confused by what the file paths look like?

@caseyrathbone

I still seem to be running into issues reading from Delta tables partitioned by datetime:

# write_table.py
import datetime
from deltalake import write_deltalake
import pyarrow as pa

data = pa.table({"id": pa.array([425], type=pa.int32()),
                 "data": pa.array(["python-module-test-write"]),
                 "t": pa.array([datetime.datetime(2023, 9, 15)])})

write_deltalake(
    table_or_uri="./dt",
    mode="append",
    data=data,
    partition_by=["t"],
)
# read_table.py
from deltalake import DeltaTable

dt = DeltaTable(table_uri="./dt")
dataset = dt.to_pyarrow_dataset()

print(dataset.count_rows())
> python read_table.py
Traceback (most recent call last):
  File "/Users/crathbone/offline-spark/simple/read_table.py", line 4, in <module>
    dataset = dt.to_pyarrow_dataset()
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/deltalake/table.py", line 540, in to_pyarrow_dataset
    for file, part_expression in self._table.dataset_partitions(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/scalar.pxi", line 88, in pyarrow.lib.Scalar.cast
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: error parsing '2023-09-15%2000%3A00%3A00.000000' as scalar of type timestamp[us]
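The failure is reproducible outside of deltalake: pyarrow can't parse the percent-encoded string as a timestamp, while the decoded value casts fine. A minimal sketch (the value is copied from the traceback above):

from urllib.parse import unquote
import pyarrow as pa

encoded = "2023-09-15%2000%3A00%3A00.000000"

# The decoded string parses as an ISO-style timestamp.
print(pa.scalar(unquote(encoded)).cast(pa.timestamp("us")))

# The raw encoded string raises ArrowInvalid, matching the traceback.
try:
    pa.scalar(encoded).cast(pa.timestamp("us"))
except pa.ArrowInvalid as e:
    print(e)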
