Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faulty behavior on nullable columns when using DeltaTable.merge #2991

Closed
joepatol opened this issue Nov 13, 2024 · 2 comments · Fixed by #3026
Closed

Faulty behavior on nullable columns when using DeltaTable.merge #2991

joepatol opened this issue Nov 13, 2024 · 2 comments · Fixed by #3026
Labels
bug Something isn't working

Comments

@joepatol
Copy link

joepatol commented Nov 13, 2024

Environment

Delta-rs version: 0.21.0

Binding: Python

Environment:

  • Cloud provider: local & Azure
  • OS: Ubuntu
  • Other:

Bug

There seems to be faulty behavior in DeltaTable.merge when columns are non-nullable. (i.e. nullable=False).

What happened:
if I merge an empty table with None values the first time, this operation runs ok. The second time the operation fails.

What you expected to happen:
I expect if a column is non-nullable, trying to write a null / None value to it, would always immediately raise an error.

How to reproduce it:
Run the following script, make sure the table does not exist:

from deltalake.schema import Schema, Field, PrimitiveType
from deltalake import DeltaTable
import pandas as pd
import os
from pathlib import Path

TABLE_PATH = str(Path(os.path.dirname(__file__)) / "my_table")

schema = Schema([
    Field("id", PrimitiveType("integer"), nullable=False),
    Field("bool", PrimitiveType("boolean"), nullable=False),
])


df1 = pd.DataFrame(
    columns=["id", "bool"],
    data=[
        [1, True],
        [2, None],
        [3, False],
    ],
)


df2 = pd.DataFrame(
    columns=["id", "bool"],
    data=[
        [4, False],
        [5, None],
        [6, True],
    ],
)


if not DeltaTable.is_deltatable(TABLE_PATH):
    DeltaTable.create(TABLE_PATH, schema=schema)

for df in [df1, df2]:
    dt = DeltaTable(TABLE_PATH)

    upsert_result = dt.merge(
        source=df,
        source_alias="s",
        target_alias="t",
        predicate="s.id = t.id",
    ) \
        .when_matched_update_all() \
        .when_not_matched_insert_all() \
        .execute()

    print(upsert_result)

    print(dt.to_pandas())

Running this the first time yields:

{'num_source_rows': 3, 'num_target_rows_inserted': 3, 'num_target_rows_updated': 0, 'num_target_rows_deleted': 0, 'num_target_rows_copied': 0, 'num_output_rows': 3, 'num_target_files_scanned': 0, 'num_target_files_skipped_during_scan': 0, 'num_target_files_added': 1, 'num_target_files_removed': 0, 'execution_time_ms': 22, 'scan_time_ms': 0, 'rewrite_time_ms': 3}
   id   bool
0   2   None
1   3  False
2   1   True
{'num_source_rows': 3, 'num_target_rows_inserted': 3, 'num_target_rows_updated': 0, 'num_target_rows_deleted': 0, 'num_target_rows_copied': 0, 'num_output_rows': 3, 'num_target_files_scanned': 0, 'num_target_files_skipped_during_scan': 1, 'num_target_files_added': 1, 'num_target_files_removed': 0, 'execution_time_ms': 24, 'scan_time_ms': 0, 'rewrite_time_ms': 2}
   id   bool
0   4  False
1   6   True
2   5   None
3   2   None
4   3  False
5   1   True

Now run the script again, this yields:

Traceback (most recent call last):
  File "/home/joepatol/RIKS_repos/sbx/test_dataproduct_local/deltalake_issues/nullable_boolean.py", line 50, in <module>
    .execute()
     ^^^^^^^^^
  File "/home/joepatol/RIKS_repos/sbx/.venv311/lib/python3.11/site-packages/deltalake/table.py", line 1799, in execute
    metrics = self._table.merge_execute(self._builder)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_internal.DeltaError: Generic DeltaTable error: External error: External error: Arrow error: Invalid argument error: Found unmasked nulls for non-nullable StructArray field "bool"

More details:
I tested this on a table thats written to my local filesystem, to Azure storage and to an Azurite emulator, all with the same result. I also tested with different column types, again the same result.

@joepatol joepatol added the bug Something isn't working label Nov 13, 2024
@joepatol
Copy link
Author

It, understandably, seems that table is just broken after a null value has been written.

E.g. first merging:

pd.DataFrame(
    columns=["id", "bool"],
    data=[
        [1, True],
        [2, True],
        [3, True],
    ],
)

is Ok

Now merge:

pd.DataFrame(
    columns=["id", "bool"],
    data=[
        [1, True],
        [2, None],
        [3, True],
    ],
)

No error

Now merge:

pd.DataFrame(
    columns=["id", "bool"],
    data=[
        [1, True],
        [2, True],
        [3, True],
    ],
)

Yields the error described above

@edfreeman
Copy link

I've just noticed the same behaviour. .when_not_matched_insert_all() will always allow you to insert null into non-nullable columns when rows don't already exist based on the predicate.

And, as shown in this issue, .when_matched_update_all() will happily allow you to overwrite previously valid values with invalid null values, but after that happens those rows are no longer possible to update.

Conversely, a simple write_deltalake("../../", df, mode='append') blocks nulls as expected,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants