
Glue scan with filter throws list index out of range #1804

Closed
1 of 3 tasks
Cabeda opened this issue Mar 18, 2025 · 1 comment · Fixed by #1901

Comments


Cabeda commented Mar 18, 2025

Apache Iceberg version

0.9.0 (latest release)

Please describe the bug 🐞

Hi,

Not sure if this is a bug, but in the worst case this might be something for others to look into in the future.

I've created a table as follows using pyiceberg:

from pyiceberg.schema import Schema
from pyiceberg.types import BooleanType, NestedField, StringType, TimestampType

schema = Schema(
    NestedField(field_id=1, name="bk_id", field_type=StringType(), required=False),
    NestedField(field_id=2, name="inference_date", field_type=TimestampType(), required=False),
    NestedField(field_id=3, name="verified", field_type=BooleanType(), required=False),
    NestedField(field_id=4, name="id", field_type=StringType(), required=True),
)
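
For reference, a minimal sketch of how a table with this schema might have been created through the Glue catalog. The creation step isn't shown in the report, so the catalog settings, warehouse_path, and table identifier below are assumptions mirroring the repro code further down:

from pyiceberg.catalog import load_catalog

# Assumed creation step: register the table in Glue using the schema defined above.
catalog = load_catalog(
    "glue",
    **{"type": "glue", "warehouse": warehouse_path},
)
catalog.create_table("database_name.table_name", schema=schema)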

I've been able to do multiple appends to the table using pyiceberg with no issues.

Now, to run some tests and prepare to use the new upsert operation, I decided to append a row with id = 'dummy_id' and then run a scan filtering on it. When I query through AWS Athena I can see the row; however, when scanning with dummy = table.scan(row_filter=EqualTo("id", 'dummy_id')) I get list index out of range. This seems to be because pyiceberg isn't able to retrieve the row.

Here is the code I have set up to replicate the issue:

import pandas as pd
import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

# Single row to append.
df = pa.Table.from_pydict(
    {
        "bk_id": ["BK123456"],
        "inference_date": [pd.Timestamp.now()],
        "verified": [False],
        "id": ["dummy_id"],
    }
)

# Glue catalog; nanosecond timestamps are downcast to microseconds on write.
catalog = load_catalog(
    "glue",
    **{
        "type": "glue",
        "warehouse": warehouse_path,
        "downcast-ns-timestamp-to-us-on-write": True,
    },
)

table_identifier = "database_name.table_name"
table = catalog.load_table(table_identifier)

table.append(df)

# Scan filtered on the newly appended id; this raises "list index out of range".
dummy = table.scan(row_filter=EqualTo("id", 'dummy_id'))
dummy.to_arrow()
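
While the filtered scan is failing, one possible client-side workaround (just a sketch, not a recommended fix, and potentially expensive for large tables) is to read without a row filter and filter the resulting Arrow table in memory:

import pyarrow.compute as pc

# Read the table unfiltered, then filter on the id column in memory.
full = table.scan().to_arrow()
dummy_rows = full.filter(pc.equal(full["id"], "dummy_id"))
print(dummy_rows.num_rows)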

Is there something I'm doing wrong?

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
Cabeda commented Mar 25, 2025

Seems like the issue is due to apache/arrow#44366
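
Since the root cause appears to be in Arrow rather than pyiceberg, it may help to record which pyarrow version is installed when reproducing (a trivial check, not a fix; whether a given pyarrow release contains the upstream fix is something to verify against apache/arrow#44366):

import pyarrow as pa

# Note the pyarrow version in use; behaviour may differ between releases.
print(pa.__version__)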
