Arrow: Allow missing field-ids from Schema #183
Conversation
Force-pushed e5923dd to 405d36c
To clarify my understanding: I think this fallback is expected to function correctly only if the table schema's field IDs are generated the same way as in the visitor (assign_fresh_field_id). This should generally hold true, since the official Iceberg API always reassigns schema field IDs when creating a table. Please let me know if I've misunderstood anything here. Thanks!
pyiceberg/io/pyarrow.py (outdated)

    def __init__(self) -> None:
        self.counter = count()
Shall we use count(1) here, since we start from 1 in assign_fresh_schema_ids?
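A quick stdlib illustration of the off-by-one being raised here: itertools.count() starts at 0, while count(1) starts at 1, matching the counter used in assign_fresh_schema_ids.

```python
from itertools import count

# count() with no argument begins at 0; count(1) begins at 1.
ids_from_zero = count()
ids_from_one = count(1)

first_from_zero = next(ids_from_zero)  # 0
first_from_one = next(ids_from_one)    # 1
```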
iceberg-python/pyiceberg/schema.py, lines 1221 to 1230 in 4616d03

    class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]):
        """Traverses the schema and assigns monotonically increasing ids."""

        reserved_ids: Dict[int, int]

        def __init__(self, next_id_func: Optional[Callable[[], int]] = None) -> None:
            self.reserved_ids = {}
            counter = itertools.count(1)
            self.next_id_func = next_id_func if next_id_func is not None else lambda: next(counter)
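To make the pre-order point concrete, here is a minimal, self-contained sketch (hypothetical tuple-based schema, not the real pyiceberg visitor) of assigning monotonically increasing IDs in pre-order, so a struct receives its ID before any of its children:

```python
from itertools import count

def assign_ids_preorder(node, counter=None):
    """Assign IDs in pre-order. A node is ('name', [children]) for a
    struct, or a bare 'name' string for a leaf field."""
    counter = counter or count(1)  # IDs start at 1, like itertools.count(1) above
    if isinstance(node, tuple):
        name, children = node
        node_id = next(counter)  # parent gets its ID first (pre-order)
        return (name, node_id, [assign_ids_preorder(c, counter) for c in children])
    return (node, next(counter))

schema = ("root", ["a", ("s", ["x", "y"]), "b"])
result = assign_ids_preorder(schema)
# pre-order numbering: root=1, a=2, s=3, x=4, y=5, b=6
```

A post-order traversal would instead number the children of `s` before `s` itself, which is why the two orderings produce different ID layouts for nested types.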
I like the idea of using assign_fresh_schema_ids, since that one is pre-order and the default is post-order. I've updated the code; let me know what you think! Appreciate the review!
I happen to have an iceberg table (migrated from delta lake) whose parquet files contain no field-id. With this change, I am now able to use pyiceberg to read its data. This is really great! Out of curiosity, are there any additional use-cases where this PR might be beneficial?
> I happen to have an iceberg table (migrated from delta lake) whose parquet files contain no field-id. With this change, I am now able to use pyiceberg to read its data.
There is already a way to assign field IDs when they are not in a data file, using a name mapping. All reads that need to infer field IDs must use a name mapping rather than assigning IDs per data file.
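For context on the name-mapping approach referenced here: Iceberg stores a name mapping as JSON in the table property `schema.name-mapping.default`, mapping file column names to field IDs. The helper below is a hypothetical sketch of how such a mapping resolves names, not pyiceberg's actual implementation:

```python
import json

# A name mapping: each entry maps one or more column names (e.g. old and new
# names of the same column) to a stable Iceberg field ID.
name_mapping = [
    {"field-id": 1, "names": ["id"]},
    {"field-id": 2, "names": ["data", "payload"]},
]
# Stored on the table as a JSON-encoded property.
properties = {"schema.name-mapping.default": json.dumps(name_mapping)}

def field_id_for(name, mapping):
    """Hypothetical helper: resolve a file column name to its field ID."""
    for entry in mapping:
        if name in entry["names"]:
            return entry["field-id"]
    return None
```

Because the mapping is keyed on names rather than positions, it stays correct even when the file's column order differs from the table schema, which is the safety property the comment is arguing for.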
Force-pushed d507fcd to 27017cf
Based on my current understanding, this pull request can serve as a temporary solution for handling parquet files lacking field-ids until the name-mapping feature is implemented. Overall, it looks good to me! I just have one suggestion about adding more detail to the warning message.

Additionally, I want to point out some prerequisites for the field-id assignment process to work when reading tables:

- The table_schema must be processed with assign_field_id (or its equivalent function in Java) before being written to the table.
- The column order in the parquet file should align with that in the table_schema.

If these prerequisites are not met, we might encounter errors during column binding and data reading. I think this is ultimately why we want name-mapping. Please let me know if there's any aspect of this I might be misunderstanding.
Thanks!
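The second prerequisite can be illustrated with a tiny sketch (hypothetical column names): positional field-id assignment silently binds the wrong columns as soon as the file's column order diverges from the table schema.

```python
# Table schema order determines the IDs: id=1, name=2, ts=3.
table_columns = ["id", "name", "ts"]
table_ids = {col: i + 1 for i, col in enumerate(table_columns)}

# The same columns written to a file in a different order.
file_columns = ["name", "id", "ts"]
positional_ids = {col: i + 1 for i, col in enumerate(file_columns)}

# Positional assignment gives "id" field-id 2, but the table expects 1:
# the read would bind "id" data to the "name" column.
mismatched = {c for c in table_columns if table_ids[c] != positional_ids[c]}
```

This silent mis-binding, rather than a loud error, is what makes a name-based mapping the safer long-term mechanism.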
Thanks for the Java context here! Appreciate it!
This will still be the case, but this aims at the case where we're just reading a Parquet file (one that doesn't have field-id metadata) and turning that dataframe into a table. So this is more on the read side than the write side. Currently, if you transform a field, it will just drop the fields, which is not what we want.
I agree, we could see if we can have a way to wire up the names. That's a great idea! Could you create an issue on that?
I don't see the link with name-mapping. Name mapping provides an external mapping of column names to IDs, but this is more about writing new data to a new Iceberg table where there are no IDs. Just thinking out loud from a practical perspective: if you retrain a model, you have results, and you want to overwrite an existing table that has the same columns, then you want to wire up those names; in that case we might want to defer the assignment of IDs until we write. I think we can elaborate on this.
Thanks for the detailed explanation! I was originally looking at how this PR helps with reading, especially focusing on the code in iceberg-python/pyiceberg/io/pyarrow.py, lines 827 to 831 in 043aba5.
But now I totally get the write-side importance too. Based on your suggestion, I've opened issue #199. Would love your thoughts on it or any suggestions you might have!
LGTM!
LGTM as well. I was running into an issue reading Iceberg tables that had data files side-loaded with the add_files migration procedure, and this PR will resolve it.
    /lib/python3.10/site-packages/pyiceberg/io/pyarrow.py in _get_field_id(field)
        703 def _get_field_id(field: pa.Field) -> Optional[int]:
        704     for pyarrow_field_id_key in PYARROW_FIELD_ID_KEYS:
    --> 705         if field_id_str := field.metadata.get(pyarrow_field_id_key):
        706             return int(field_id_str.decode())
        707     return None

    AttributeError: 'NoneType' object has no attribute 'get'
Per my comment, I'm -1 if the intent of this PR is to read data files without IDs. That must use a name mapping in order to be safe!