Arrow: Allow missing field-ids from Schema #183
Conversation
Force-pushed e5923dd to 405d36c
To clarify my understanding: I think this fallback is expected to function correctly only if the table schema's field IDs are generated the same way as in the visitor (assign_fresh_field_id). This should generally hold true, since the official Iceberg API always reassigns schema field IDs when creating a table. Please let me know if I've misunderstood anything here. Thanks!
pyiceberg/io/pyarrow.py (outdated)

    def __init__(self) -> None:
        self.counter = count()
Shall we use count(1) here, since we start from 1 in assign_fresh_schema_ids?
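A quick stdlib illustration of the off-by-one being raised here: itertools.count() starts at 0, while count(1) starts at 1, matching the counter used in assign_fresh_schema_ids.

```python
from itertools import count

# count() with no argument begins at 0; count(1) begins at 1.
ids_from_zero = count()
ids_from_one = count(1)

first_from_zero = next(ids_from_zero)  # 0
first_from_one = next(ids_from_one)    # 1
```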
iceberg-python/pyiceberg/schema.py, lines 1221 to 1230 in 4616d03

    class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]):
        """Traverses the schema and assigns monotonically increasing ids."""

        reserved_ids: Dict[int, int]

        def __init__(self, next_id_func: Optional[Callable[[], int]] = None) -> None:
            self.reserved_ids = {}
            counter = itertools.count(1)
            self.next_id_func = next_id_func if next_id_func is not None else lambda: next(counter)
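To make the pre-order point concrete, here is a minimal, self-contained sketch (hypothetical tuple-based schema, not the real pyiceberg visitor) of assigning monotonically increasing IDs in pre-order, so a struct receives its ID before any of its children:

```python
from itertools import count

def assign_ids_preorder(node, counter=None):
    """Assign IDs in pre-order. A node is ('name', [children]) for a
    struct, or a bare 'name' string for a leaf field."""
    counter = counter or count(1)  # IDs start at 1, like itertools.count(1) above
    if isinstance(node, tuple):
        name, children = node
        node_id = next(counter)  # parent gets its ID first (pre-order)
        return (name, node_id, [assign_ids_preorder(c, counter) for c in children])
    return (node, next(counter))

schema = ("root", ["a", ("s", ["x", "y"]), "b"])
result = assign_ids_preorder(schema)
# pre-order numbering: root=1, a=2, s=3, x=4, y=5, b=6
```

A post-order traversal would instead number the children of `s` before `s` itself, which is why the two orderings produce different ID layouts for nested types.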
I like the idea of using assign_fresh_schema_ids, since that one is pre-order and the default is post-order. I've updated the code; let me know what you think! Appreciate the review!
I happen to have an iceberg table (migrated from delta lake) whose parquet files contain no field-id. With this change, I am now able to use pyiceberg to read its data. This is really great! Out of curiosity, are there any additional use-cases where this PR might be beneficial?
> I happen to have an iceberg table (migrated from delta lake) whose parquet files contain no field-id. With this change, I am now able to use pyiceberg to read its data.
There is already a way to assign field IDs when they are not in a data file, using a name mapping. All reads that need to infer field IDs must use a name mapping rather than assigning IDs per data file.
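For context on the name-mapping approach referenced here: Iceberg stores a name mapping as JSON in the table property `schema.name-mapping.default`, mapping file column names to field IDs. The helper below is a hypothetical sketch of how such a mapping resolves names, not pyiceberg's actual implementation:

```python
import json

# A name mapping: each entry maps one or more column names (e.g. old and new
# names of the same column) to a stable Iceberg field ID.
name_mapping = [
    {"field-id": 1, "names": ["id"]},
    {"field-id": 2, "names": ["data", "payload"]},
]
# Stored on the table as a JSON-encoded property.
properties = {"schema.name-mapping.default": json.dumps(name_mapping)}

def field_id_for(name, mapping):
    """Hypothetical helper: resolve a file column name to its field ID."""
    for entry in mapping:
        if name in entry["names"]:
            return entry["field-id"]
    return None
```

Because the mapping is keyed on names rather than positions, it stays correct even when the file's column order differs from the table schema, which is the safety property the comment is arguing for.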
Force-pushed d507fcd to 27017cf
Based on my current understanding, this pull request can serve as a temporary solution for handling parquet files lacking field-ids until the name-mapping feature is implemented. Overall, it looks good to me! I just have one suggestion about adding more detail to the warning message.

Additionally, I want to point out some prerequisites for the field-id assignment process to work when reading tables:

- The table_schema must be processed with assign_field_id (or its equivalent function in Java) before being written to the table.
- The column order in the parquet file should align with that in the table_schema.

If these prerequisites are not met, we might encounter errors during column binding and data reading. I think this is ultimately why we want name-mapping. Please let me know if there's any aspect of this I might be misunderstanding.
Thanks!
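The second prerequisite can be illustrated with a tiny sketch (hypothetical column names): positional field-id assignment silently binds the wrong columns as soon as the file's column order diverges from the table schema.

```python
# Table schema order determines the IDs: id=1, name=2, ts=3.
table_columns = ["id", "name", "ts"]
table_ids = {col: i + 1 for i, col in enumerate(table_columns)}

# The same columns written to a file in a different order.
file_columns = ["name", "id", "ts"]
positional_ids = {col: i + 1 for i, col in enumerate(file_columns)}

# Positional assignment gives "id" field-id 2, but the table expects 1:
# the read would bind "id" data to the "name" column.
mismatched = {c for c in table_columns if table_ids[c] != positional_ids[c]}
```

This silent mis-binding, rather than a loud error, is what makes a name-based mapping the safer long-term mechanism.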
Thanks for the Java context here! Appreciate it!
This will still be the case, but this aims at the case where we're just reading a Parquet file (one that doesn't have field-id metadata) and turning that dataframe into a table. So this is more on the read side than the write side. Currently, if you transform a field, it will just drop the fields, which is not what we want.
I agree, we could see if we can have a way to wire up the names. That's a great idea! Could you create an issue on that?
I don't see the link with name-mapping. Name mapping provides an external mapping of column names to IDs, but this is more about writing new data to a new Iceberg table where there are no IDs. Just thinking out loud from a practical perspective: if you retrain a model, you have results, and you want to overwrite an existing table that has the same columns, then you want to wire up those names; in that case we might want to defer the assignment of IDs until we write. I think we can elaborate on this.
Thanks for the detailed explanation! I was originally looking at how this PR helps with reading, especially focusing on the code in iceberg-python/pyiceberg/io/pyarrow.py, lines 827 to 831 in 043aba5.
But now I totally get the write-side importance too. Based on your suggestion, I've opened issue #199. Would love your thoughts on it or any suggestions you might have!
LGTM!
LGTM as well. I was running into an issue reading Iceberg tables that had data files side-loaded with the add_files migration procedure, and this PR will resolve it.
    /lib/python3.10/site-packages/pyiceberg/io/pyarrow.py in _get_field_id(field)
        703 def _get_field_id(field: pa.Field) -> Optional[int]:
        704     for pyarrow_field_id_key in PYARROW_FIELD_ID_KEYS:
    --> 705         if field_id_str := field.metadata.get(pyarrow_field_id_key):
        706             return int(field_id_str.decode())
        707     return None

    AttributeError: 'NoneType' object has no attribute 'get'
Per my comment, I'm -1 if the intent of this PR is to read data files without IDs. That must use a name mapping in order to be safe!