
Arrow: Allow missing field-ids from Schema #183

Closed · wants to merge 4 commits

Conversation

@Fokko (Contributor) commented Dec 5, 2023

No description provided.

@Fokko force-pushed the fd-make-field-ids-optional branch from e5923dd to 405d36c on Dec 5, 2023
@HonahX (Contributor) left a comment:

To clarify my understanding, I think this fallback is expected to function correctly only if the table schema's field IDs are generated in the same way as in the visitor or assign_fresh_field_id. This should generally hold true, as the official Iceberg API always reassigns schema field IDs when creating a table. Please let me know if I've misunderstood anything here. Thanks!


    def __init__(self) -> None:
        self.counter = count()
A contributor commented on this code:

Shall we use count(1) here, since we start from 1 in assign_fresh_schema_ids?

    class _SetFreshIDs(PreOrderSchemaVisitor[IcebergType]):
        """Traverses the schema and assigns monotonically increasing ids."""

        reserved_ids: Dict[int, int]

        def __init__(self, next_id_func: Optional[Callable[[], int]] = None) -> None:
            self.reserved_ids = {}
            counter = itertools.count(1)
            self.next_id_func = next_id_func if next_id_func is not None else lambda: next(counter)
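For illustration, a minimal sketch of the off-by-one difference being pointed out here (plain itertools, nothing pyiceberg-specific):

    from itertools import count

    zero_based = count()   # yields 0, 1, 2, ...: the first field would get ID 0
    one_based = count(1)   # yields 1, 2, 3, ..., matching assign_fresh_schema_ids

    print(next(zero_based), next(one_based))  # prints: 0 1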

@Fokko (Contributor, Author) replied:

Sorry for the limited context here; it will skip a field if it doesn't have an ID:

[screenshot of the skipped-fields behavior]

Which is kind of awkward.

@Fokko (Contributor, Author) commented Dec 7, 2023

I like the idea of using assign_fresh_schema_ids, since that one is a pre-order, and the default is post-order. I've updated the code, let me know what you think! Appreciate the review!

A contributor commented:

I happen to have an iceberg table (migrated from delta lake) whose parquet files contain no field-id. With this change, I am now able to use pyiceberg to read its data. This is really great! Out of curiosity, are there any additional use-cases where this PR might be beneficial?

A contributor replied:

> I happen to have an iceberg table (migrated from delta lake) whose parquet files contain no field-id. With this change, I am now able to use pyiceberg to read its data.

There is already a way to assign field IDs when they are not in a data file, using a name mapping. All reads that need to infer field IDs must use a name mapping rather than assigning IDs per data file.
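For context, a name mapping is a JSON list pairing column names with field IDs, stored under the table property schema.name-mapping.default in the Iceberg spec. A minimal sketch (the column names here are hypothetical):

    import json

    # Hypothetical name mapping: each entry lists the names that resolve to a field ID.
    name_mapping = [
        {"field-id": 1, "names": ["id"]},
        {"field-id": 2, "names": ["data"]},
    ]
    # Stored as JSON in the table property "schema.name-mapping.default".
    print(json.dumps(name_mapping))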

@Fokko force-pushed the fd-make-field-ids-optional branch from d507fcd to 27017cf on Dec 7, 2023
@HonahX (Contributor) left a comment:

Based on my current understanding, this pull request can be a temporary solution for managing Parquet files lacking field-ids, until the name-mapping feature is implemented. Overall, it looks good to me! I just have one suggestion about adding more detail to the warning message.

Additionally, I want to point out some prerequisites for the field-id assignment process to work when reading tables:

  1. The table_schema must be processed with assign_field_id (or its equivalent function in Java) before being written to the table.
  2. The column order in the parquet file should align with that in the table_schema.

If these pre-reqs are not met, we might encounter errors during column binding and data reading (a sketch of the ordering pitfall follows this comment). I think this is ultimately the reason we want name-mapping. Please let me know if there's any aspect of this I might be misunderstanding.

Thanks!
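To make the second prerequisite concrete, a hypothetical sketch of how positional ID assignment goes wrong when the file's column order differs from the table schema (all names here are made up):

    # Table schema, with the field IDs it was created with: id=1, data=2.
    table_ids = {"id": 1, "data": 2}

    # A Parquet file written with the columns swapped and no field-id metadata.
    file_columns = ["data", "id"]

    # Positional fallback: assign IDs 1, 2, ... in file order.
    inferred = {name: i for i, name in enumerate(file_columns, start=1)}
    print(inferred)               # {'data': 1, 'id': 2}
    print(inferred != table_ids)  # True: 'data' is now bound to field ID 1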

@Fokko (Contributor, Author) commented Dec 9, 2023

Thanks for the Java context here! Appreciate it!

> The table_schema must be processed with assign_field_id (or its equivalent function in Java) before being written to the table.

This will still be the case, but this aims at the situation where we're just reading a Parquet file (one that doesn't have field-id metadata) and turning that dataframe into a table. So this is more on the read side than the write side. Currently, if you transform a field, it will just drop the fields, which is not what we want.

> The column order in the parquet file should align with that in the table_schema.

I agree, we could see if we can have a way to wire up the names. That's a great idea! Could you create an issue on that?

> If these pre-reqs are not met, we might encounter errors during column binding and data reading. I think this is the reason that we want Name-mapping finally.

I don't see the link with name-mapping: name mapping provides an external mapping of column names to IDs, but this is more about writing new data to a new Iceberg table where there are no IDs yet.

Just thinking out loud from a practical perspective: if you retrain a model, have results, and want to overwrite an existing table that has the same columns, then you want to wire up those names; in that case we might want to defer the assignment of IDs until we write. I think we can elaborate on this.
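To make the read-side scenario concrete, a minimal sketch with pyarrow showing a Parquet file that carries no field-id metadata at all (the file path and data are hypothetical):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Write a Parquet file the way a plain dataframe engine would,
    # i.e. without any Iceberg field-id metadata on the columns.
    pq.write_table(pa.table({"id": [1, 2], "data": ["a", "b"]}), "/tmp/no_ids.parquet")

    schema = pq.read_schema("/tmp/no_ids.parquet")
    print(schema.field("id").metadata)  # None: there is no PARQUET:field_id to read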

@HonahX (Contributor) commented Dec 10, 2023

Thanks for the detailed explanation! I was originally looking at how this PR helps with reading, especially focusing on the code here:

    if metadata := physical_schema.metadata:
        schema_raw = metadata.get(ICEBERG_SCHEMA)
    # TODO: if field_ids are not present, Name Mapping should be implemented to look them up in the table schema,
    # see https://github.com/apache/iceberg/issues/7451
    file_schema = Schema.model_validate_json(schema_raw) if schema_raw is not None else pyarrow_to_schema(physical_schema)

That's why I brought up name-mapping.

But now I totally get the write-side importance too. Based on your suggestion, I've opened Issue #199. Would love your thoughts on it or any suggestions you might have!

@HonahX (Contributor) left a comment:

LGTM!

@sungwy (Collaborator) left a comment:

LGTM as well. I was running into an issue reading Iceberg tables that had data files side-loaded into the table with the add_files migration procedure, and this PR will resolve the issue:

/lib/python3.10/site-packages/pyiceberg/io/pyarrow.py in _get_field_id(field)
    703 def _get_field_id(field: pa.Field) -> Optional[int]:
    704     for pyarrow_field_id_key in PYARROW_FIELD_ID_KEYS:
--> 705         if field_id_str := field.metadata.get(pyarrow_field_id_key):
    706             return int(field_id_str.decode())
    707     return None

AttributeError: 'NoneType' object has no attribute 'get'
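The traceback above comes from calling .get on field.metadata when a field carries no metadata at all. A minimal None-safe sketch of the helper (PYARROW_FIELD_ID_KEYS is reproduced here as an assumption about the pyiceberg constant):

    from typing import Optional

    import pyarrow as pa

    # Assumed value of the pyiceberg constant; metadata keys are bytes in pyarrow.
    PYARROW_FIELD_ID_KEYS = [b"PARQUET:field_id", b"field_id"]

    def _get_field_id(field: pa.Field) -> Optional[int]:
        # Fields written without any metadata have field.metadata set to None.
        if field.metadata is None:
            return None
        for pyarrow_field_id_key in PYARROW_FIELD_ID_KEYS:
            if field_id_str := field.metadata.get(pyarrow_field_id_key):
                return int(field_id_str.decode())
        return None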

@rdblue (Contributor) commented Dec 12, 2023

Per my comment, I'm -1 if the intent of this PR is to read data files without IDs. That must use a name mapping in order to be safe!

@Fokko added this to the PyIceberg 0.6.0 release milestone on Dec 13, 2023
@sungwy mentioned this pull request on Dec 17, 2023
@Fokko closed this on Dec 19, 2023