Apply Name mapping #219

Merged: 71 commits into apache:main on Jan 19, 2024
Conversation

@sungwy (Collaborator) commented Dec 17, 2023

Closes: #202

Based on the following two working branches from @Fokko:

  1. Name-mapping plumbing: Add name-mapping #212
  2. Allow missing field-ids from schema: Arrow: Allow missing field-ids from Schema #183

This PR adds an _ApplyNameMapping SchemaVisitor that traverses the pyarrow schema and applies the provided name_mapping.
The preference order in the pyarrow_to_schema function is:

  1. Use field_ids in file_schema
  2. Use name_mapping (if one exists)
  3. Fall back to file column order if neither of the above works

The above order is motivated by the current logic in the Spark Iceberg Parquet read conf.

TODO:

  • Read and use the table property 'schema.name-mapping.default'
  • A lot more unit test cases to cover edge cases, like field_id=-1
  • Get more context on identifier_field_ids: should they be ignored when field_ids aren't set in the file_schema?

EDIT: I've added a new utility function new_schema_for_table that can be used by PyIceberg users who want to generate a new PyIceberg Schema from an Arrow Schema to help create a new Iceberg table.
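A hypothetical usage sketch of that utility (the exact signature was still being discussed, and the function was later split out into its own issue; 'catalog' and the schema-as-input choice are assumptions for illustration):

import pyarrow as pa

# Build a plain Arrow schema with no field IDs attached.
arrow_schema = pa.schema([
    pa.field("name", pa.string()),
    pa.field("age", pa.int32()),
])
iceberg_schema = new_schema_for_table(arrow_schema)  # fresh field IDs get assigned
catalog.create_table("db.tbl", schema=iceberg_schema)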

Fokko and others added 11 commits December 5, 2023 13:43
All the things to (de)serialize the name-mapping, and
all the necessary visitors and such
Bumps [mypy-boto3-glue](https://github.com/youtype/mypy_boto3_builder) from 1.33.5 to 1.34.0.
- [Release notes](https://github.com/youtype/mypy_boto3_builder/releases)
- [Commits](https://github.com/youtype/mypy_boto3_builder/commits)

---
updated-dependencies:
- dependency-name: mypy-boto3-glue
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@rdblue (Contributor) commented Dec 17, 2023

@syun64, the "fall back to file column order if neither of the above works" behavior is unsafe and cannot be added to the Python implementation. There is some code in the Java implementation that does this, but it is specifically for Netflix and dates from when the code was first donated to the ASF.

ID inference by position does not work with nested fields and can easily cause jobs to fail or produce incorrect data. We are in the process of removing it in Java so we don't want to add it here. It was also only ever supported for Parquet, so this would actually expand the problem.
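To make the hazard concrete (an illustration for this discussion, not taken from the PR): suppose a table was created as

1: name (string)
2: age (int)

and name was later dropped while email was added, leaving

2: age (int)
3: email (string)

An older Parquet file written without field IDs still has columns [name, age]. Inferring IDs by position would map name to 2 (age) and age to 3 (email), silently projecting the wrong columns instead of failing; with nested structs the same mismatch repeats at every level.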

@sungwy (Collaborator Author) commented Dec 18, 2023

Thank you for the context @rdblue. I will remove the fallback logic from pyarrow_to_schema and skip setting the identifier_field_ids property in _ApplyNameMapping.

sungwy and others added 11 commits December 18, 2023 22:50
Bumps [pyarrow](https://github.com/apache/arrow) from 14.0.1 to 14.0.2.
- [Commits](apache/arrow@go/v14.0.1...apache-arrow-14.0.2)

---
updated-dependencies:
- dependency-name: pyarrow
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [moto](https://github.com/getmoto/moto) from 4.2.11 to 4.2.12.
- [Release notes](https://github.com/getmoto/moto/releases)
- [Changelog](https://github.com/getmoto/moto/blob/master/CHANGELOG.md)
- [Commits](getmoto/moto@4.2.11...4.2.12)

---
updated-dependencies:
- dependency-name: moto
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@sungwy sungwy marked this pull request as ready for review December 19, 2023 03:20
@sungwy (Collaborator Author) commented Dec 19, 2023

Hi folks, I've rebased the branch on main to reflect the changes in Add name-mapping, and added some negative test cases.

I have some questions:

  1. The current proposed approach treats the pyarrow schema as having missing field IDs if any of its fields lacks one, and hence requires the table's name-mapping property to be set. Conversely, the version of the ConvertToIceberg visitor on main simply ignores any fields that do not have field_ids in their metadata, and hence is able to read files with partially missing field_ids. Does the proposed behavior sound correct?
  2. In order to achieve the above, I am creating NestedFields with field_id=-1 (since NestedField constrains field_id to an int). Although this feels arbitrary, we could argue it is reasonable, since -1 is also Parquet's arbitrary int of choice for representing missing field_ids. Example:
<pyarrow._parquet.ParquetSchema object at 0x7f72b317fec0>
required group field_id=-1 schema {
  optional double field_id=-1 one;
  optional binary field_id=-1 two (String);
  optional boolean field_id=-1 three;
  optional binary field_id=-1 __index_level_0__ (String);
}

Does this sound reasonable?
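For reference, field IDs travel as Arrow field metadata; a minimal sketch (not code from this PR) of what the visitors inspect:

import pyarrow as pa

# 'PARQUET:field_id' is the metadata key Parquet round-trips; a field written
# without an ID simply has no such key.
with_id = pa.field("one", pa.float64(), metadata={b"PARQUET:field_id": b"1"})
without_id = pa.field("two", pa.string())

print(with_id.metadata)     # {b'PARQUET:field_id': b'1'}
print(without_id.metadata)  # None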

@Fokko (Contributor) commented Dec 19, 2023

@syun64 thanks again for working on this. When there are no IDs and there is no field-mapping, I think we should fall back to assigning IDs, similar to what we do in Java: https://github.com/apache/iceberg/blob/838787e296b502740470ce70f68bb27af4210121/parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java#L225

I see that in Java it is separated out a bit more, with a hasIds visitor, which is nice.

This is also what I proposed in #183. Thoughts, @rdblue?

@sungwy (Collaborator Author) commented Dec 19, 2023

I've discussed with @Fokko offline how we'd like to handle the edge cases. Here's a summary of the logic implemented in the current version following that discussion (a rough sketch follows the list):

  1. HasIds first checks whether ANY of the field IDs are missing in the file.
  2. If HasIds passes, ConvertToIceberg runs without name mapping. If any of the fields doesn't have a Type, we throw a ValueError.
  3. Else, if a NameMapping is specified in the table properties, ConvertToIceberg runs with name mapping.
  4. If any field ID present in the file is missing from the NameMapping, a ValueError is thrown.
  5. If the file does not have IDs (per 1) and name_mapping is also not provided, we throw a ValueError with a specific exception message: https://github.com/apache/iceberg-python/pull/219/files#diff-8d5e63f2a87ead8cebe2fd8ac5dcf2198d229f01e16bb9e06e21f7277c328abdR634
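A minimal sketch of that dispatch (assuming visit_pyarrow as the visitor driver and the visitor names used in this thread; error message abbreviated, not the exact PR code):

from typing import Optional

def pyarrow_to_schema(file_schema: pa.Schema, name_mapping: Optional[NameMapping] = None) -> Schema:
    if visit_pyarrow(file_schema, _HasIds()):
        # 1./2. the file carries field IDs everywhere: use them directly
        return visit_pyarrow(file_schema, _ConvertToIceberg())
    if name_mapping is not None:
        # 3./4. IDs are missing somewhere: resolve them through the name mapping
        return visit_pyarrow(file_schema, _ConvertToIceberg(name_mapping=name_mapping))
    # 5. neither source is available: fail loudly rather than guess by position
    raise ValueError(
        "Parquet file does not have field-ids and the Iceberg table "
        "does not have 'schema.name-mapping.default' defined"
    )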

dependabot bot and others added 14 commits January 13, 2024 03:36
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.2 to 3.1.3.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](pallets/jinja@3.1.2...3.1.3)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
'PARQUET:' prefix is specific to Parquet, with 'PARQUET:field_id' setting the 'field_id'. Removed the non-prefixed alternative for 'field_id'. Removed the prefixed alternative for 'doc'.
All the things to (de)serialize the name-mapping, and
all the necessary visitors and such
@sungwy sungwy changed the title from Apply Name mapping to Apply Name mapping, new_schema_for_table on Jan 14, 2024
self._field_names.pop()


class _ConvertToIcebergWithFreshIds(PreOrderPyArrowSchemaVisitor[Union[IcebergType, Schema]]):
Contributor

A visitor should do one thing, and do it well. WDYT of returning to the original idea of setting it as -1 and then assigning fresh IDs using the existing visitor? We're duplicating a lot of code here.

Collaborator Author

Makes sense @Fokko. Should we have a boolean attribute to drive this logic in _ConvertToIceberg? As @rdblue pointed out in this comment, falling back to assigning fresh IDs as a general fallback sounds like dangerous behavior that we would want to avoid. Having a boolean attribute will make sure that we only use this feature when creating a new table, and that we avoid falling back when there are no IDs in existing Iceberg tables.

Should we have a boolean flag _ignore_ids which, if True, makes _ConvertToIceberg skip the existing ID-reading process and instead assigns all IDs as -1?

Contributor

That's why I was surprised to see the option in the Java API to convert a Spark dataframe to an Iceberg schema with the IDs set based on position. Ideally, we want to have the ability to inject a name-mapping in the _ConvertToIceberg visitor, so we don't rely on positions but on names.

@sungwy (Collaborator Author) commented Jan 16, 2024

@Fokko right - but that sounds a bit like a chicken-and-egg problem... because in order to create a NameMapping, we need to have a PyIceberg Schema as well. Unless we want to create a new visitor NameMappingVisitorFromArrowSchema to create a NameMapping from an ArrowSchema, I think the current approach of assigning IDs directly through the ArrowToSchema visitor sounds more straightforward.

Collaborator Author

@Fokko @HonahX I don't think we reached a conclusion here, so I just wanted to confirm that we were on the same page regarding the final step:

Should we have a boolean flag _ignore_ids which, if True, makes _ConvertToIceberg skip the existing ID-reading process and instead assigns all IDs as -1?

Does this sound like the best way to support this function? We first assign the IDs as -1, using _ignore_ids=True on _ConvertToIceberg, and then we use _SetFreshIDs on the PyIceberg Schema to generate the field IDs using pre-order traversal?
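A sketch of that two-pass idea (the visitor and flag names are from this thread's proposal, not merged code; visit_pyarrow is assumed as the visitor driver):

# Pass 1: convert the Arrow schema, stubbing every field_id with -1.
schema_with_dummy_ids = visit_pyarrow(arrow_schema, _ConvertToIceberg(_ignore_ids=True))

# Pass 2: a pre-order traversal replaces the stubs with real, increasing IDs.
fresh_schema = assign_fresh_schema_ids(schema_with_dummy_ids)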

Contributor

Sorry for the late reply. @syun64 Overall I think this is an effective way to reduce code duplication and achieve the same utility. I want to add my findings from trying this approach:

  1. Assigning everything as -1 and then using assign_fresh_schema_ids does not work out of the box, because _SetFreshIds uses

    def struct(self, struct: StructType, field_results: List[Callable[[], IcebergType]]) -> StructType:
        # assign IDs for this struct's fields first
        self.reserved_ids.update({field.field_id: self._get_and_increment() for field in struct.fields})
        return StructType(*[field() for field in field_results])

    a dictionary to keep track of assigned IDs. If our current schema has -1 for everything, the resulting new schema will also have the same ID for all the fields at the same level (a quick demo follows this comment). To resolve this, we need to either update the _SetFreshIds visitor or assign distinct IDs (-1, -2, ...) when converting the pyarrow schema.

  2. I am wondering whether we can still have separate visitors for the normal-read and ignore-ids cases, by extracting common logic to a parent class. Based on my observation, schema, struct, primitive, and field are the same for both cases. If we refactor it to:

class _ConvertToIceberg(PyArrowSchemaVisitor[Union[IcebergType, Schema]], ABC):
  ...

class _ConvertToIcebergWithFieldIds(_ConvertToIceberg):
  ...

class _ConvertToIcebergWithoutIds(_ConvertToIceberg):

we only need to implement list and map for both visitors, along with their distinct ways of getting field IDs.

In this way, we can still separate the two use cases while keeping code duplication low. However, it does undermine readability, because we split the visitor logic into two places.

I am raising this primarily because a separate, properly named visitor might emphasize that this is a special case for special usage, so it will not confuse others who read this code for reference. @syun64 @Fokko What do you think of this approach, compared with the _ignore_ids boolean flag suggested by @syun64 (which has the least code duplication and is simpler to read)?

(I have a draft implementation for option 2 in my own repo: https://github.com/HonahX/iceberg-python/blob/de14f5cb8d91a26541356dd3c614bee8c4b8cb8c/pyiceberg/io/pyarrow.py#L805)
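A quick demonstration of the duplicate-key collapse described in point 1 (a standalone illustration, not code from the PR):

# With every field_id stubbed as -1, a dict comprehension keyed on field_id
# collapses to a single entry, so all fields at one level share one fresh ID.
field_ids = [-1, -1, -1]
reserved = {fid: new_id for new_id, fid in enumerate(field_ids, start=10)}
print(reserved)  # {-1: 12}: three fields, one surviving entry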

Contributor

By the way, shall we consider moving new_schema_for_table to a separate PR? I think the Apply Name mapping part already contains lots of important changes and is good to go.

Contributor

I share your concern. Currently, we have an API like:

tbl.write(df: pa.Table)

I would say at some point we get something like:

tbl.write(df: pa.Table, merge_schema=True)  # actual name TBD, could also be a property

Assuming that the Arrow dataframe's schema doesn't carry field IDs, we'll use the name mapping to assign them and convert it to an Iceberg schema, and that's all safe. So we need to have the ability to apply name-mapping on a PyArrow schema.

It gets dangerous when people start doing:

new_schema = new_schema_for_table(df: pa.Table)
with tbl.update_schema() as update:
    update.union_with_schema(new_schema)

Which seems reasonable to do if you're new to Iceberg. This is the Java equivalent of UnionByNameVisitor. If you have something like:

1: name (str)
2: age (int)

And you add a field:

1: name (str)
2: phonenumber (int)
3: age (int)

Then age will be renamed to phonenumber and a new field with age will be added. Therefore we want to hide this behind an API like we're doing when creating a new table.

I think that @HonahX made a good point about the _SetFreshIds visitor. Interestingly enough, the implementation on the Java side is also different: it does a lookup on the full column name of the baseSchema, and this baseSchema is null when creating a new table.

I think the problem here is that we don't have an API like in Spark where we can nicely hide things. I'm almost tempted to allow creating a table from a PyArrow table, create_table_from_table(df: pa.Table). That mixes PyArrow into the main API, but it also keeps us from exposing these internals to the user (which isn't super user-friendly in general). WDYT @syun64 @HonahX ?

Collaborator Author

I think the problem here is that we don't have an API like in Spark where we can nicely hide things. I'm almost tempted to allow creating a table from a PyArrow table, create_table_from_table(df: pa.Table). That mixes PyArrow into the main API, but it also keeps us from exposing these internals to the user (which isn't super user-friendly in general). WDYT @syun64 @HonahX ?

I'm in agreement with this idea. I see three ways a user would want to create an Iceberg table:

  1. Completely manual - by specifying the schema, field by field
  2. By inferring the schema from an existing strongly-typed file or pyarrow table
  3. By copying the schema of an existing iceberg table (migration)

Since we are only concerned with the schema, and not the data: what are your thoughts on using the pyarrow schema (instead of the pyarrow table) as the input for this function?

Assuming that the Arrow dataframe's schema doesn't carry field IDs, we'll use the name mapping to assign them and convert it to an Iceberg schema, and that's all safe. So we need to have the ability to apply name-mapping on a PyArrow schema.

Sounds good @Fokko. Since this PR already introduces the ability to apply name-mapping to a PyArrow Schema and create a pyiceberg Schema, if this is the approach we'd like to take, we would need the ability to generate a name-mapping from a PyArrow Schema with no IDs. This is different from the existing _CreateMapping, which creates a name mapping based on an existing pyiceberg Schema that already has IDs assigned.

class _ConvertToIceberg(PyArrowSchemaVisitor[Union[IcebergType, Schema]], ABC):
...
class _ConvertToIcebergWithFieldIds(_ConvertToIceberg):
...
class _ConvertToIcebergWithoutIds(_ConvertToIceberg):

One thing I wanted to note is that the task of assigning fresh IDs to a schema needs to be a pre-order visitor, instead of post-order like _ConvertToIceberg or _CreateMapping. This ensures that a field's ID is assigned before the IDs of its element, key, or value fields (see the toy illustration below). I think that would prevent us from having the two visitors inherit from the same parent class.
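A toy illustration of the pre-order constraint (standalone, not PR code): the parent field claims its ID from the counter before recursing, so IDs read top-down the way fresh-ID assignment expects.

from itertools import count

counter = count(1)

def assign(name: str, children: list) -> None:
    field_id = next(counter)  # the parent claims its ID first (pre-order)
    print(f"{field_id}: {name}")
    for child_name, grandchildren in children:
        assign(child_name, grandchildren)

assign("addresses", [("element", [("street", []), ("zip", [])])])
# 1: addresses
# 2: element
# 3: street
# 4: zip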

Collaborator Author

By the way, shall we consider moving new_schema_for_table to a separate PR? I think the Apply Name mapping part already contains lots of important changes and is good to go.

Great suggestion @HonahX - at first I thought it would be a small lift to add it to this PR, but it seems clear that there's a lot more to be discussed on the topic. I've opened this issue, and I'll reduce the scope of this PR to just Apply Name Mapping.

Let's continue the discussion on the dedicated issue for the topic.

Co-authored-by: Fokko Driesprong <fokko@apache.org>
@sungwy sungwy changed the title from Apply Name mapping, new_schema_for_table to Apply Name mapping on Jan 18, 2024
@sungwy sungwy requested review from Fokko and HonahX January 18, 2024 21:12
@HonahX (Contributor) commented Jan 18, 2024

Thanks for the great work!

@sungwy (Collaborator Author) commented Jan 18, 2024

Need help merging this in as well :)

@HonahX (Contributor) commented Jan 19, 2024

All reviews related to "Apply Name Mapping" are resolved. Let's get this in and continue our discussion in #278 😊. Thanks @syun64

@HonahX HonahX merged commit 70972d9 into apache:main Jan 19, 2024
6 checks passed
Successfully merging this pull request may close these issues:

Support 'schema.name-mapping.default' Column Projection property