
Conversation

Contributor

@rdblue rdblue commented Aug 16, 2024

This PR includes the changes to v3 for type promotion. It also includes some changes to introduce unknown and variant types to show what the type promotion language will be for those types.

I'll remove the type additions before merging. We probably want those in separate commits.

@rdblue rdblue requested a review from Fokko August 16, 2024 23:32
@github-actions github-actions bot added the "Specification" label (Issues that may introduce spec changes) Aug 16, 2024
format/spec.md Outdated
| `unknown` | | _any type_ | |
| `int` | `long` | `long`, `string` | |
| `long` | | `string` | |
| `date` | | `timestamp`, `timestamp_ns` | Promotion to `timestamptz` or `timestamptz_ns` is **not** allowed |
Contributor

How about long => timestamp/timestamp_ns?

Contributor Author

We wanted to do this (although it is closer to timestamptz) but that hits the problem covered by the next paragraph and table. You can't tell whether a lower/upper bound value was written as a timestamptz or a long. If both values are in microseconds, we'd be fine. But most people store timestamps as long in milliseconds. That's what we were hoping to support with long to timestamp promotion, but it won't work until we have more information about the original type of the bounds.
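A minimal sketch of that ambiguity (the bound value is hypothetical; `struct` is only used here to mimic the single-value serialization of an 8-byte long):

```python
import struct

# The serialized lower/upper bound is just 8 little-endian bytes either way.
bound = struct.pack("<q", 1_700_000_000_000)
value = struct.unpack("<q", bound)[0]

# If the writer stored milliseconds, the value must be scaled to microseconds;
# if it already stored microseconds, scaling would corrupt it by 1000x.
# Nothing in the manifest says which interpretation is correct.
if_written_as_millis = value * 1_000
if_written_as_micros = value
print(if_written_as_millis, if_written_as_micros)
```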

@rdblue rdblue force-pushed the spec-v3-type-promotion branch from f68559d to da2b721 Compare September 11, 2024 23:07
format/spec.md Outdated

| Added by version | Primitive type | Description | Requirements |
|------------------|--------------------|--------------------------------------------------------------------------|--------------------------------------------------|
| [v3](#version-3) | **`unknown`** | Default / null column type used when a more specific type is not known | Must be optional with `null` defaults |
Member

I think I missed where this was proposed, do we have a doc anywhere explaining the use case for this?

Contributor Author

This is for cases where you create a table from example data, but have only seen null values for a column. The column exists (and should be accessible) but the table doesn't know what the type is yet.

Contributor

does it pay to say this type may not be used for equality deletes, ordering, etc?

Contributor Author

I don't think so. We have well-defined ways to handle null values in those cases.

Contributor

SGTM.

* `int` to `long`
* `float` to `double`
* `decimal(P, S)` to `decimal(P', S)` if `P' > P` -- widen the precision of decimal types.
Iceberg's Avro manifest format does not store the type of lower and upper bounds, and type promotion does not rewrite existing bounds. For example, when a `float` is promoted to `double`, existing data file bounds are encoded as 4 little-endian bytes rather than 8 little-endian bytes for `double`. To correctly decode the value, the original type at the time the file was written must be inferred according to the following table:
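A sketch of how a reader might apply the inference table for the `float` to `double` case; the helper is illustrative, not part of the spec or any Iceberg library:

```python
import struct

def decode_double_bound(raw: bytes) -> float:
    # Per the inference table, a 4-byte bound was written before promotion
    # (original type float); an 8-byte bound was written as double.
    if len(raw) == 4:
        return struct.unpack("<f", raw)[0]
    if len(raw) == 8:
        return struct.unpack("<d", raw)[0]
    raise ValueError(f"unexpected bound length: {len(raw)}")

print(decode_double_bound(struct.pack("<f", 3.5)))  # bound written before promotion
print(decode_double_bound(struct.pack("<d", 3.5)))  # bound written after promotion
```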
Member

We discussed this a little, but I want to double check: we aren't allowing any promotion to variant here, correct? It would basically be impossible to reverse-engineer the original write-time metric, so I assume we are giving up on that. It could be possible in the future if we change the metric layout.

Contributor Author
@rdblue rdblue Sep 27, 2024

Yes, if we want to have variant upper and lower bounds, then we cannot have type promotion from primitive types to variant.

format/spec.md Outdated
| `timestamp_ns` | 8 bytes | `timestamp_ns` |
| `decimal(P, S)` | _any_ | `decimal(P', S)`; `P' <= P` |

Type promotion is not allowed for a field that is referenced by `source-id` or `source-ids` of a partition field if the partition transform would produce a different value after promoting the type. For example, `bucket[N]` produces different hash values for `34` and `"34"` (2017239379 != -427558391) but the same value for `34` and `34L`; when an `int` field is the source for a bucket partition field, it may be promoted to `long` but not to `string`.
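A sketch of the hash equivalence, assuming the Python `mmh3` package (Murmur3 x86 32-bit with seed 0, the function used by `bucket[N]`); the expected hash values are the ones quoted above:

```python
import struct
import mmh3

# int and long values hash identically: both are serialized as an
# 8-byte little-endian long before hashing, so 34 and 34L collide.
hash_int_34 = mmh3.hash(struct.pack("<q", 34))   # 2017239379
hash_long_34 = mmh3.hash(struct.pack("<q", 34))  # same bytes, same hash

# Strings hash their UTF-8 bytes, so "34" lands in a different bucket.
hash_str_34 = mmh3.hash("34".encode("utf-8"))    # -427558391

print(hash_int_34 == hash_long_34, hash_int_34 == hash_str_34)  # True False
```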
Member

I think the example here is now not allowed at all, since you cannot do an `int` -> `string` promotion by definition.

Contributor Author

I agree. There aren't any cases where a type promotion would violate this rule, but I think it is still valuable to add it so that it is clear, now and in future versions, that even a valid type promotion is disallowed if it is used in a transform whose result would change.

Contributor

Commented below, and I could be thinking about this incorrectly, but I think bucket fails for date->timestamp since the hash inputs are different (date appears to hash the day count, and timestamp the microsecond count).

Member
@RussellSpitzer RussellSpitzer left a comment

One other thing to update here is I don't think we want to define equality (or hash) for Variant type so we should exclude Variant from the identity partition transform.

@nastra nastra added this to the Iceberg V3 Spec milestone Sep 25, 2024
@rdblue rdblue force-pushed the spec-v3-type-promotion branch from bd35432 to c8aa57e Compare September 27, 2024 21:22
@rdblue rdblue force-pushed the spec-v3-type-promotion branch from c8aa57e to b23f67b Compare September 27, 2024 21:23
Contributor Author

rdblue commented Sep 27, 2024

@aihuaxu, @RussellSpitzer, I've removed variant from this PR so that we can make progress on the type promotion. Now this just includes unknown and the new promotions. We can make progress on variant separately once the Parquet community standardizes it more.

format/spec.md Outdated
| `timestamp_ns` | 8 bytes | `timestamp_ns` |
| `decimal(P, S)` | _any_ | `decimal(P', S)`; `P' <= P` |

Type promotion is not allowed for a field that is referenced by `source-id` or `source-ids` of a partition field if the partition transform would produce a different value after promoting the type. For example, `bucket[N]` produces different hash values for `34` and `"34"` (2017239379 != -427558391) but the same value for `34` and `34L`; when an `int` field is the source for a bucket partition field, it may be promoted to `long` but not to `string`.
Contributor

Have we analyzed the impact on other places where type IDs are used (i.e. should this be expanded)? For example, equality deletes (I think this is OK, but there might be edge cases for dropped columns) and table-level statistics, since they use theta sketches.

Also, it might pay to spell out the list of type promotion + transform pairings that are explicitly disallowed.

Contributor

There is also a concern for order-by columns if type promotion changes the comparison order (for the current set of transforms proposed, I don't think this is a problem).

Contributor Author

Right now, there are no allowed type promotions (even after this PR) that violate the rule. This is just for future cases (see my reply to @RussellSpitzer).

This concern has been around since the beginning of the format and is why the hash definition requires that ints and longs produce the same hash value. In that time, I've not come across any other areas that have similar requirements. This is the only case because this is related to the correctness of filtering. Other areas that are similar (like the use of transforms in ordering) are not risky because they are not involved in filtering.

Contributor
@emkornfield emkornfield Sep 30, 2024

Doesn't date to timestamp fail for bucket here? I think this necessitates a multiplication, which should change the hash value. Isn't it also a problem for the theta sketch in Puffin files for stats? I guess this is a pre-existing problem since single-value serialization is used, so even int->long would cause some amount of inaccuracy.

(Sorry for the multiple edits.) I think date->timestamp_ns might yield different results depending on how/if overflow for the conversion is handled.
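A short sketch of why the bucket value changes for date -> timestamp (hypothetical day count; same `mmh3`/seed-0 assumption as in the earlier sketch): the hash input switches from a day count to a microsecond count.

```python
import struct
import mmh3

days = 19_000                          # a date, as days from the Unix epoch
micros = days * 86_400 * 1_000_000     # the same day after promotion to timestamp

# Both hash an 8-byte little-endian long, but the value being hashed changes,
# so the same row can land in a different bucket partition.
print(mmh3.hash(struct.pack("<q", days)) == mmh3.hash(struct.pack("<q", micros)))
```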

Contributor

@rdblue sorry for yet another comment here, but I wanted to address:

> This is the only case because this is related to the correctness of filtering. Other areas that are similar (like the use of transforms in ordering) are not risky because they are not involved in filtering.

I'm not sure this is accurate:

  1. For sketches it might be OK to have an estimate that is potentially off by up to 2x, but it should probably be clarified. As you point out, it wouldn't necessarily affect correctness but could in some cases affect performance. I'll follow up on the mailing list about this.
  2. If a file is said to be ordered, then I think it would be reasonable for an engine doing a point lookup on an ordered column (e.g. sorted_col_x = 'some_id') to stop scanning after it encountered a value greater than 'some_id'. So while the ordering doesn't impact metadata filtering, it would violate a contract that some engines might rely on.

The second point falls into the category of things that are not possible with the current set of transforms described in the PR, so it can probably wait.

Contributor Author

Yes, you're right that date to timestamp promotion would hit this case. Good catch (and I'm glad I left this in!).

You're also right that this would be a problem for the theta sketch definition. We should put some extra work in there to mitigate problems like this. When a type is promoted, we need to be able to detect and discard the existing sketch and replace it.

For #2, it sounds like the problem would occur when the ordering changes. Or in other words, if the transformation from one type to another is not monotonic. That's probably a good thing to call out as well, or at least look for when we add new type promotion cases. I don't think any of our existing cases would hit that situation, right? We should be careful about changes that alter the order.

Contributor

> For #2, it sounds like the problem would occur when the ordering changes. Or in other words, if the transformation from one type to another is not monotonic. That's probably a good thing to call out as well, or at least look for when we add new type promotion cases. I don't think any of our existing cases would hit that situation, right? We should be careful about changes that alter the order.

Correct, but the example given (number->string) is not in fact 'monotonic'. If we remove this example and just enumerate that the bucket transform is not valid, that is probably sufficient for this PR.

Contributor Author

Do we really need to change the example? It is demonstrating the point that a bucket function might change even if the value is equivalent. If we change it, then the other example (int to long) no longer makes sense. I think I'd prefer to keep the current examples. I'm not worried about the sorting use case needing to be called out here just because the sort order for that example changes.

Contributor

I guess we don't need to change the example. It might be nice to document the sort order exception as well so it isn't forgotten if the current reviewers aren't around and it doesn't come up. I do think that for the specification we should explicitly call out which transforms fall into this category (i.e. add date->timestamp* for the bucket[N] transform so implementors don't need to figure it out themselves).

Contributor Author

I added the case where this is known to happen.


The `initial-default` and `write-default` produce SQL default value behavior, without rewriting data files. SQL default value behavior when a field is added handles all existing rows as though the rows were written with the new field's default value. Default value changes may only affect future records and all known fields are written into data files. Omitting a known field when writing a data file is never allowed. The write default for a field must be written if a field is not supplied to a write. If the write default for a required field is not set, the writer must fail.

All columns of `unknown` type must default to null. Non-null values for `initial-default` or `write-default` are invalid.
Contributor

I'm a little confused about how physical columns in datafiles should be handled if they are present. If there is a column with a matching field id and values present, should those be materialized as null as well?

Contributor Author

Further down, it states that these columns should be omitted from ORC and Parquet files. Avro can keep the type in the schema, but the serialized format ends up being the same.

Contributor
@emkornfield emkornfield Sep 30, 2024

It probably makes sense to update the column projection rules to specify that this type shouldn't be matched? At least for old data (i.e. migrated from Hive tables), Parquet supports the UNKNOWN logical type as well.

Contributor Author

What do you mean? How would this type be matched in column projection? If we're reading a Parquet file with a matching column ID, then it is invalid because you can't write these columns to Parquet. I don't think that we would specifically ignore that case for an unknown column. Maybe it should be an error?

Contributor

@rdblue I think one could potentially import Parquet files that have an existing column annotated with the logical type "Unknown". I'd expect for these files the inferred schema would be the unknown Iceberg type as well. So they could in theory be read from the files (via column name mapping), but we potentially want to skip them to avoid any complication.

Contributor Author

I've added a note in the Parquet Appendix that Parquet columns that correspond to unknown columns must be ignored and replaced with null. I think that's the cleanest location for this.

@danielcweeks
Contributor

The description still says we're introducing variant, but I assume that's out-of-date and being handled separately?

format/spec.md Outdated
| Primitive type   | v1, v2 valid type promotions | v3+ valid type promotions    | Requirements |
|------------------|------------------------------|------------------------------|--------------|
| `unknown` | | _any type_ | |
| `int` | `long` | `long` | |
| `date` | | `timestamp`, `timestamp_ns` | Promotion to `timestamptz` or `timestamptz_ns` is **not** allowed |
Contributor

Should we specify how overflow should be handled for date->timestamp_ns ?

Contributor Author

Yes, we should. Should it fail or result in null?

Contributor Author

After thinking about this, I'm going to update it to require failure when there is an overflow. That doesn't introduce any issues with required fields, where engines may not be expecting a null value.
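A sketch of the fail-on-overflow behavior, assuming days are counted from the Unix epoch and `timestamp_ns` values must fit in a signed 64-bit integer (which runs out around 2262-04-11):

```python
NANOS_PER_DAY = 86_400 * 1_000_000_000
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1

def date_to_timestamp_ns(days: int) -> int:
    # Fail on overflow rather than producing null or clamping to min/max.
    nanos = days * NANOS_PER_DAY
    if not (INT64_MIN <= nanos <= INT64_MAX):
        raise OverflowError(f"date {days} is out of range for timestamp_ns")
    return nanos

print(date_to_timestamp_ns(19_000))    # well within range
try:
    date_to_timestamp_ns(200_000)      # ~year 2517, past the int64 nanosecond limit
except OverflowError as err:
    print(err)
```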

Contributor

The other option would be to return the MIN/MAX value as appropriate, depending on which side of the range the date falls out of. I agree returning an error is probably the least likely to cause bugs (I haven't thought through all implications of mutating values).

I agree nulls are not viable; in addition to the problem with required columns, they potentially break ordering since they would conflate overflow in both directions into the same class.

Contributor Author

Updated.

format/spec.md Outdated

| Primitive type | Hash specification | Test value |
|--------------------|-------------------------------------------|--------------------------------------------|
| **`unknown`** | `hashInt(0)` | |
Contributor

Sorry I missed this on my prior reviews, but shouldn't unknown be treated as null and return null for any case that requires a hash? (Also, was bucket[N] intended to be defined for 'unknown'? If not, then I think the only other case is Puffin files, which are today always defined as the hash of the values.)

Contributor Author

Yes, you're right. I'll fix it.

Contributor Author

Done.

Contributor
@emkornfield emkornfield left a comment

Thanks for addressing my feedback.

Contributor Author

rdblue commented Oct 9, 2024

Thanks for the reviews!

@rdblue rdblue merged commit 67dc9e5 into apache:main Oct 9, 2024
2 checks passed
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024