Spec: Add v3 types and type promotion #10955

Conversation
format/spec.md (outdated)
| `unknown` |        | _any type_                  |                                                                   |
| `int`     | `long` | `long`, `string`            |                                                                   |
| `long`    |        | `string`                    |                                                                   |
| `date`    |        | `timestamp`, `timestamp_ns` | Promotion to `timestamptz` or `timestamptz_ns` is **not** allowed |
How about long => timestamp/timestamp_ns?
We wanted to do this (although it is closer to timestamptz) but that hits the problem covered by the next paragraph and table. You can't tell whether a lower/upper bound value was written as a timestamptz or a long. If both values are in microseconds, we'd be fine. But most people store timestamps as long in milliseconds. That's what we were hoping to support with long to timestamp promotion, but it won't work until we have more information about the original type of the bounds.
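To make the ambiguity concrete, here is a minimal sketch (assuming Python's `struct` module; the variable names are illustrative, not Iceberg APIs). A serialized bound carries no type tag, so a value written as a long holding milliseconds cannot be distinguished from one written as a timestamp in microseconds:

```python
import struct

# A bound is stored as 8 little-endian bytes with no record of the writer's type.
bound = struct.pack("<q", 1_700_000_000_000)  # written as a long holding milliseconds

raw = struct.unpack("<q", bound)[0]

# Two plausible decodings of the same bytes after promotion to timestamp:
as_timestamp_micros = raw          # correct only if the writer stored microseconds
as_millis_promoted = raw * 1_000   # correct only if the writer stored milliseconds

# Nothing in the manifest says which interpretation is right, so the two
# readings differ by a factor of 1000.
print(as_timestamp_micros, as_millis_promoted)
```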
Force-pushed f68559d to da2b721
format/spec.md (outdated)
| Added by version | Primitive type | Description                                                             | Requirements                          |
|------------------|----------------|-------------------------------------------------------------------------|---------------------------------------|
| [v3](#version-3) | **`unknown`**  | Default / null column type used when a more specific type is not known | Must be optional with `null` defaults |
I think I missed where this was proposed, do we have a doc anywhere explaining the use case for this?
This is for cases where you create a table from example data, but have only seen null values for a column. The column exists (and should be accessible) but the table doesn't know what the type is yet.
does it pay to say this type may not be used for equality deletes, ordering, etc?
I don't think so. We have well-defined ways to handle null values in those cases.
SGTM.
* `int` to `long`
* `float` to `double`
* `decimal(P, S)` to `decimal(P', S)` if `P' > P` -- widen the precision of decimal types.

Iceberg's Avro manifest format does not store the type of lower and upper bounds, and type promotion does not rewrite existing bounds. For example, when a `float` is promoted to `double`, existing data file bounds are encoded as 4 little-endian bytes rather than 8 little-endian bytes for `double`. To correctly decode the value, the original type at the time the file was written must be inferred according to the following table:
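As an illustration of the inference rule quoted above, a minimal decoding sketch (the function name is hypothetical, not from any Iceberg implementation):

```python
import struct

def decode_promoted_double_bound(raw: bytes) -> float:
    """Decode a bound for a column promoted from float to double.

    The manifest does not record the bound's original type, so it is
    inferred from the serialized byte length, per the quoted rule.
    """
    if len(raw) == 4:
        return struct.unpack("<f", raw)[0]  # written before promotion, as a float
    if len(raw) == 8:
        return struct.unpack("<d", raw)[0]  # written after promotion, as a double
    raise ValueError(f"unexpected bound length: {len(raw)}")

# A pre-promotion float bound and a post-promotion double bound both decode:
print(decode_promoted_double_bound(struct.pack("<f", 1.5)))  # 1.5
print(decode_promoted_double_bound(struct.pack("<d", 1.5)))  # 1.5
```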
We discussed this a little, but I want to double check. We aren't allowing any promotion to variant here, correct? It would basically be impossible to reverse-engineer the original write-time metric, so I assume we are giving up on that. It could be possible in the future if we change the metric layout.
Yes, if we want to have variant upper and lower bounds, then we cannot have type promotion from primitive types to variant.
format/spec.md (outdated)
| `timestamp_ns`  | 8 bytes | `timestamp_ns`              |
| `decimal(P, S)` | _any_   | `decimal(P', S)`; `P' <= P` |

Type promotion is not allowed for a field that is referenced by `source-id` or `source-ids` of a partition field if the partition transform would produce a different value after promoting the type. For example, `bucket[N]` produces different hash values for `34` and `"34"` (2017239379 != -427558391) but the same value for `34` and `34L`; when an `int` field is the source for a bucket partition field, it may be promoted to `long` but not to `string`.
I think the example here is now not allowed at all since you cannot do an int -> string by definition.
I agree. There aren't any cases where a type promotion would violate this rule, but I think it is still valuable to add it so that it is clear, now and in future versions, that even if a type promotion is valid, it is disallowed if it is used in a transform whose output would change.
Commented below, and I could be thinking about this incorrectly, but I think bucket fails for date->timestamp since the hash functions are different (date appears to hash the day count, and timestamp the microsecond count).
RussellSpitzer left a comment:
One other thing to update here: I don't think we want to define equality (or hash) for the Variant type, so we should exclude Variant from the identity partition transform.
Force-pushed bd35432 to c8aa57e
Force-pushed c8aa57e to b23f67b
@aihuaxu, @RussellSpitzer, I've removed variant from this PR so that we can make progress on the type promotion. Now this just includes `unknown`.
format/spec.md (outdated)
| `timestamp_ns`  | 8 bytes | `timestamp_ns`              |
| `decimal(P, S)` | _any_   | `decimal(P', S)`; `P' <= P` |

Type promotion is not allowed for a field that is referenced by `source-id` or `source-ids` of a partition field if the partition transform would produce a different value after promoting the type. For example, `bucket[N]` produces different hash values for `34` and `"34"` (2017239379 != -427558391) but the same value for `34` and `34L`; when an `int` field is the source for a bucket partition field, it may be promoted to `long` but not to `string`.
Have we analyzed the impact on other places type IDs are used (i.e., should this be expanded)? For example, equality deletes (I think this is OK but there might be edge cases for dropped columns) and table-level statistics, since they use theta sketches.
Also, it might pay to spell out the list of type promotion + transform pairings that are explicitly disallowed.
There is also a concern for order-by columns if the comparison order changes under type promotion (for the current set of transforms proposed, I don't think this is a problem).
Right now, there are no allowed type promotions (even after this PR) that violate the rule. This is just for future cases (see my reply to @RussellSpitzer).
This concern has been around since the beginning of the format and is why the hash definition requires that ints and longs produce the same hash value. In that time, I've not come across any other areas that have similar requirements. This is the only case because this is related to the correctness of filtering. Other areas that are similar (like the use of transforms in ordering) are not risky because they are not involved in filtering.
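To make this concrete, a quick sketch using the spec's hash definition (32-bit Murmur3, x86 variant, seed 0); the `mmh3` package is an assumption here, and any Murmur3_x86_32 implementation would do:

```python
import struct
import mmh3  # assumed package implementing 32-bit Murmur3 (x86 variant), seed 0

# hashInt and hashLong both hash the value as 8 little-endian bytes, so 34 and
# 34L produce the same bucket and int -> long promotion is safe under bucket[N]:
print(mmh3.hash(struct.pack("<q", 34)))   # 2017239379, the spec's test value for int 34

# hashString hashes the UTF-8 bytes instead, so "34" lands in a different bucket:
print(mmh3.hash(b"34"))                   # -427558391, per the example above

# date hashes the epoch-day count while timestamp hashes microseconds, so the
# same instant buckets differently after a date -> timestamp promotion:
days = 20_000
print(mmh3.hash(struct.pack("<q", days)))                   # hash of the day count
print(mmh3.hash(struct.pack("<q", days * 86_400_000_000)))  # a different hash
```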
Doesn't date to timestamp fail for bucket here? I think this necessitates a multiplication, which should change the hash value. Isn't this also a problem for the theta sketches in Puffin files for stats? I guess this is a pre-existing problem since single-value serialization is used, so even int->long would cause some amount of inaccuracy.
(sorry for the multiple edits) I think date->timestamp_ns might yield different results depending on how/if overflow for the conversion is handled.
@rdblue, sorry for yet another comment here, but I wanted to address:
This is the only case because this is related to the correctness of filtering. Other areas that are similar (like the use of transforms in ordering) are not risky because they are not involved in filtering.
I'm not sure this is accurate:
- For sketches it might be OK to potentially have an estimate that is off by up to 2x, but it should probably be clarified. As you point out, it wouldn't necessarily affect correctness but could in some cases affect performance. I'll follow up on the mailing list about this.
- If a file is said to be ordered, then I think it would be reasonable for an engine doing a point lookup on an ordered column (e.g. `sorted_col_x = 'some_id'`) to stop scanning after it encountered a value greater than 'some_id'. So while the ordering doesn't impact metadata filtering, it would violate a contract that some engines might rely on.
The second point falls into the category of things not possible with the current set of transforms described in the PR, so it can probably wait.
Yes, you're right that date to timestamp promotion would hit this case. Good catch (and I'm glad I left this in!).
You're also right that this would be a problem for the theta sketch definition. We should put some extra work in there to mitigate problems like this. When a type is promoted, we need to be able to detect and discard the existing sketch and replace it.
For #2, it sounds like the problem would occur when the ordering changes. Or in other words, if the transformation from one type to another is not monotonic. That's probably a good thing to call out as well, or at least look for when we add new type promotion cases. I don't think any of our existing cases would hit that situation, right? We should be careful about changes that alter the order.
For #2, it sounds like the problem would occur when the ordering changes. Or in other words, if the transformation from one type to another is not monotonic. That's probably a good thing to call out as well, or at least look for when we add new type promotion cases. I don't think any of our existing cases would hit that situation, right? We should be careful about changes that alter the order.
Correct, but the example given (number->string) is not in fact 'monotonic'. If we remove this example and just note that the bucket transform is not valid, that is probably sufficient for this PR.
Do we really need to change the example? It is demonstrating the point that a bucket function might change even if the value is equivalent. If we change it, then the other example (int to long) no longer makes sense. I think I'd prefer to keep the current examples. I'm not worried about the sorting use case needing to be called out here just because the sort order for that example changes.
I guess we don't need to change the example. It might be nice to document the sort order exception as well so it isn't forgotten if current reviewers aren't around and it doesn't come up. I do think for the specification we should explicitly call out which transforms fall into this category (i.e. add date->timestamp* for the bucket[N] transform so implementors don't need to figure it out themselves).
I added the case where this is known to happen.
The `initial-default` and `write-default` produce SQL default value behavior, without rewriting data files. SQL default value behavior when a field is added handles all existing rows as though the rows were written with the new field's default value. Default value changes may only affect future records and all known fields are written into data files. Omitting a known field when writing a data file is never allowed. The write default for a field must be written if a field is not supplied to a write. If the write default for a required field is not set, the writer must fail.
All columns of `unknown` type must default to null. Non-null values for `initial-default` or `write-default` are invalid.
I'm a little confused about how physical columns in datafiles should be handled if they are present. If there is a column with a matching field id and values present, should those be materialized as null as well?
Further down, it states that these columns should be omitted from ORC and Parquet files. Avro can keep the type in the schema, but the serialized format ends up being the same.
It probably makes sense to update the column projection rules to specify this type shouldn't be matched? At least for old data (i.e. migrated from Hive tables), Parquet supports UNKNOWN logical types as well.
What do you mean? How would this type be matched in column projection? If we're reading a Parquet file with a matching column ID, then it is invalid because you can't write these columns to Parquet. I don't think that we would specifically ignore that case for an unknown column. Maybe it should be an error?
@rdblue I think one could potentially import Parquet files that have an existing column annotated with the UNKNOWN logical type. I'd expect the inferred schema for these files to use the Iceberg unknown type as well. So they could in theory be read from the files (via column name mapping), but we potentially want to skip them to avoid any complication.
I've added a note in the Parquet Appendix that Parquet columns that correspond to unknown columns must be ignored and replaced with null. I think that's the cleanest location for this.
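As a rough sketch of the projection behavior just described (all names here are illustrative; this is not Iceberg reader code), a reader ignores any physical Parquet data for a field whose Iceberg type is `unknown` and materializes nulls instead:

```python
def project_column(file_columns: dict, field_id: int, field_type: str,
                   num_rows: int) -> list:
    # Per the note added to the Parquet appendix: physical data for a column
    # whose Iceberg type is `unknown` is ignored and replaced with nulls.
    if field_type == "unknown":
        return [None] * num_rows
    return file_columns[field_id]

# A file column with matching field id 1 is still read back as all nulls:
print(project_column({1: [3, 4, 5]}, field_id=1, field_type="unknown", num_rows=3))
# [None, None, None]
```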
The description still says we're introducing variant, but I assume that's out-of-date and being handled separately?
format/spec.md (outdated)
| `unknown` |        | _any type_                  |                                                                   |
| `int`     | `long` | `long`                      |                                                                   |
| `date`    |        | `timestamp`, `timestamp_ns` | Promotion to `timestamptz` or `timestamptz_ns` is **not** allowed |
Should we specify how overflow should be handled for date->timestamp_ns?
Yes, we should. Should it fail or result in null?
After thinking about this, I'm going to update it to require failure when there is an overflow. That doesn't introduce any issues with required fields, where engines may not be expecting a null value.
The other option would be to return the MIN/MAX value, as appropriate, depending on which side the date is out of range. I agree returning an error is probably the least likely to cause bugs (I haven't thought through all the implications of mutating values).
I agree nulls are not viable; in addition to the issue with required columns, they potentially break ordering since they would conflate overflow in both directions into the same class.
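For context, a small sketch of where the overflow actually bites, assuming the usual representations (a date as days since 1970-01-01, `timestamp_ns` as a 64-bit nanosecond count); the int64 range runs out around the year 2262:

```python
NANOS_PER_DAY = 86_400 * 10**9
MAX_DAYS = (2**63 - 1) // NANOS_PER_DAY  # 106751 days, roughly 2262-04-11

def date_to_timestamp_ns(days: int) -> int:
    # Per the resolution above, conversion must fail on overflow rather than
    # return null or clamp to MIN/MAX.
    if not -MAX_DAYS <= days <= MAX_DAYS:
        raise OverflowError(f"date {days} does not fit in timestamp_ns")
    return days * NANOS_PER_DAY

print(date_to_timestamp_ns(20_000))  # 1728000000000000000
# date_to_timestamp_ns(200_000)      # raises OverflowError
```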
Updated.
format/spec.md (outdated)
| Primitive type | Hash specification | Test value |
|----------------|--------------------|------------|
| **`unknown`**  | `hashInt(0)`       |            |
Sorry I missed this in my prior reviews, but shouldn't unknown be treated as null and return null for any cases that require a hash? (Also, was bucket[N] intended to be defined for `unknown`? If not, then I think the only other case is Puffin files, which are today always defined as the hash of the values.)
Yes, you're right. I'll fix it.
Done.
emkornfield left a comment:
Thanks for addressing my feedback.
Thanks for the reviews!
This PR includes the changes to v3 for type promotion. It also includes some changes to introduce `unknown` and `variant` types to show what the type promotion language will be for those types. I'll remove the type additions before merging. We probably want those in separate commits.