-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spec: Make NDV blob metadata property required #10549
Spec: Make NDV blob metadata property required #10549
Conversation
cc @szehon-ho @singhpk234 @karuppayya I'm also going to send out an email on the mailing list since this is a spec change |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make the properties section in the BlobMetadata now as mandatory as well as now this key i,e ndv
would be present and hence properties
I don't think so, since this is particular to the theta sketch blob type. Technically in the future there can be another blob type which doesn't really require any properties? |
sounds fair ! |
@@ -119,7 +119,7 @@ DataSketches](https://datasketches.apache.org/) library. The sketch is obtained | |||
constructing Alpha family sketch with default seed, and feeding it with individual | |||
distinct values converted to bytes using Iceberg's single-value serialization. | |||
|
|||
The blob metadata for this blob may include following properties: | |||
The blob metadata for this blob must include the following properties: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: properties -> property or inline it
Feel free to ignore since it was existing already
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let see if there's any concern from mailing list
Hey everyone, sorry for the delay just want to make sure we're following the standard improvement proposal process here: 1.) https://iceberg.apache.org/contribute/#what-is-an-improvement-proposal Going through the above 2 guidelines, this should go through a formal vote on the mailing list before merging since it's a spec change. Also, even though this change can be considered trivial because Puffin is relatively new, Puffin V1 was still technically voted on and agreed upon in the past. Especially since this is an optional->required change, it'd probably be best if this change was made as part of a Puffin V2. In hindsight, I should've thought of this, I mainly thought since it was trivial/not as adopted it'd be OK but we probably don't want to have these 1 off cases in our decision making in the community since then it becomes ambiguous when it's OK or not. What this would mean for the original PR #10288 is that it should respect the spec as is it is today and have some fallback to read the sketch in case the NDV property does not exist. Other engines should do this as well to really respect the spec as it is today. I'm going to send these thoughts on the mailing list discuss thread cc @findepi @karuppayya @singhpk234 @szehon-ho @Fokko @ajantha-bhat |
Puffin fole format has place for versioning within the magic, but the Puffin format doesn't change, only its use changes. Puffin spec is not authoritative source of information of all blob types contained within Puffin files, it's open for extension. And thus, Iceberg may decide that whenever |
@findepi I think I largely agree with you that a Puffin V2 is probably too broad for this since we're not even changing the metadata; we can granularly update the theta sketch definition, since one of the goals for Puffin was to evolve blobs independently without having to change the Puffin spec. I missed that the sketch is actually already versioned with a "v1" in
I don't think I fully agree with this in that, I don't think we can just directly change a blob type without going through an upgrade of the blob type. The spec has defined this blob type I think I'd also advocate for making these granularities on versioning in Puffin more clear in the spec (so that future changes for others can be more smooth). |
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Based on discussion stemming from: https://github.com/apache/iceberg/pull/10288/files#discussion_r1622775896
The current consensus is on making the NDV property in blob metadata required for theta sketch blob types. The main advantage is that engines don't have to deserialize the sketch and compute the NDV themselves. Furthermore, there's not really any extra lift to store it in properties from writers; when engines write out the theta sketch, they'll already be able to compute the NDV and store it in properties.