Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spec: Make NDV blob metadata property required #10549

Conversation

amogh-jahagirdar
Copy link
Contributor

Based on discussion stemming from: https://github.com/apache/iceberg/pull/10288/files#discussion_r1622775896

The current consensus is on making the NDV property in blob metadata required for theta sketch blob types. The main advantage is that engines don't have to deserialize the sketch and compute the NDV themselves. Furthermore, there's not really any extra lift to store it in properties from writers; when engines write out the theta sketch, they'll already be able to compute the NDV and store it in properties.

@amogh-jahagirdar amogh-jahagirdar requested a review from findepi June 21, 2024 21:25
@github-actions github-actions bot added the Specification Issues that may introduce spec changes. label Jun 21, 2024
@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Jun 21, 2024

cc @szehon-ho @singhpk234 @karuppayya

I'm also going to send out an email on the mailing list since this is a spec change

Copy link
Contributor

@singhpk234 singhpk234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make the properties section in the BlobMetadata now as mandatory as well as now this key i,e ndv would be present and hence properties

@amogh-jahagirdar
Copy link
Contributor Author

Should we make the properties section in the BlobMetadata now as mandatory as well as now this key i,e ndv would be present and hence properties

I don't think so, since this is particular to the theta sketch blob type. Technically in the future there can be another blob type which doesn't really require any properties?

@singhpk234
Copy link
Contributor

sounds fair !

@@ -119,7 +119,7 @@ DataSketches](https://datasketches.apache.org/) library. The sketch is obtained
constructing Alpha family sketch with default seed, and feeding it with individual
distinct values converted to bytes using Iceberg's single-value serialization.

The blob metadata for this blob may include following properties:
The blob metadata for this blob must include the following properties:
Copy link
Contributor

@karuppayya karuppayya Jun 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: properties -> property or inline it
Feel free to ignore since it was existing already

Copy link
Collaborator

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let see if there's any concern from mailing list

@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Jul 5, 2024

Hey everyone, sorry for the delay just want to make sure we're following the standard improvement proposal process here:

1.) https://iceberg.apache.org/contribute/#what-is-an-improvement-proposal
2.) https://iceberg.apache.org/contribute/#how-are-proposals-adopted

Going through the above 2 guidelines, this should go through a formal vote on the mailing list before merging since it's a spec change.

Also, even though this change can be considered trivial because Puffin is relatively new, Puffin V1 was still technically voted on and agreed upon in the past. Especially since this is an optional->required change, it'd probably be best if this change was made as part of a Puffin V2. In hindsight, I should've thought of this, I mainly thought since it was trivial/not as adopted it'd be OK but we probably don't want to have these 1 off cases in our decision making in the community since then it becomes ambiguous when it's OK or not.

What this would mean for the original PR #10288 is that it should respect the spec as is it is today and have some fallback to read the sketch in case the NDV property does not exist. Other engines should do this as well to really respect the spec as it is today.

I'm going to send these thoughts on the mailing list discuss thread cc @findepi @karuppayya @singhpk234 @szehon-ho @Fokko @ajantha-bhat

@findepi
Copy link
Member

findepi commented Jul 5, 2024

Puffin fole format has place for versioning within the magic, but the Puffin format doesn't change, only its use changes. Puffin spec is not authoritative source of information of all blob types contained within Puffin files, it's open for extension. And thus, Iceberg may decide that whenever apache-datasketches-theta-v1 blob type is used in Puffin, the blob can be, should be or must be associated with ndv property.

@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Jul 5, 2024

@findepi I think I largely agree with you that a Puffin V2 is probably too broad for this since we're not even changing the metadata; we can granularly update the theta sketch definition, since one of the goals for Puffin was to evolve blobs independently without having to change the Puffin spec. I missed that the sketch is actually already versioned with a "v1" in apache-datasketches-theta-v1

And thus, Iceberg may decide that whenever apache-datasketches-theta-v1 blob type is used in Puffin, the blob can be, should be or must be associated with ndv property.

I don't think I fully agree with this in that, I don't think we can just directly change a blob type without going through an upgrade of the blob type. The spec has defined this blob type apache-datasketches-theta-v1, and I consider it as an official type since it's defined with what needs to be stored. I think a apache-datasketches-theta-v2 is more appropriate since we're making an optional field required after the blob type was already voted upon.

I think I'd also advocate for making these granularities on versioning in Puffin more clear in the spec (so that future changes for others can be more smooth).

Copy link

github-actions bot commented Nov 7, 2024

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Nov 7, 2024
Copy link

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Nov 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Specification Issues that may introduce spec changes. stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants