Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Puffin: Document stats ndv value representation #10793

Conversation

findepi
Copy link
Member

@findepi findepi commented Jul 27, 2024

Follows discussion #10288 (comment)

@github-actions github-actions bot added the Specification Issues that may introduce spec changes. label Jul 27, 2024
@findepi findepi requested a review from szehon-ho July 27, 2024 20:42
@findepi
Copy link
Member Author

findepi commented Jul 27, 2024

cc @karuppayya

Copy link
Member

@jbonofre jbonofre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good and useful details to me. Thanks !

Copy link
Collaborator

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, but does this mean an integer value like 2 now becomes 2.0? (if using java toString)

And in any case, as its not entirely backward compat, should we update the theta sketch version again? Maybe can bundle with the other pr of @amogh-jahagirdar : #10549

@@ -121,7 +121,9 @@ distinct values converted to bytes using Iceberg's single-value serialization.

The blob metadata for this blob may include following properties:

- `ndv`: estimate of number of distinct values, derived from the sketch.
- `ndv`: estimate of number of distinct values, derived from the sketch,
stored as non-negative integer value represented using decimal digits
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 'integer' necessary? Not sure if its just me, but does it add confusion (i initially interpret it to mean 'number without decimal point')?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i actually used "integer" in "whole number without decimal point" meaning (and not eg as java integer, which is 32-bit integer value).
what's the best way to say this in English?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I misunderstood the pr, 'decimal' threw me off.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i understand how words "integer" and "decimal" invite for misunderstanding. what would be a better way to write this?

Copy link
Collaborator

@szehon-ho szehon-ho Jul 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

javadoc for Integer says 'decimal representation', maybe that? https://docs.oracle.com/javase/8/docs/api/java/lang/Integer.html#toString-int-

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something like "non-negative integer estimate of number of distinct values derived from the sketch, stored as a string using decimal representation"

@findepi
Copy link
Member Author

findepi commented Jul 29, 2024

I am not sure, but does this mean an integer value like 2 now becomes 2.0? (if using java toString)

no, this should be "2"

And in any case, as its not entirely backward compat, should we update the theta sketch version again?

this is supposed to be a clarification, not a change.

@szehon-ho
Copy link
Collaborator

@findepi Got it sorry i misinterpreted this pr to support double as per #10288 (comment) . This pr makes more sense then.

I think we should still do this for completness, but as theta-sketch-v2?

Copy link
Collaborator

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pr looks good to me, we can have an discuss long->double enhancements in another pr then.

@findepi
Copy link
Member Author

findepi commented Jul 29, 2024

misinterpreted this pr to support double as per #10288 (comment) .

sorry for the confusion!
in that PR review thread i was trying not to prescribe which way we should go and just outline consequences of using doubles
clarifying the spec to imply long values is just one of the ways and this is what this PR is doing.
clarifying the spec to imply fractional values is some other option. (I am slightly less in favor of it, so i didn't create a PR with it.)

I think we should still do this for completness, but as theta-sketch-v2?

The wording used for apache-datasketches-theta-v1 should have been better and clearly define both: allowed values and their representation. I.e. it was not specified what values are allowed and how they are represented.
This PR should not be understood as prohibiting fractional values. Rather, it should be seen as clarification of what's allowed and what's not, and as such shouldn't need a new version of the sketch.

@findepi findepi requested review from rdblue, Fokko and amogh-jahagirdar and removed request for rdblue and amogh-jahagirdar July 29, 2024 21:01
@szehon-ho
Copy link
Collaborator

szehon-ho commented Jul 29, 2024

Yes this pr as is should not require a spec change.

The wording used for apache-datasketches-theta-v1 should have been better and clearly define both: allowed values and their representation. I.e. it was not specified what values are allowed and how they are represented.
This PR should not be understood as prohibiting fractional values. Rather, it should be seen as clarification of what's allowed and what's not, and as such shouldn't need a new version of the sketch.

Sorry I am still confused :( , this pr currently prohibits fractional value in this particular metadata "ndv" doesnt it, with the phrase "integer value"?

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as @szehon-ho , by defining this as an integer we are necessarily saying it cannot be a fractional value.

I actually think that's ok and makes sense (what does it mean to have a fractional value for an NDV), since really this is just a clarification of a field that I don't think we should be worried about since I admittedly find it a difficult to believe if users were writing fractional values for this in the first place.

IMO this is different then the other case where we were proposing making the NDV field required since there it is arguable the structure of the metadata is materially different from the original spec.

@szehon-ho
Copy link
Collaborator

szehon-ho commented Jul 30, 2024

Thanks @amogh-jahagirdar. I guess I need to give the context. In #10288 (comment) we realize that in fact ndv as defined by theta-sketch algorithm and java library is a double, and the fact that we have stored it is a long in Trino/Spark PR means some precision is missing.

I am actually more in favor of making it a double to keep consistent with the algorithm, @findepi mention it is not too significant in the long run and favors keeping it a long. But above all, converting now from long to double in trino side is backward incompatible. Hence, was hoping that we can bundle this together with the bump to v2 in #10549 to allow decimal here.

@@ -121,7 +121,9 @@ distinct values converted to bytes using Iceberg's single-value serialization.

The blob metadata for this blob may include following properties:

- `ndv`: estimate of number of distinct values, derived from the sketch.
- `ndv`: estimate of number of distinct values, derived from the sketch,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps providing examples of allowed and not allowed values would also help with the clarification

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe a grammar?

@findepi
Copy link
Member Author

findepi commented Aug 1, 2024

I am actually more in favor of making it a double to keep consistent with the algorithm

@szehon-ho i am totally fine with that approach too. We would need to define the string representation of the double value.
Would you want to shape a PR with appropriate wording?

@findepi findepi changed the title Document stats ndv value representation Puffin: Document stats ndv value representation Oct 3, 2024
@findepi
Copy link
Member Author

findepi commented Oct 3, 2024

It seems I forgot about the PR and the discussion here died. I believe it's better to merge as is than not to merge at all.
@szehon-ho your insights were good, sorry i wasn't able to fully complete them. Will you be able to follow up?

@findepi findepi merged commit 4099a67 into apache:main Oct 3, 2024
2 checks passed
@findepi findepi deleted the findepi/document-stats-ndv-value-representation-a61d1d branch October 3, 2024 10:49
@szehon-ho
Copy link
Collaborator

Sure no problem! Yea ive been a bit busy, let me start a thread on the devlist when i get a chance

zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Specification Issues that may introduce spec changes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants