Puffin: Document stats `ndv` value representation #10793
cc @karuppayya
Good and useful details to me. Thanks!
I am not sure, but does this mean an integer value like 2 now becomes 2.0 (if using Java toString)?
And in any case, as it's not entirely backward compatible, should we update the theta sketch version again? Maybe we can bundle it with the other PR from @amogh-jahagirdar: #10549
@@ -121,7 +121,9 @@ distinct values converted to bytes using Iceberg's single-value serialization.

 The blob metadata for this blob may include following properties:

-- `ndv`: estimate of number of distinct values, derived from the sketch.
+- `ndv`: estimate of number of distinct values, derived from the sketch,
+  stored as non-negative integer value represented using decimal digits
Is 'integer' necessary? Not sure if it's just me, but does it add confusion? (I initially interpreted it to mean 'number without decimal point'.)
I actually used "integer" in the "whole number without decimal point" sense (and not, e.g., as a Java integer, which is a 32-bit integer value).
What's the best way to say this in English?
Sorry, I misunderstood the PR; 'decimal' threw me off.
I understand how the words "integer" and "decimal" invite misunderstanding. What would be a better way to write this?
The Javadoc for Integer says 'decimal representation', maybe that? https://docs.oracle.com/javase/8/docs/api/java/lang/Integer.html#toString-int-
Maybe something like "non-negative integer estimate of number of distinct values derived from the sketch, stored as a string using decimal representation"
No, this should be "2".
This is supposed to be a clarification, not a change.
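To make the "2" versus "2.0" distinction concrete, here is a minimal Java sketch of how the two candidate types render through `toString`. The class and method names (`NdvToString`, `fromLong`, `fromDouble`) are made up for illustration; only `Long.toString` and `Double.toString` are standard library calls.

```java
public class NdvToString {
    // Renders an ndv held as a long: plain decimal digits, no decimal point.
    static String fromLong(long ndv) {
        return Long.toString(ndv);
    }

    // Renders an ndv held as a double: for finite values Java always emits
    // at least one digit after the decimal point.
    static String fromDouble(double ndv) {
        return Double.toString(ndv);
    }

    public static void main(String[] args) {
        System.out.println(fromLong(2L));     // prints 2
        System.out.println(fromDouble(2.0));  // prints 2.0
    }
}
```

A reader that expects only decimal digits would accept the first form and reject the second, which is why the choice of in-memory type matters for the stored string.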
@findepi Got it, sorry, I misinterpreted this PR as adding support for double as per #10288 (comment). This PR makes more sense then. I think we should still do this for completeness, but as theta-sketch-v2?
This PR looks good to me; we can discuss long->double enhancements in another PR then.
Sorry for the confusion!
The wording used for |
Yes, this PR as is should not require a spec change.
Sorry, I am still confused :( This PR currently prohibits fractional values in this particular metadata "ndv", doesn't it, with the phrase "integer value"?
Same question as @szehon-ho: by defining this as an integer, we are necessarily saying it cannot be a fractional value.
I actually think that's OK and makes sense (what does it mean to have a fractional value for an NDV?). Really this is just a clarification of a field, and I don't think we should be worried about it, since I admittedly find it difficult to believe that users were writing fractional values for this in the first place.
IMO this is different than the other case, where we were proposing making the NDV field required, since there it is arguable that the structure of the metadata is materially different from the original spec.
Thanks @amogh-jahagirdar. I guess I need to give the context. In #10288 (comment) we realized that ndv, as defined by the theta-sketch algorithm and Java library, is in fact a double, and the fact that we have stored it as a long in the Trino/Spark PR means some precision is missing. I am actually more in favor of making it a double to stay consistent with the algorithm; @findepi mentioned it is not too significant in the long run and favors keeping it a long. But above all, converting now from long to double on the Trino side is backward incompatible. Hence, I was hoping that we could bundle this together with the bump to v2 in #10549 to allow decimals here.
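The precision point can be illustrated with a small Java sketch. The estimate value below is invented for illustration, and rounding with `Math.round` is only an assumption about how a writer might convert a double estimate to a long; the actual conversion used in the Trino and Spark code is not shown in this thread. The names `NdvPrecision` and `toLongNdv` are hypothetical.

```java
public class NdvPrecision {
    // Converts a double estimate to a long by rounding to the nearest whole
    // number (an assumed conversion; truncation would be another option).
    static long toLongNdv(double estimate) {
        return Math.round(estimate);
    }

    public static void main(String[] args) {
        double estimate = 1234.56; // hypothetical sketch estimate
        long stored = toLongNdv(estimate);
        System.out.println(Long.toString(stored));     // prints 1235
        System.out.println(Double.toString(estimate)); // prints 1234.56
    }
}
```

The long form drops the fractional part of the estimate, while the double form would need a string representation that existing readers expecting only digits could not parse.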
@@ -121,7 +121,9 @@ distinct values converted to bytes using Iceberg's single-value serialization.

 The blob metadata for this blob may include following properties:

-- `ndv`: estimate of number of distinct values, derived from the sketch.
+- `ndv`: estimate of number of distinct values, derived from the sketch,
Perhaps providing examples of allowed and not allowed values would also help with the clarification
Or maybe a grammar?
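As a rough illustration of what examples or a grammar could look like, here is a minimal Java sketch that checks a candidate `ndv` string against one possible grammar: one or more ASCII decimal digits. The grammar itself is an assumption drawn from the phrase "non-negative integer value represented using decimal digits"; whether leading zeros are permitted is not settled in this discussion, and this sketch accepts them. The names `NdvValue` and `isValid` are hypothetical.

```java
import java.util.regex.Pattern;

public class NdvValue {
    // Assumed grammar: ndv := DIGIT+ , where DIGIT is an ASCII digit 0-9.
    // No sign, no decimal point, no exponent, no surrounding whitespace.
    private static final Pattern NDV = Pattern.compile("[0-9]+");

    static boolean isValid(String value) {
        return value != null && NDV.matcher(value).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"0", "2", "12345", "2.0", "-1", "1e3", " 2", ""};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + isValid(s));
        }
    }
}
```

Under this assumed grammar, "0", "2", and "12345" are accepted, while "2.0", "-1", "1e3", " 2", and the empty string are rejected.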
@szehon-ho I am totally fine with that approach too. We would need to define the string representation of the double value.
It seems I forgot about the PR and the discussion here died. I believe it's better to merge as is than not to merge at all.
Sure, no problem! Yeah, I've been a bit busy; let me start a thread on the dev list when I get a chance.
Follows discussion #10288 (comment)