It is not reasonable for rand_distr::Zipf to return floating-point values #1323
Comments
Sorry for not responding sooner. Since our implementation is built on floating-point arithmetic, a different approach would be required. However...
|
I agree that simply casting the returned I think you could refer to the implementation in NumPy, which also seems to be built on a floating-point algorithm but returns integers: https://github.com/numpy/numpy/blob/3032e84ff34f20def2ef4ebf9f8695947af3fd24/numpy/random/src/distributions/distributions.c#L1000 I don't know how they guarantee that the returned integer won't be out of bounds, though. There is no boundary check. Maybe math magic? But I think it would be more robust for Update: I just found that the
|
The first possible error is that we convert Then we should choose what output type to support. It may be better to implement Then, as you say, test that the output is in fact less than I can work on this if there are no other takers, but it isn't high on my priority list. |
Let's think twice about this. I don't know much about Zipfian distribution, but in my impression, if
I don't know what |
In this case, resample. Intuitively your suggestion to ignore loss-of-precision-on-creation and to clamp the result makes sense, but if By the way, I notice that output is in the range https://play.rust-lang.org/?version=stable&mode=debug&edition=2021 (So, yes, I think we should follow your suggestion and clamp the output to |
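A minimal sketch of the "resample" idea, assuming the sampler hands back an `f64` that should represent an integer rank in `1..=n`. The function name `resample_until_in_range` and the `draw` closure are hypothetical stand-ins for illustration and are not part of `rand_distr`:

```rust
/// Hypothetical helper (not a rand_distr API): keep drawing until the
/// floating-point sample maps to an integer rank in 1..=n.
fn resample_until_in_range(mut draw: impl FnMut() -> f64, n: u64) -> u64 {
    loop {
        let x = draw();
        // Reject NaN, infinities, values below 1, and values too large to
        // cast exactly. `u64::MAX as f64` rounds up to 2^64, so `x < 2^64`
        // guarantees the cast below truncates instead of saturating.
        if !(x >= 1.0 && x < u64::MAX as f64) {
            continue;
        }
        // Compare in integer space: `n as f64` may round up when n > 2^53,
        // so a float comparison against n could accept n + 1.
        let k = x as u64;
        if k <= n {
            return k;
        }
    }
}
```

Compared with clamping, this never piles extra probability onto `n`; the cost is an occasional extra draw.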
Resample makes sense. The
It's an empty playground. Maybe the playground cannot be used to share code?
This looks fine to me.
Couldn't agree more. |
Sorry, I forgot to click the share button: |
I think the worst case in our discussion is According to Wikipedia, in this case the probability to get , where According to Wikipedia, if Let's assume that due to floating point error, the largest Let's assume that This is approximately the bias that will be added to the possibility of getting
This is still a problem. However, I believe it's better to let the user take that risk than to risk a denial of service. |
The distribution was implemented in #1136 by @vks with review by @saona-raimundo. There was already justification for the floating-point output. I will leave this up to @vks, suggesting a few possibilities:
I wonder also if we should have a new crate like |
Rejection sampling is always a "possible infinite loop", but this does not matter, because the probability of it being infinite approaches zero. Usually, the probability of having more than a few rejection iterations is extremely small, because the probability to reject So I don't think this works as an argument against rejection sampling. Also, the implementation uses rejection sampling anyway. NumPy also uses rejection sampling to make sure the output fits into an integer. In the case of the Zipf distribution, I'm not convinced by the arguments to make it return integers, for the reasons @dhardy linked, and:
The output is already constrained to be smaller than or equal to I would suggest the following changes:
However, I would like to know more about common use cases that motivate 2. before implementing it. |
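To put a number on the rejection-sampling argument above: under the assumption (not stated in the thread) that each iteration rejects independently with the same probability $p < 1$, the iteration count $T$ is geometric, so

```latex
P(T > k) = p^{k}, \qquad \mathbb{E}[T] = \frac{1}{1 - p}.
```

For example, with $p = 0.5$ the chance of needing more than 64 iterations is $2^{-64} \approx 5.4 \times 10^{-20}$, and the expected number of iterations is 2.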
I think this is better. The strongest motivation for this is when we use the returned value as an index to an array, we need to make sure that the value won't be out of bound. Even if the floating point value returned is mathematically |
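A small sketch of the indexing concern, assuming samples are ranks starting at 1 and are used to index a slice of length `n`. The helper name `sample_to_index` is hypothetical, not a `rand_distr` API; it returns `None` rather than letting an off-by-one float produce an out-of-bounds index:

```rust
/// Hypothetical helper (not a rand_distr API): map a float sample that should
/// lie in 1..=len to a zero-based index, or None if it falls outside.
fn sample_to_index(sample: f64, len: usize) -> Option<usize> {
    // `!(sample >= 1.0)` also catches NaN, for which every comparison is false.
    if !(sample >= 1.0) {
        return None;
    }
    // Float-to-integer `as` casts truncate toward zero and saturate on
    // overflow, so a huge sample becomes usize::MAX and fails the check below.
    let rank = sample as usize;
    if rank <= len {
        Some(rank - 1)
    } else {
        None
    }
}
```

A caller could then write `sample_to_index(x, items.len()).map(|i| &items[i])`, which turns a precision glitch into a `None` instead of a panic. An integer-returning distribution with a documented upper bound would make this check unnecessary.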
I am not against it, and I like that we can give guarantees about the output. "integer" is not a type, and, if we are going to offer it for indexes, then Lastly, the guarantee is with respect to the input to the constructor, so the constructor should accept either |
Fair comments. At this point I believe a PR would be welcome (unless @vks has plans?). |
I'm not sure about supporting lots of integer types. It increases the API surface and code size for little benefit. I think it's preferable to only support |
Let's not worry about breaking changes now. Thus, it sounds like we want to:
|
I agree that just supporting |
And numpy.random.Generator.zipf returns integers:
Therefore, I think it is not reasonable for rand_distr::Zipf, a distribution of integers, to return floating-point values. The problem with returning floating-point values is that they are not precise: if we cast the returned floating-point value to usize and use it as an index, an out-of-bounds error may occur due to floating-point error.
[1] https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.zipf.html