
It is not reasonable for rand_distr::Zipf to return floating-point values #1323

Closed
seekstar opened this issue Jun 28, 2023 · 16 comments · Fixed by #1517
Assignees
Labels
E-easy Participation: easy job

Comments

@seekstar

seekstar commented Jun 28, 2023

The Zipf distribution (also known as the zeta distribution) is a discrete probability distribution that satisfies Zipf’s law: the frequency of an item is inversely proportional to its rank in a frequency table [1].

And numpy.random.Generator.zipf returns integers:

```python
>>> import numpy as np
>>> type(np.random.default_rng().zipf(4.0))
<class 'int'>
```

Therefore, I think it is not reasonable for rand_distr::Zipf, a distribution over integers, to return floating-point values. The problem is that floating-point values are imprecise: if we cast the returned value to usize and use it as an array index, an out-of-bounds access may occur due to floating-point error.

[1] https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.zipf.html

@dhardy
Member

dhardy commented Aug 21, 2023

Sorry for not responding sooner. Since our implementation is built on floating-point arithmetic, a different approach would be required. However...

f64 exactly represents integers up to 2^53, which should be larger than any list index you will ever need. The Zipf implementation may produce precision errors sooner than this (I didn't check), but should still be precise enough for indexing any reasonable sized list.
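The 2^53 claim is easy to verify directly; a minimal standalone check:

```rust
fn main() {
    // Every integer up to 2^53 round-trips exactly through f64.
    let n: u64 = 1 << 53;
    assert_eq!(n as f64 as u64, n);
    // 2^53 + 1 is the first integer f64 cannot represent:
    // it rounds back down to 2^53.
    assert_eq!((n + 1) as f64 as u64, n);
}
```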

@seekstar
Author

seekstar commented Aug 23, 2023

I agree that simply casting the returned f64 to usize won't cause any problems in most cases. But I think it is better for this crate to handle the floating-point error properly and return precise integers to users, so that users do not have to worry about the floating-point error.

You may want to reference numpy's implementation, which also seems to be built on floating-point arithmetic but returns integers: https://github.com/numpy/numpy/blob/3032e84ff34f20def2ef4ebf9f8695947af3fd24/numpy/random/src/distributions/distributions.c#L1000

I don't know how they guarantee that the returned integer won't be out-of-bound, though. There is no boundary check. Maybe math magic? But I think it would be more robust for rand_distr to do a boundary check before returning because an out-of-bound error typically makes the program panic.

Update: I just found that the numpy implementation does not have an n parameter like this crate does. So it only needs to guarantee that the returned integer fits into the integer type: https://github.com/numpy/numpy/blob/3032e84ff34f20def2ef4ebf9f8695947af3fd24/numpy/random/src/distributions/distributions.c#L1017

rand_distr::Zipf has an n parameter, so I think it should do a boundary check against n.
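A boundary check of the kind suggested here could look like the following sketch. The function name and shape are hypothetical, not part of rand_distr:

```rust
/// Hypothetical helper: validate a float Zipf sample against n before
/// handing it out as an integer. Returns None if the truncated sample
/// falls outside the valid range 1..=n.
fn checked_sample(x: f64, n: u64) -> Option<u64> {
    let k = x as u64; // Zipf samples are integral-valued floats
    if (1..=n).contains(&k) { Some(k) } else { None }
}
```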

@dhardy
Member

dhardy commented Aug 24, 2023

The first possible error is that we convert n: u64 to F via NumCast::from. Probably we should return an error in cases of precision loss or truncation. We could test by converting back to u64 (a plain `as` cast is not enough on its own) or by using another library.

Then we should choose what output type to support. It may be better to implement Distribution<N> for N in u32, u64, usize (we would need some trait bound, maybe PrimInt). We could replace F with f64 if required but it may not be a problem to keep F: Float.

Then, as you say, test that the output is in fact less than n. If not, I think we can just loop (theoretically this might bias the result, but such bias should not be significant relative to the precision of F).

I can work on this if there are no other takers, but it isn't high on my priority list.
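The "test that the output is less than n, and loop if not" idea above can be sketched as follows (an assumed shape for illustration, not the crate's actual code):

```rust
/// Keep drawing raw float samples until one falls in 1..=n.
/// `raw` stands in for the underlying floating-point Zipf sampler.
fn sample_in_bounds<F: FnMut() -> f64>(mut raw: F, n: u64) -> u64 {
    loop {
        let k = raw() as u64;
        if k >= 1 && k <= n {
            return k;
        }
    }
}
```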

@seekstar
Author

The first possible error is that we convert n: u64 to F via NumCast::from. Probably we should return an error in cases of "precision loss or truncation".

Let's think twice about this. I don't know much about the Zipf distribution, but my impression is that when n gets really large, the probability of getting a large number is very small. So even if converting a large u64 to f64 is lossy, as long as we can guarantee the precision of the probability distribution for small numbers, we do not have to return an error.

Then, as you say, test that the output is in fact less than n. If not, I think we can just loop (theoretically this might bias the result, but such bias should not be significant relative to the precision of F).

I don't know what loop actually means here, but I think it's fine to clamp the output to n if it is greater than n. Large outputs are not common anyway, even when n is large.

@dhardy
Member

dhardy commented Aug 25, 2023

I don't know what loop actually means

In this case, resample.

Intuitively, your suggestion to ignore loss of precision on creation and to clamp the result makes sense. But if n is large enough not to be exactly representable by the floating-point format, then (except for one specific value of n) there will be at least one value less than n which can never be sampled. Also, n will be sampled too often. Given that we're talking about the extreme tail end of a distribution, though, probably none of this bias matters.

By the way, I notice that output is in the range 1 ..= n (inclusive of n) unlike what a lot of programmers are used to (0 .. n). Possibly this should be clarified in the docs.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021

(So, yes, I think we should follow your suggestion and clamp the output to n. But also my previous suggestion of making this generic over n: N so that the output can be a usize.)

@seekstar
Author

Resampling makes sense. The numpy implementation also resamples if the result is out of bounds. But resampling theoretically introduces the possibility of an infinite loop, which might limit its use cases. So I think clamping the output to n is better. As you said: "given how we're talking about the extreme tail end of a distribution probably none of this bias matters".

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021

It's an empty playground. Maybe the playground cannot be used to share code?

But also my previous suggestion of making this generic over n: N so that the output can be a usize.

This looks fine to me.

By the way, I notice that output is in the range 1 ..= n (inclusive of n) unlike what a lot of programmers are used to (0 .. n). Possibly this should be clarified in the docs.

Can't agree more.

@dhardy
Member

dhardy commented Aug 27, 2023

@seekstar
Author

I think the worst case in our discussion is s = 1.0 instead of 2.0. Let's estimate how much bias there would be if we clamp the output to n.

According to Wikipedia, in this case the probability to get k is:

$$f(k, n) = \frac{1}{H_n} \frac{1}{k}$$

where

$$H_n = \sum_{k=1}^n \frac{1}{k}$$

According to Wikipedia, if n gets large, $H_n$ will converge to $\ln n + \gamma, \gamma = 0.5772156649 \cdots$

Let's assume that due to floating-point error, the largest $(1 - \alpha) n$ possible outputs are all mapped to a number > n and all clamped to n. Then the probability of getting those $(1 - \alpha) n$ outputs is:

$$\sum_{k=(1 - \alpha) n+1}^n \frac{1}{H_n} \frac{1}{k} = \frac{1}{H_n} \sum_{k=(1 - \alpha) n+1}^n \frac{1}{k} = \frac{1}{H_n} (H_n - H_{(1 - \alpha)n}) \approx \frac{1}{\ln n + \gamma} (\ln n - \ln (1 - \alpha) n) = - \frac{1}{\ln n + \gamma} \ln (1 - \alpha) $$

Let's assume that $\alpha$ is small. Then $\ln (1 - \alpha) \approx -\alpha$, and the above probability is approximately:

$$\frac{\alpha}{\ln n + \gamma}, \gamma = 0.5772156649 \cdots $$

This is approximately the bias that will be added to the probability of getting n. I don't know whether it is acceptable, but resampling can definitely solve this problem. Although personally I don't like the idea of introducing a possible infinite loop, I have to admit that in this case, resampling is the more conservative choice.

but if n is large enough not to be exactly representable by the floating-point format then (excepting for one specific value of n) there will be at least one value less than n which can never be sampled.

This is still a problem. However, I believe it's better to let the user take this risk than to risk a denial of service.
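The small-α approximation derived above is easy to check numerically. Here n and α are arbitrary example values chosen for illustration:

```rust
fn main() {
    let n: f64 = 1e12;     // example: a very large n
    let alpha: f64 = 1e-4; // example: fraction of the top outputs clamped
    let gamma: f64 = 0.5772156649; // Euler-Mascheroni constant
    // Tail mass from the derivation: -ln(1 - alpha) / (ln n + gamma)
    let exact = -(1.0 - alpha).ln() / (n.ln() + gamma);
    // Small-alpha approximation: alpha / (ln n + gamma)
    let approx = alpha / (n.ln() + gamma);
    // For small alpha the two agree to better than 0.1%.
    assert!(((exact - approx) / exact).abs() < 1e-3);
}
```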

@dhardy
Member

dhardy commented May 22, 2024

The distribution was implemented in #1136 by @vks with review by @saona-raimundo. There was already justification for the floating-point output.

I will leave this up to @vks, suggesting a few possibilities:

  1. No change, or just documentation.
  2. Also implement Distribution<u64> for Zipf<F>. This results in a conflicting implementation error since "upstream crates may add a new impl of trait num_traits::Float for type u64 in future versions". If we use a local Float trait this conflict is not a problem. It could lead to type inference errors in usage (since two parametrisations of Distribution are implemented). It also forgoes the possibility of asserting that the output is constrained by n.
  3. Add a wrapping type, IntegralZipf, which stores the input n and constrains (or asserts constraint of) the output.

I wonder also if we should have a new crate like rand_discrete for discrete distributions?

@vks
Collaborator

vks commented May 22, 2024

Rejection sampling is always a "possible infinite loop", but this does not matter, because the probability of it being infinite approaches zero. Usually, the probability of needing more than a few rejection iterations is extremely small, because the probability of rejecting $n$ times decreases exponentially as $p^n$.

So I don't think this works as an argument against rejection sampling. Also, the implementation uses rejection sampling anyway. Numpy also uses rejection sampling to make sure the output fits into an integer.
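The exponential decay of $p^n$ is easy to see numerically: even with an assumed (pessimistically high) 50% per-round rejection probability, 64 consecutive rejections are astronomically unlikely.

```rust
fn main() {
    let p: f64 = 0.5; // assumed per-round rejection probability
    // Probability of rejecting 64 times in a row: p^64 = 2^-64
    let p64 = p.powi(64);
    assert!(p64 < 1e-19);
}
```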

In the case of the Zipf distribution, $n$ can be large, possibly larger than $2^{64}$ (for $n \to \infty$ it converges to the Zeta distribution). Therefore, I would rather consider making n a float (of type F), to which it is converted anyway. (However, that's a breaking change.)

I'm not convinced by the arguments to make it return integers, for the reasons @dhardy linked, and:

  • The implementation uses floats internally, converting the output to integers is IMHO mostly cosmetic, lossy, and potentially limiting (if we changed n to be a float).
  • Mathematically, integers are unbounded, unlike u64 or f64. f64 can be a better approximation for large integers ($\gg 2^{64}$).

The output is already constrained to be smaller than or equal to n by construction of the sampling function. It is possible that this invariant is broken by rounding errors; we could do some fuzzing to check, but I don't expect it to be a problem.

I would suggest the following changes:

  1. Document that the samples are <= n. This is already suggested by the docs, but could be more explicit.
  2. Consider offering two Zipf implementations (a breaking change):
    1. n and the returned samples are floats.
    2. n and the returned samples are integers.

However, I would like to know more about common use cases that motivate 2. before implementing it.

@seekstar
Author

Consider offering two Zipf implementations (a breaking change):

n and the returned samples are floats.
n and the returned samples are integers.

I think this is better. The strongest motivation is that when we use the returned value as an index into an array, we need to make sure that the value won't be out of bounds. Even if the returned floating-point value is mathematically <= n, it's unclear whether it will still be <= n after we cast it to an integer. If we provide an additional implementation that returns an integer and document that the returned value is <= n, it will be clear that the returned value can safely be used as an array index.
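The indexing use case can be sketched like this. The adapter function is hypothetical, written here only to illustrate what a user currently has to do by hand:

```rust
/// Hypothetical adapter: turn a (1-based, integral-valued) float Zipf
/// sample into a safe 0-based index into a slice of length `len`
/// (len must be >= 1). The final clamp guards against rounding ever
/// pushing the sample past n.
fn to_index(x: f64, len: usize) -> usize {
    (x as usize).saturating_sub(1).min(len - 1)
}
```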

@saona-raimundo
Contributor

I am not against it, and I like that we can give guarantees about the output.
I have a few comments about the choice of types if one would like to continue this line.

"integer" is not a type, and if we are going to offer it for indexing, then usize should be used.
On the other hand, u64 and u32 are what "should" represent numerical integers, so they should be supported too.

Lastly, the guarantee is with respect to the input to the constructor, so the constructor should accept either u32, u64, or usize, to impose the guarantee we want.

@dhardy
Member

dhardy commented Jul 24, 2024

Fair comments.

At this point I believe a PR would be welcome (unless @vks has plans?).

@vks
Collaborator

vks commented Jul 25, 2024

I'm not sure about supporting lots of integer types. It increases the API surface and code size for little benefit. I think it's preferable to only support u64 and let users cast it to smaller integer types.

@dhardy
Member

dhardy commented Aug 10, 2024

Let's not worry about breaking changes now. Thus, it sounds like we want to:

  • Change n to type F in the current implementation.
  • Either also implement Zipf for integer types (1) or add a new type like ZipfU to support integer types (mostly depending on practicality)
  • (1) choose which integer types to support. Since the impls should just be wrappers over the f64 impl (2) and can be implemented via macro, the extra API surface of supporting u32, u64 and usize over just u64 is negligible
  • (2) For portability, usize must be implemented via f64 on both 32- and 64-bit platforms, but probably u32 should use f64 as its backend too (both because u32 supports values that f32 doesn't and because f64 should be just as fast as f32 on most modern targets).
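The macro idea in (1) could be sketched roughly like this, with a stand-in for the real f64 sampler. All names here are illustrative, not the crate's API:

```rust
// Stand-in for the crate's floating-point Zipf sampler; a real
// implementation would draw from the distribution.
fn sample_f64(_n: f64) -> f64 {
    1.0
}

// Generate thin integer front-ends over the single f64 backend.
macro_rules! impl_int_zipf {
    ($name:ident, $ty:ty) => {
        fn $name(n: $ty) -> $ty {
            // Clamp defensively so the cast can never exceed n.
            sample_f64(n as f64).min(n as f64) as $ty
        }
    };
}

impl_int_zipf!(sample_u32, u32);
impl_int_zipf!(sample_u64, u64);
impl_int_zipf!(sample_usize, usize);
```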

@seekstar
Author

I agree that just supporting u64 is fine, as long as support for u32 can be added without a breaking change in the future, in case someone wants it.
