Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rand_distr: Add Zipf distribution #1136

Merged
merged 26 commits into from
Aug 4, 2021
Merged

rand_distr: Add Zipf distribution #1136

merged 26 commits into from
Aug 4, 2021

Conversation

vks
Copy link
Collaborator

@vks vks commented Jun 13, 2021

Fixes #1069.

@vks
Copy link
Collaborator Author

vks commented Jun 13, 2021

cc @kaimast

@vks
Copy link
Collaborator Author

vks commented Jun 13, 2021

I did not compare to the reference implementation in numpy yet.

Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rand_distr/src/zipf.rs Outdated Show resolved Hide resolved
@vks
Copy link
Collaborator Author

vks commented Jun 14, 2021

May be worth comparing this? https://jasoncrease.medium.com/rejection-sampling-the-zipf-distribution-6b359792cffa

On a first look, I'm not sure this method is different, but I would have to do the math to be sure.

@vks
Copy link
Collaborator Author

vks commented Jun 15, 2021

@dhardy The parametrization of the Zipf distribution you linked is different:

  • It uses the rank N and the exponent s >= 0, returning an integer from [1, N]. This follows Wikipedia.
  • The current implementation only has the exponent a >= 0 as a parameter and returns integers >= 1. This follows numpy.

Which paramtrization should I implement?

@dhardy
Copy link
Member

dhardy commented Jun 15, 2021

Sorry, I don't feel qualified to answer that.

@vks
Copy link
Collaborator Author

vks commented Jun 15, 2021

@dhardy Fair enough, I'm also not sure what's best without having a use case in mind.

@kaimast Any preferences? What's your use case for sampling the Zipf distribution?

@vks
Copy link
Collaborator Author

vks commented Jun 15, 2021

I also found an R library using the N-s parametrization.

The a parametrization is a limit of the N-s parametrization for N -> oo. So maybe N-s is a better choice? Or both should be supported?

Wikipedia calls a the zeta distribution, so maybe we should do the same.

@vks
Copy link
Collaborator Author

vks commented Jun 16, 2021

@dhardy I chose to implement both distributions. Zeta is equivalent to numpy.random.zipf, Zipf is equivalent to the Java implementation in the blog post you linked. I had to derive the correct equations for the s = 1 special case.

@dhardy dhardy added D-review Do: needs review E-help-wanted Participation: help wanted labels Jul 8, 2021
@saona-raimundo
Copy link
Contributor

Hi @vks!
@dhardy asked me if I could help reviewing this implementation, so I hope you don't mind some comments here and there :)

Looking at the previous comments you had quite a journey with names (and yeah, it is not very clear). But you got it right!

I have looked only at the implementation of Zeta for now. I added some suggestions on the documentation, but wanted to highlight that I have some questions about the sampling algorithm: I found the reference of the algorithm [Devroye86], so we have proved correctedness :) and my only doubts is about floating point representability... Well, we will talk more in the comments, I think.

Naming

I had a look at some references and leave a summary here. Maybe this is useful for the documentation, but I will leave it to you.

  1. The general family of distributions is named Lerch [JKK05], after the Lerch zeta function. It is also called Hurwitz-Lerch [USB18].

  2. There are many related distributions (at least 8 variants) [JKK05].

  3. As you noticed, there are two variants (sometimes called the same): Zeta and Zipf.

  • Zeta distribution has support over all integers 1, 2, ...
  • Zipf distribution has support only over 1, 2, ..., N (it is simply the truncated version).
  1. Other names for Zeta are: discrete Pareto, Riemann-Zeta, Zipf and Zipf–Estoup.

  2. Special cases of Zipf distribution have names too: Estoup for s = 1 and Lokta for s = 2.

References

  • [JKK05] Section 11.2.20, page 526.
  • [Devroye86] Section 6.1, page 551.
  • [USB18] Chapter 3, Section 3.2.3, page 117.

@saona-raimundo
Copy link
Contributor

Perfect! I understand the reasoning.

I would suggest saying something in the documentation about returning floats for integer random variables, but that deserves its own discussion and PR. This would involve investigating what is the effective domain of our implementations... information which is valuable for the user, but hard to get (or depends on parameters).

@vks
Copy link
Collaborator Author

vks commented Jul 28, 2021

I would suggest saying something in the documentation about returning floats for integer random variables, but that deserves its own discussion and PR. This would involve investigating what is the effective domain of our implementations... information which is valuable for the user, but hard to get (or depends on parameters).

There was some discussion in #987 and #1093, which resulted in some documentation in the Rand book.

Copy link
Contributor

@saona-raimundo saona-raimundo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the comments about Zipf.

I have some disagreements with some of the computations, I tried to explain the reasoning in each case. Feel free to ask for more clarification.

Notably, I also propose to introduce one more error variant NTooBig.

rand_distr/src/zipf.rs Show resolved Hide resolved
rand_distr/src/zipf.rs Show resolved Hide resolved
rand_distr/src/zipf.rs Show resolved Hide resolved
rand_distr/src/zipf.rs Show resolved Hide resolved
rand_distr/src/zipf.rs Outdated Show resolved Hide resolved
rand_distr/src/zipf.rs Outdated Show resolved Hide resolved
rand_distr/src/zipf.rs Outdated Show resolved Hide resolved
@saona-raimundo
Copy link
Contributor

There was some discussion in #987 and #1093, which resulted in some documentation in the Rand book.

I see! Thanks, I did not know about it :)

rand_distr/src/zipf.rs Outdated Show resolved Hide resolved
@vks
Copy link
Collaborator Author

vks commented Aug 2, 2021

Another case we might want to consider: For large a, the parameter b will be infinite, which results in all x being accepted. This is probably the best we can do, but we could also rather return an error when Zeta is constructed.

@vks
Copy link
Collaborator Author

vks commented Aug 3, 2021

Because this PR is already quite large, I would like to leave the extended distribution tests for a follow-up PR.

Copy link
Contributor

@saona-raimundo saona-raimundo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dhardy we are finished with the review and agree on adding distributional tests in a subsequent PR.

@vks vks requested a review from dhardy August 4, 2021 09:56
Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. On that basis I approve merging. And thanks for stepping in to review.

@vks
Copy link
Collaborator Author

vks commented Aug 4, 2021

@saona-raimundo Thanks for the extensive review!

@dhardy Should I squash before merging?

@vks vks removed the D-review Do: needs review label Aug 4, 2021
@dhardy
Copy link
Member

dhardy commented Aug 4, 2021

@vks can't say I care too much whether or not it gets squashed. The reason I didn't merge myself is because I wasn't quite sure whether you were ready to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
E-help-wanted Participation: help wanted
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Zipf distribution
3 participants