[MetaSchedule][M3a] Add Sampling Primitive SampleCategorical. #8817
It seems like ForkSeed is analogous to what is called "splitting" in the random number generator literature. I'm not quite an expert on this, but I did do a bit of research into PRNGs for the Threefry implementation we have. Everything I read says that there are no proofs of the validity of splitting LCGs (is the method you use here from a paper?). The paper "Splittable Pseudorandom Number Generators using Cryptographic Hashing" provides some good explanations.
In practice, I expect we will see some issues. If this function somehow perfectly bisects the space of random numbers generated by this PRNG, then we could expect to start seeing repeats of previous random numbers after 31 splits. Given that this splitting does not perfectly bisect the space, I'd assume that we start seeing repeats much sooner. Repeating portions of the search space may mean that we are not able to visit the entire search space during tuning, or that we bias results towards a certain section of the space.
I'd suggest we adopt a splittable PRNG here, as that appears to be what we need. Maybe we can find an existing implementation online, as implementing your own PRNG can have subtle issues.
LCGs are pretty easy to crack in terms of security, but our search isn't something where there is an adversary working against you haha.
To be clear, we don't split the RNG too many times. It is only used in multi-threaded search, where we split the RNG once for each thread. In practice, we didn't see repetition or any problem caused by repetition when running tens of real-world workloads.
Don't most modern machines have at least 31 hyperthreads, i.e., won't we split at least 31 times on those machines?
I actually agree with Tristan's theory in general. Thank you for bringing this up! Indeed, seeding parallel PRNGs requires some really careful thought to avoid quick repetition. An LCG may not be the best candidate to ensure such a property.
Fortunately, in our particular use case it is not a practical problem. Here is a quick example, supposing we have 128 threads and 10k trials: https://gist.github.com/junrushao1994/ea986add81b01b89fd99a5a7d41d087a. The result is that there is no repetition at all. This is a harsher condition than our practical usage.
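A check of this kind can be sketched roughly as follows. This is a minimal illustration, not the linked gist: the LCG parameters are assumed to match `std::minstd_rand`, and `fork_seeds` is a hypothetical fork rule chosen only for the example.

```python
# Minimal sketch: count states that show up in more than one forked LCG stream.
# Assumptions (not taken from the PR or the gist): minstd_rand-style parameters
# and a hypothetical fork rule that derives child seeds from the parent stream.

MULTIPLIER = 48271        # assumed, as in std::minstd_rand
MODULUS = 2147483647      # 2**31 - 1, assumed


def lcg_next(state):
    """Advance a multiplicative LCG by one step."""
    return (state * MULTIPLIER) % MODULUS


def fork_seeds(root_seed, num_threads):
    """Hypothetical fork rule: one child seed per step of the parent stream."""
    seeds = []
    parent = root_seed
    for _ in range(num_threads):
        parent = lcg_next(parent)
        # "or 1" avoids the degenerate all-zero state of a multiplicative LCG.
        seeds.append((parent * 32767) % 1999999973 or 1)
    return seeds


def count_cross_stream_repeats(seeds, num_trials):
    """Count states produced by a stream that an earlier stream already produced."""
    first_owner = {}  # state -> index of the stream that produced it first
    repeats = 0
    for thread, state in enumerate(seeds):
        for _ in range(num_trials):
            state = lcg_next(state)
            if first_owner.setdefault(state, thread) != thread:
                repeats += 1
    return repeats


if __name__ == "__main__":
    seeds = fork_seeds(root_seed=42, num_threads=128)
    print(count_cross_stream_repeats(seeds, num_trials=10_000))
```

Because the full LCG state is also its output, a single shared state means the two streams coincide from that point on, so a count of zero over the tested window says the streams do not overlap there. It says nothing about longer runs or other seeds.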
To further address the issue, architecturally we have designed the PRNG interface to be generic and compliant with the STL, and easily switchable to any splittable PRNG in the future if there are new interesting use cases. Therefore, I assume it won't constitute an architecture issue :-)
Thanks again for the discussion!
Coming late to the discussion. I read the thread yesterday evening and wanted
to spend more time thinking so did not reply immediately.
Please let me try to summarize and share some thoughts.
As we generate random numbers, the PRNG state cycles through the space of states
and eventually comes back to its initial state, forming a cycle.
When we split the state of a PRNG to start parallel generators, if the
split states are "too close" in terms of their traversal distance,
then we get repetitions, or streams that are correlated with each other.
The following points about the PRNGs
that repeats can happen in different streams. The nature of such
possibility depends on the splitting method, the seed and PRNG itself.
My read is that @tkonolige is right about A0 and A1, and it seems we also agree.
The particular number 31, however, might not directly match this particular scenario (as Junru's empirical experiment suggested otherwise). The repeat
depends on the splitting method, the seed and the PRNG itself, and I personally
cannot tell for sure what will happen in this case.
Because of A0 and A1, it would be helpful to consider the implications of
using possibly correlated PRNG streams. In our case, we use the PRNG
to generate explorations in the search space and perform the following task:
To make things simple, let us assume that there are two streams in the 32 threads
that are exactly identical to each other. In this case, we waste the computation
from one of the threads, because it will produce exactly the same set of samples.
Because our goal is to explore the space and find a maximum point, this would mean
that the computation from that particular thread is wasted, and we get less statistical
efficiency from the sampling process.
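The identical-streams case can be illustrated with a small sketch. It uses Python's standard `random.Random` as a stand-in generator and a hypothetical integer search space, neither of which comes from the PR:

```python
import random


def distinct_candidates(seeds, samples_per_thread, space_size):
    """Number of distinct points visited when each seed drives one thread.

    The search space is modeled as the integers [0, space_size), which is
    an assumption made only for this illustration.
    """
    visited = set()
    for seed in seeds:
        rng = random.Random(seed)
        for _ in range(samples_per_thread):
            visited.add(rng.randrange(space_size))
    return len(visited)


if __name__ == "__main__":
    # Four threads, two of which share a seed and so emit identical streams.
    print(distinct_candidates([1, 2, 3, 3], 1000, 10**9))
    # Three threads with the same distinct seeds.
    print(distinct_candidates([1, 2, 3], 1000, 10**9))
```

The duplicated thread adds no new points, so four threads with one duplicate explore exactly what three threads would.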
At a high level, we have two parameters we can tweak: the number of sampling steps n
and the number of threads K. What we are saying is that the statistical efficiency of running
uniform sampling over the search space scales linearly with n, but perhaps less optimally
with respect to K. For other, more complicated tasks, such as estimating the density of certain regions,
we know that sampling and averaging over time (n) can always give us the right estimate.
Correlation across streams would make averaging over streams (K) less efficient because the
random numbers are not independent, but we will still get the correct mean as we increase n.
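One way to make the dependence on n and K concrete is the textbook variance of an average of correlated estimates. The simplifying assumptions here are not made in the thread: each of the K streams yields an estimate $\hat\mu_k$ with variance $\sigma^2/n$, and every pair of streams has the same correlation $\rho$.

```latex
\operatorname{Var}\!\left(\frac{1}{K}\sum_{k=1}^{K}\hat\mu_k\right)
  = \frac{\sigma^2}{nK}\,\bigl(1 + (K-1)\rho\bigr),
\qquad
K_{\mathrm{eff}} = \frac{K}{1 + (K-1)\rho}.
```

With $\rho = 0$ all K streams count fully; with $\rho = 1$ (identical streams) they count as a single stream. In either case the variance still shrinks as n grows.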
So in summary:
full-linear efficiency in terms of the number of threads K. The sampling will still
effective samples with respect to n (the number of samples we take in each thread).
Note that the real sampling scenario is much more complicated. As Junru's experiments
showed, repetition did not happen in this particular case for quite a long period of time.
Additionally, the sampling steps are dependent on each other (they form a Markov chain),
so it is hard to tell the end effect of correlation between random sequences (abcde) and (bcdef),
even if they are one step apart, while in most cases they can be many steps apart
empirically. To summarize
This does not rule out correlation (as it is not a proof), but it does suggest
that correlation may not be large in most cases.
The end effect of A2 has a quite close analogy in parallel computing: as we start to use
K threads, we may not get exactly Kx speedups. Except that in this case it is not due to
hardware reasons; it is due to the possible correlation. As in parallel computing,
we can run the program longer, in our case increasing n, to compensate for the possible loss
of efficiency. In short, A2 and A3 together might suggest that parallel stream correlation may not be a problem
that we need to solve perfectly, as long as it does not become very bad (e.g. all streams are the same).
Yesterday I did not think of A2 in particular, which might change our perspective. So I would
like to share this summary here. It would be great to get your takes as well.
@tqchen I happened to implement a BSGS discrete logarithm this morning. This is a simple but effective algorithm (though not effective enough for crypto) that we use in high school competitive programming: https://gist.github.com/junrushao1994/d32f265f5b4815d4b346d6022e95f394.
I use this script to find the minimal number of trials required for a first repeat to happen given num_threads threads, and here is the outcome when I set num_threads=1000.
In a word, in practice a conflict with the 0-th thread won't happen before 1407035 trials in the first 999 threads which split this way.
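For readers unfamiliar with the technique, here is a minimal baby-step giant-step sketch. It is not the linked gist: it assumes a multiplicative LCG with `std::minstd_rand`-style parameters, and `steps_between` is a name introduced only for this illustration. For such an LCG the state after k steps is a^k times the start state (mod m), so the distance between two stream positions is a discrete logarithm.

```python
from math import isqrt

MULTIPLIER = 48271        # assumed, as in std::minstd_rand
MODULUS = 2147483647      # 2**31 - 1 (prime), assumed


def steps_between(start, target, a=MULTIPLIER, m=MODULUS):
    """Smallest k >= 0 with (a**k * start) % m == target, or None if unreachable.

    Baby-step giant-step: write k = i * n + j with n ~ sqrt(m), tabulate
    a**j for all j, then look for target / start * a**(-i * n) in the table.
    Requires Python 3.8+ for modular inverses via pow().
    """
    h = target * pow(start, -1, m) % m   # reduce to solving a**k == h (mod m)
    n = isqrt(m) + 1

    baby = {}                            # a**j -> smallest such j
    value = 1
    for j in range(n):
        baby.setdefault(value, j)
        value = value * a % m

    factor = pow(a, -n, m)               # a**(-n) mod m
    gamma = h
    for i in range(n + 1):
        if gamma in baby:
            return i * n + baby[gamma]
        gamma = gamma * factor % m
    return None


if __name__ == "__main__":
    seed = 7
    later = seed * pow(MULTIPLIER, 10**6, MODULUS) % MODULUS
    print(steps_between(seed, later))
```

Given the start states of two threads, the returned k is how many draws one thread must make before it reaches the other thread's starting point, at a cost of roughly sqrt(m) table entries and lookups instead of stepping through the stream.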
@tqchen @junrushao1994 You both lay out a lot of interesting points here, but I'm not sure I have the expertise to evaluate them. The PRNGs themselves might appear simple, but analysis of their randomness is complicated and non-intuitive. Looking at the paper I linked above, you can get subtle bugs if the PRNG is used incorrectly. I've tested the LCG implemented in TVM with some PRNG test suites (you can try it yourself here: https://github.com/tkonolige/prng-tests), and it fails all of them. This result is unsurprising because LCGs aren't particularly good random number generators, but it just adds a little to my concern.
Given that we want to avoid any potential issues, why don't we just do things the right way and use a splittable PRNG? This page (https://www.pcg-random.org/posts/some-prng-implementations.html) lists some implementations of PRNGs, including SplitMix, which is splittable. (pcg-random appears to be a reputable source; it is run by the creator of the PCG family of PRNGs.) It seems like there is basically no overhead to just dropping this SplitMix implementation into the codebase. And then we won't have to worry about any bugs due to bad randomness.
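To give a feel for what such a generator looks like, here is a simplified sketch in the style of SplitMix64. It is not the implementation from the linked page: the class and method names are illustrative, and `split` here keeps a single fixed increment, whereas the full SplitMix design also derives a separate increment ("gamma") for each child stream.

```python
MASK64 = (1 << 64) - 1
GOLDEN_GAMMA = 0x9E3779B97F4A7C15   # odd constant added to the state on each step


def mix64(z):
    """64-bit finalizer that scrambles the counter-like state into an output."""
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


class SplitMix64:
    """Simplified SplitMix64-style generator (illustrative, fixed gamma)."""

    def __init__(self, seed):
        self.state = seed & MASK64

    def next(self):
        """Return the next 64-bit output."""
        self.state = (self.state + GOLDEN_GAMMA) & MASK64
        return mix64(self.state)

    def split(self):
        """Create a child generator seeded from the parent's next output.

        Simplification: the child reuses GOLDEN_GAMMA instead of deriving
        its own gamma as the full SplitMix algorithm does.
        """
        return SplitMix64(self.next())


if __name__ == "__main__":
    root = SplitMix64(seed=2021)
    workers = [root.split() for _ in range(4)]   # one generator per thread
    for worker in workers:
        print(hex(worker.next()))
```

Any real adoption would be better served by an existing, tested implementation, as suggested above, than by a hand-rolled version like this one.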
I don't want to block this PR on this, so I'm going to approve. But I would like us to fix this in the future.