Better generation of `Data.Vector`s #4
Comments
I've spent a few days on option 1) now and have come to the conclusion that it's also not workable without #3. The fundamental problem is that without being aware of the structure the IR is meant to represent, the connection between the drawn length & the subsequently generated elements is too weak for the generic shortlex shrinker to take advantage of. Notably, the authors of Hypothesis came to much the same conclusion. So for the short term, I'm probably just going to adopt the approach Hypothesis currently takes; that is, targeting a given average length by selecting the probability of a coin such that repeated flips will, more or less likely, result in that length (while simply stopping once the maximum length is hit). I'd really like it if that resulted not in an exponential decay of lengths but in a Beta distribution over the acceptable lengths, but that'll have to wait.
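For reference, a minimal sketch of that coin-based approach (the function name and the `p = avg / (avg + 1)` rule are my own illustration, not Supposition.jl's or Hypothesis' actual API):

```julia
using Random

# Draw a length by flipping a biased coin before each element, targeting
# a given average length. Sketch only; not Supposition.jl's actual API.
function draw_length(rng::AbstractRNG, avg_len::Real, max_len::Int)
    # A geometric "successes before the first failure" count with
    # continue-probability p has mean p / (1 - p); solving for the target
    # average gives p = avg / (avg + 1).
    p_continue = avg_len / (avg_len + 1)
    len = 0
    while len < max_len && rand(rng) < p_continue
        len += 1
    end
    return len
end

draw_length(Random.default_rng(), 50, 10_000)  # lengths averaging ≈ 50
```

The cap at `max_len` is the "just stopping if it hits the maximum length" behaviour; the tail of the resulting distribution still decays exponentially.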
I've merged the experiments with the Beta distribution & the approach taken by Hypothesis. Essentially, we're biasing the average length based on the mean of a Beta distribution, which itself is influenced by the "generation" of a testcase. This results in these histograms: [histograms omitted] The two plots show the distribution of single examples (which have been refactored in dafcba9 to sample from the earlier generations only) as well as of multiple examples (which sample from a wide spread of generations, even very late ones). I'm not entirely satisfied with these (ideally the larger generations would be more prominent than the middle ones), but the overall purpose (the ability to generate all sizes throughout the run) is served. In particular, this is able to generate empty vectors at the beginning, and to stop generating them relatively soon after.
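A minimal sketch of how that biasing could work (names and constants are illustrative only, not the merged implementation; I'm using the closed-form mean of the Beta distribution rather than sampling):

```julia
# Bias the target average length via the mean of a Beta distribution whose
# shape shifts with the testcase's generation. Illustrative sketch only.
function biased_average_length(generation::Int, max_generation::Int, max_len::Int)
    α = 1.0 + 10.0 * generation / max_generation  # grows over the run
    β = 3.0                                       # fixed spread
    beta_mean = α / (α + β)     # mean of Beta(α, β); no sampling required
    return beta_mean * max_len  # feed into a coin-flip loop as sketched above
end

biased_average_length(1, 100, 10_000)   # early generation ⇒ target ≈ 2_700
biased_average_length(95, 100, 10_000)  # late generation  ⇒ target ≈ 7_800
```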
Original issue

Currently, `Data.Vector`s generates an extremely skewed sampling of all allowed lengths (data generated through `map(length, (example(Data.Vectors(Data.Integers{UInt8}())) for _ in 1:10_000))`), which means that properties that require long vectors to fail won't ever fail with these defaults: [histogram omitted]
The default maximum length is `10_000`, and generating such long vectors is now extremely rare. The underlying reason is that the generation simply does an exponential backoff on the length:

Supposition.jl/src/data.jl, lines 251 to 270 in 5082b3f
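To see why long vectors are effectively unreachable, here is the back-of-the-envelope arithmetic (a simplified model assuming a fixed per-element continue-probability, not the actual code in `src/data.jl`):

```julia
# With a fixed per-element continue-probability p, the drawn length is
# geometrically distributed, so P(length ≥ n) = p^n.
p = 0.9       # assumed value, purely for illustration
n = 10_000    # the default maximum length
n * log10(p)  # ≈ -457.6, i.e. P(length ≥ 10_000) ≈ 10^-458
```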
There are two potential paths to fixing this:

1. Record in the `TestCase` which "generation" of testcase it is, make that information accessible & incorporate it into `produce`, so that later `TestCase`s are more and more likely to generate `Vector`s closer to the maximum length. The thinking here is that during generation, if a small example already exists we'd like to find it first, and only then try a longer & more complex one. This wouldn't change the naive plot above (shorter examples are better anyway), but it should be visible in a histogram-per-generation style plot, resulting in an overall more bell-curve-like shape over the entire generation process. (See the sketch after this list.)
2. A dedicated shrinker for `Vector`s that takes the high-level structure of "the start has the length, everything after that are the elements" into account when shrinking.

For the short term (and because it doesn't require additional new & large features), option 1) will probably suffice, but a long-term solution is likely going to be better with 2). Option 2) also has the advantage of requiring less memory overall, because we'd no longer have to encode the individual markers of where elements are into the choice sequence itself, but could rather store that information as metadata only.
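A rough sketch of the shape option 1) could take (the struct, field, and function names are hypothetical illustrations, not Supposition.jl's actual `TestCase` internals):

```julia
# Hypothetical sketch of option 1: a testcase that knows which generation
# it belongs to, so produce can bias vector lengths over the run.
struct GenerationAwareTestCase
    generation::Int       # which generation this testcase belongs to
    max_generation::Int   # total generations budgeted for the run
end

# Bias the target average length upward over the run: early testcases stay
# short, later ones approach the maximum length.
function target_average_length(tc::GenerationAwareTestCase, max_len::Int)
    frac = tc.generation / tc.max_generation  # 0.0 early → 1.0 late
    return max(1.0, frac * max_len)
end
```

Such a target could then drive a biased coin-flip loop like the one sketched under the first comment, yielding the more bell-curve-like per-generation histograms described above.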