
Better generation of Data.Vectors #4

Closed
Seelengrab opened this issue Feb 25, 2024 · 2 comments
Labels
bug Something isn't working

Comments

Seelengrab commented Feb 25, 2024

Currently, Data.Vectors generates an extremely skewed sampling of all allowed lengths:

[Image: bad_vec_distribution — histogram of generated vector lengths]

(data generated through map(length, (example(Data.Vectors(Data.Integers{UInt8}())) for _ in 1:10_000)))

which means that properties that require long vectors to fail won't ever fail with these defaults:

julia> using Supposition

julia> vecgen = Data.Vectors(Data.Integers{UInt8}());

julia> vecgen.max_size |> Int
10000

julia> vecgen.min_size |> Int
0

julia> @check db=Supposition.NoRecordDB() function findMin(v=vecgen)
           length(v) < 100
       end;
Test Summary: | Pass  Total  
findMin       |    1      1  

The default maximum length is 10_000, yet generating vectors anywhere near that length is extremely rare. The reason is that produce simply does an exponential backoff on the length:

Supposition.jl/src/data.jl

Lines 251 to 270 in 5082b3f

function produce(v::Vectors{T}, tc::TestCase) where T
    result = T[]
    # this does an exponential backoff - longer vectors
    # are more and more unlikely to occur
    # so results are biased towards min_size
    while true
        if length(result) < v.min_size
            forced_choice!(tc, UInt(1))
        elseif (length(result)+1) >= v.max_size
            forced_choice!(tc, UInt(0))
            break
        elseif !weighted!(tc, 0.9)
            break
        end
        push!(result, produce(v.elements, tc))
    end
    return result
end

There are two potential paths to fixing this:

  1. Tell each TestCase which "generation" of testcase it is, make that information accessible & incorporate it into produce.
    • This would mean implementing a form of iterative deepening, where later generations of TestCases are more and more likely to generate Vectors closer to the maximum length. The thinking here is that during generation, if a small example already exists we'd like to find it first, and only then try a longer & more complex one. This wouldn't change the naive plot above (shorter examples are better anyway), but should be visible in a histogram-per-generation style plot, resulting in an overall more bell-curve-like shape over the entire generation process.
  2. Implement support for custom shrinkers (#3), remove the exponential backoff completely, and implement a custom shrinker for Vectors that takes the high-level structure of "the start has the length, everything after that are the elements" into account when shrinking.

For the short term (and because it doesn't require additional new & large features), option 1) will probably suffice, but 2) is likely the better long-term solution. 2) also has the advantage of requiring less memory overall, because we would no longer have to encode the individual markers of where elements are into the choice sequence itself, but could store that information as metadata only. A rough sketch of what 1) could look like follows below.
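To make option 1) concrete, here is a minimal sketch of a generation-aware produce. It is purely illustrative: the tc.generation field and the exact scaling of the continuation probability are assumptions for the sketch, not part of the current TestCase.

    function produce_generation_aware(v::Vectors{T}, tc::TestCase) where T
        # hypothetical: later generations use a higher continuation probability,
        # so vectors closer to max_size become more likely over time.
        # tc.generation is an assumed field, capped here at generation 100.
        p = 0.5 + 0.49 * min(tc.generation / 100, 1.0)
        result = T[]
        while true
            if length(result) < v.min_size
                forced_choice!(tc, UInt(1))
            elseif (length(result) + 1) >= v.max_size
                forced_choice!(tc, UInt(0))
                break
            elseif !weighted!(tc, p)
                break
            end
            push!(result, produce(v.elements, tc))
        end
        return result
    end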

Seelengrab added the bug label Feb 25, 2024
Seelengrab commented Mar 1, 2024

I've spent a few days on option 1) now and have come to the conclusion that it's also not workable without #3. The fundamental problem is that without being aware of the structure the IR is meant to represent, the connection between the drawn length & the subsequently generated elements is too weak for the generic shortlex shrinker to take advantage of. Notably, the authors of Hypothesis come to much the same conclusion.

So for the short term, I'm probably just going to adopt the approach Hypothesis currently takes: targeting a given average length by choosing the success probability of a weighted coin such that the resulting lengths hit that average (and simply stopping once the maximum length is reached). I'd prefer the acceptable lengths to follow a Beta distribution rather than an exponential decay, but that'll have to wait.
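For reference, the coin probability that targets a given average length is easy to derive: if each element is appended with probability p and generation stops at the first failure, the length is geometrically distributed with mean p / (1 - p). A small sketch with illustrative names (not the actual Supposition.jl internals), ignoring truncation at max_size:

    # Continuation probability such that the expected vector length equals
    # `target_avg`: solve p / (1 - p) = target_avg for p.
    continuation_probability(target_avg) = target_avg / (target_avg + 1)

    # e.g. targeting an average length of 25 elements:
    # continuation_probability(25) ≈ 0.96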

Seelengrab commented

I've merged the experiments with the Beta distribution & the approach taken by Hypothesis. Essentially, we're biasing the average length using the mean of a Beta distribution, which itself is influenced by the "generation" of a testcase. This results in these histograms:

[Image: good_vec_distribution — histogram of lengths for single examples]

(from map(length, (example(Data.Vectors(Data.Integers{UInt8}())) for _ in 1:10_000)))

[Image: good_vec_distribution_example — histogram of lengths for multiple examples]

(from map(length, example(Data.Vectors(Data.Integers{UInt8}()), 10_000)))

These two show the distribution of single-examples (which have been refactored in dafcba9 to sample from the earlier generation only) as well as multiple examples (which sample from a wide spread of generations, even very late ones). I'm not entirely satisfied with these (ideally the larger generations would be more prominent than the middle ones), but the overall purpose (the ability to generate all sizes throughout Vectors.min_size:Vectors.max_size) has been achieved.

In particular, this still generates empty vectors at the beginning of a run, but stops producing them relatively soon afterwards.
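As a rough illustration of the biasing described above (the names and exact parameterisation are made up for the sketch, not taken from the implementation): the mean of a Beta(α, β) distribution is α / (α + β), so letting α grow with the generation pushes the targeted average length towards max_size.

    # Illustrative only: bias the targeted average length by the mean of a
    # Beta distribution whose shape follows the testcase "generation".
    function biased_average(gen, max_gen, min_size, max_size)
        α = 1 + gen                       # later generations -> larger α
        β = 1 + max(max_gen - gen, 0)     # ... and smaller β
        frac = α / (α + β)                # mean of Beta(α, β), always in (0, 1)
        return min_size + frac * (max_size - min_size)
    end

    # combined with the weighted coin from the earlier comment:
    # avg = biased_average(gen, 100, 0, 10_000); p = avg / (avg + 1)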
