[WIP] Efficient strategy for filtered sampling #1862
Conversation
Minor open question: How should this feature interact (or not interact) with …
Nice! It looks pretty good, and I really like the idea of making filtered choice accessible to users. However, I'd strongly prefer to build this in à la option three, for several reasons: …
All up, that would make this PR a patch release that does not affect the public API: e.g. "This patch improves the internal implementation of …"
I also think there's a tactical optimisation worth making here:

```python
allowed = [j for (j, v) in enumerate(values) if j != i and condition(v)]
if not allowed:
    data.mark_invalid()
j = choice(data, allowed)
data.stop_example()
data.draw_bits(block_length * 8, forced=j)
return values[j]
```

For large lists, or slow filter functions, calculating `allowed` up front can be expensive, so:

```python
# Optimistically choose a value, and hope that it passes the filter. We make
# three attempts because it's still O(1) and we might have just been unlucky.
for _ in hrange(3):
    i = integer_range(data, 0, len(values) - 1)
    if condition(values[i]):
        data.stop_example()
        return values[i]
# If that didn't work, we calculate which values are valid, choose one,
# and write the index to the buffer so we can "be lucky" when shrinking
# (see the strategy shrinking guide for details).
# As an optimisation, we speculatively choose an index into the list of
# allowed values before checking at most that many candidates. If the
# index is in range, we use it; if not, we have created a list of allowed
# indices and choose one.
# Use `len(values) - 2` because we know at least one is invalid,
# and sampling zero or one values is special-cased elsewhere.
allowed_i = integer_range(data, 0, len(values) - 2)
allowed = []
for j, value in enumerate(values):
    if condition(value):
        if len(allowed) >= allowed_i:
            break
        allowed.append(j)
else:
    if not allowed:
        data.mark_invalid()
    j = choice(data, allowed)
data.stop_example()
data.draw_bits((len(values) - 1).bit_length(), forced=j)
return values[j]
```

Does that make sense? The sampling is still uniform because we're equally likely to choose any of the allowed indices in the first step, and if we don't get a valid index we're equally likely to get any of them in the second too.
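The two-stage trick above can be modelled outside Hypothesis. A rough standalone sketch, using plain `random` instead of `ConjectureData` and skipping all shrinking bookkeeping (`filtered_sample` is a hypothetical name, not part of any real API):

```python
import random

def filtered_sample(values, condition, rng=random):
    """Toy model of the two-stage sampling trick: optimistic O(1) draws
    first, then a speculative rank into the allowed values, then a full
    scan as a fallback. Assumes len(values) >= 2; smaller lists would be
    special-cased upstream, as in the original sketch."""
    # Stage 1: up to three optimistic draws, one predicate call each.
    for _ in range(3):
        i = rng.randrange(len(values))
        if condition(values[i]):
            return values[i]
    # Stage 2: speculatively pick a rank among the allowed values, then
    # scan until we either reach that rank or exhaust the list.
    allowed_i = rng.randrange(len(values) - 1)  # uniform on 0 .. len - 2
    allowed = []
    for j, value in enumerate(values):
        if condition(value):
            if len(allowed) >= allowed_i:
                return value  # the speculative rank was in range
            allowed.append(j)
    if not allowed:
        raise ValueError("no value satisfies the filter")
    # Rank out of range: fall back to a uniform choice over what we found.
    return values[rng.choice(allowed)]
```

The uniformity argument holds here too: with k valid values out of n, each valid value is hit via its rank with probability 1/(n-1), plus the fallback path with probability (n-1-k)/(n-1) times 1/k, which sums to exactly 1/k.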
Yeah, that sounds like a good idea. After optimistic sampling fails, we would make n/2 predicate calls on average and n in the worst case, instead of always making n calls. And I also like the suggestion of trying multiple optimistic samples, for parity with the usual behaviour of …
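A quick back-of-envelope check of that n/2 figure, for the simple case where every value passes the filter (`scan_calls` is a hypothetical helper written just for this check):

```python
def scan_calls(n, allowed_i):
    """Predicate calls made by the speculative scan when all n values are
    valid: the (allowed_i + 1)-th valid value triggers the early return,
    so the scan costs allowed_i + 1 calls (capped at n)."""
    return min(allowed_i + 1, n)

# allowed_i is uniform on [0, n - 2], so average the cost over that range:
# sum(1 .. n - 1) / (n - 1) == n / 2 exactly.
n = 1000
average = sum(scan_calls(n, a) for a in range(n - 1)) / (n - 1)
```

With invalid values mixed in, the scan can still touch all n entries before finding the chosen rank, which is the worst case mentioned above.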
Our other tricks mean it's not quite this good, sadly: … IMO it's still worth doing though, as the impact in slow cases could still be very large, and it's never worse by more than a small constant amount (from drawing …).
Yeah, I'm with Zac on this one. Option 3 is the one that commits us to the least publicly; we can always add one of the two options later if we feel their lack. (I'm trying to avoid doing too much code at the weekend right now, so I haven't really looked at the code at all. Will try to get on top of my review backlog this coming week, and I'll take a look at this then.)
Shoo! Go do something analogue, we'll see you next week 😁
I already have plenty of things to change based on feedback so far, so there's no rush for extra review of what's here already. And option 3 seems like the way to go, so I'll drop the current user-visible stuff. |
I thought I would be able to just override the …

User-visible strategies get wrapped in a …

I'll have to find some reasonable way to resolve that situation.
You could make it so that …
Hmm, I would have tried teaching …
This would fix the immediate problem, but wouldn't it also undermine laziness, by forcing …?
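The laziness concern can be sketched with a toy model. Neither class below is Hypothesis's real implementation; the names and attributes are illustrative, showing only why dispatching `.filter()` through a lazy wrapper forces the wrapped strategy to be built:

```python
class SampledFromStrategy:
    """Stand-in for the real strategy class (names are illustrative)."""

    def __init__(self, values):
        self.values = list(values)

    def filter(self, condition):
        # Imagine the efficient filtered-draw specialisation lives here.
        return SampledFromStrategy(v for v in self.values if condition(v))


class LazyStrategy:
    """Toy lazy wrapper: the underlying strategy is only built on first
    use, so an override on SampledFromStrategy.filter is invisible until
    something forces `wrapped`."""

    def __init__(self, build):
        self._build = build
        self._wrapped = None
        self.forced = False  # tracked purely for demonstration

    @property
    def wrapped(self):
        if self._wrapped is None:
            self.forced = True  # laziness ends here
            self._wrapped = self._build()
        return self._wrapped

    def filter(self, condition):
        # Dispatching to the wrapped strategy's specialised filter()
        # forces the build immediately -- the objection raised above.
        return self.wrapped.filter(condition)
```

For example, `LazyStrategy(lambda: SampledFromStrategy(range(10)))` stays unbuilt until `.filter(...)` is called, at which point the wrapper has no choice but to construct the underlying strategy.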
Hmm. Yeah. I guess you could create a …

(Our API has some really weird constraints.)
Closing this for now because it requires some fiddly front-end work to deal with lazy wrappers, and that's currently not a priority for me. If I decide to have another go at this, I'll re-open or re-file. |
As foreshadowed by #1857 (comment), this PR takes the shrinker-friendly filtered choice used for rule selection in stateful tests, and generalizes it into a user-visible strategy.
Given a list of values and a filter function, this strategy chooses one of those values that satisfies the filter. This is done in a way that is efficient regardless of whether the allowed values are dense or sparse, and that isn't over-sensitive to earlier draws that influence the filter.
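For contrast, each of the two straightforward approaches fails in one regime. A minimal sketch, with hypothetical helper names and plain `random` rather than anything from Hypothesis:

```python
import random

def rejection_sample(values, condition, rng, max_tries=100_000):
    """Cheap when the allowed values are dense, but the expected number
    of predicate calls blows up as they become sparse."""
    for tries in range(1, max_tries + 1):
        v = rng.choice(values)
        if condition(v):
            return v, tries  # value plus predicate-call count
    raise RuntimeError("gave up")

def eager_sample(values, condition, rng):
    """Always makes len(values) predicate calls, even when almost every
    value would pass the filter."""
    allowed = [v for v in values if condition(v)]
    if not allowed:
        raise ValueError("no value satisfies the filter")
    return rng.choice(allowed), len(values)
```

The strategy in this PR aims to behave like the cheaper of the two in each regime: optimistic draws when the filter is dense, a single scan when it is sparse.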
The biggest open question is what the user-visible API should look like. I see three possibilities:
1. `sampled_from` …
2. `sampled_from(...).filter(...)` …
3. …

Currently I've gone with (1), partly because I'm leaning in that direction, and partly because it was the easiest to implement.
Since this is a public API, feedback on the strategy name and its arguments would also be good. I mostly wanted to choose something so that I could put this PR up for review.
(I haven't written up any documentation or release notes yet, because I want to wait until after the API is nailed down. I would also like to add some more tests, if I can come up with good ones.)