
[WIP] Efficient strategy for filtered sampling #1862

Closed
wants to merge 4 commits

Conversation

Zalathar
Contributor

@Zalathar Zalathar commented Mar 9, 2019

As foreshadowed by #1857 (comment), this PR takes the shrinker-friendly filtered choice used for rule selection in stateful tests, and generalizes it into a user-visible strategy.

Given a list of values and a filter function, this strategy chooses one of those values that satisfies the filter. This is done in a way that is efficient regardless of whether the allowed values are dense or sparse, and that isn't over-sensitive to earlier draws that influence the filter.


The biggest open question is what the user-visible API should look like. I see three possibilities:

  1. Expose a separate user-visible strategy.
  2. Add a special filtering argument to sampled_from.
  3. Add an implicit behaviour to sampled_from(...).filter(...).

Currently I've gone with (1), partly because I'm leaning in that direction, and partly because it was the easiest to implement.

Since this is a public API, feedback on the strategy name and its arguments would also be good. I mostly wanted to choose something so that I could put this PR up for review.

(I haven't written up any documentation or release notes yet, because I want to wait until after the API is nailed down. I would also like to add some more tests, if I can come up with good ones.)

@Zalathar Zalathar added opinions-sought tell us what you think about these ones! new-feature entirely novel capabilities or strategies labels Mar 9, 2019
@Zalathar
Contributor Author

Minor open question: How should this feature interact (or not interact) with enum.Flag, which has special support in sampled_from?

@Zac-HD
Member

Zac-HD commented Mar 10, 2019

Nice! It looks pretty good, and I really like the idea of making filtered-choice accessible to users. However, I'd strongly prefer to build this in à la option three, for several reasons:

  • Adding a special optimised strategy means that the current and obvious way to do this, sampled_from(...).filter(...), would no longer be the right way to do it.
  • Smaller, composition-focused APIs are much easier for us to maintain, extend, and teach.
  • Adding a special strategy later is backwards-compatible; taking one out is much harder.

All-up, that would make this PR a patch release that does not affect the public API: e.g. "This patch improves the internal implementation of sampled_from(...).filter(...) strategies. Generation may be somewhat faster or slower depending on the specific inputs, but shrinking should be much faster."

@Zac-HD
Member

Zac-HD commented Mar 10, 2019

I also think there's a tactical optimisation worth making here:

# Compute every allowed index up front, excluding the known-bad index i.
allowed = [j for (j, v) in enumerate(values) if j != i and condition(v)]
if not allowed:
    data.mark_invalid()
# Choose uniformly among the allowed indices, then force the chosen
# index into the buffer so that shrinking can replay it directly.
j = choice(data, allowed)
data.stop_example()
data.draw_bits(block_length * 8, forced=j)
return values[j]

For large lists, or slow filter functions, calculating allowed could be very expensive. Maybe that's just life, but I think a three-stage approach would reduce our average number of filter calls at the cost of one additional integer_range:

# Optimistically choose a value, and hope that it passes the filter.  We make
# three attempts because it's still O(1) and we might have just been unlucky.
for _ in hrange(3):
    i = integer_range(data, 0, len(values) - 1)
    if condition(values[i]):
        data.stop_example()
        return values[i]

# If that didn't work, we speculatively choose an index into the (not yet
# computed) list of allowed values, then check at most that many candidates.
# If the speculative index is in range, we use the corresponding value
# directly; if not, we have by then built the full list of allowed indices
# and choose one of those instead.  Either way we write the chosen index to
# the buffer so we can "be lucky" when shrinking (see the strategy shrinking
# guide for details).
# Use `len(values) - 2` because we know at least one value is invalid,
# and sampling zero or one values is special-cased elsewhere.
allowed_i = integer_range(data, 0, len(values) - 2)
allowed = []
for j, value in enumerate(values):
    if condition(value):
        if len(allowed) >= allowed_i:
            # `value` is the allowed_i-th allowed value; keep this `j`.
            break
        allowed.append(j)
else:
    if not allowed:
        data.mark_invalid()
    j = choice(data, allowed)
data.stop_example()
data.draw_bits((len(values) - 1).bit_length(), forced=j)
return values[j]

Does that make sense? The sampling is still uniform because we're equally likely to choose any of the allowed indices in the first step, and if we don't get a valid index we're equally likely to get any of them in the second too.
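For what it's worth, the uniformity argument can be sanity-checked with a standalone simulation of the two-stage scheme. This is a sketch only: plain `random` stands in for Hypothesis's draw machinery, and `filtered_choice` is an invented name.

```python
import random
from collections import Counter

def filtered_choice(values, condition, rng):
    # Stage 1: up to three optimistic draws.
    for _ in range(3):
        v = values[rng.randrange(len(values))]
        if condition(v):
            return v
    # Stage 2: speculatively pick an index into the (not yet computed)
    # list of allowed values, scanning only as far as needed.
    allowed_i = rng.randrange(len(values) - 1)  # 0 .. len(values) - 2
    allowed = []
    for j, value in enumerate(values):
        if condition(value):
            if len(allowed) >= allowed_i:
                return value  # this is the allowed_i-th allowed value
            allowed.append(j)
    # Fewer than allowed_i + 1 values passed: choose among those that did.
    if not allowed:
        raise ValueError("no values satisfy the filter")
    return values[rng.choice(allowed)]

rng = random.Random(0)
counts = Counter(
    filtered_choice(range(10), lambda x: x % 2 == 0, rng) for _ in range(50_000)
)
# Each of the five even values should appear roughly 10,000 times.
print(sorted(counts.items()))
```

The per-value probability works out to 1/k for k allowed values: the speculative hit contributes 1/(n-1), and the fallback contributes (n-1-k)/((n-1)k), which sum to 1/k.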

@Zalathar
Contributor Author

Yeah, that sounds like a good idea.

After optimistic sampling fails, we would make n/2 predicate calls on average and n in the worst case, instead of always making n calls.

And I also like the suggestion of trying multiple optimistic samples, for parity with the usual behaviour of filter. That should be more efficient when acceptable values are dense.

@Zac-HD
Member

Zac-HD commented Mar 10, 2019

we would make n/2 predicate calls on average and n in the worst case, instead of always making n calls.

Sadly, our other tricks mean it's not quite this good: it's roughly n / (1 + allowed_frac) in the average case, approaching n/2 when almost everything is allowed, but approaching n when almost everything is disallowed.

IMO it's still worth doing though, as the impact in slow cases could still be very large and it's never worse by more than a small constant amount (from drawing allowed_i).
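That average-cost claim can be roughly checked with a standalone model of the stage-2 scan. Everything here is illustrative: `stage2_predicate_calls`, the densities, and the trial counts are invented, and plain `random` stands in for the real draw machinery.

```python
import random

def stage2_predicate_calls(n, allowed, rng):
    """Count condition calls made by the speculative scan over n values."""
    calls = 0
    allowed_i = rng.randrange(n - 1)  # speculative index, 0 .. n - 2
    found = 0
    for j in range(n):
        calls += 1  # one predicate call per value examined
        if j in allowed:
            if found >= allowed_i:
                return calls  # found the allowed_i-th allowed value early
            found += 1
    return calls  # scanned everything; would fall back to choice(allowed)

rng = random.Random(0)
n, trials = 100, 20_000

def average_calls(density):
    total = 0
    for _ in range(trials):
        allowed = set(rng.sample(range(n), int(n * density)))
        total += stage2_predicate_calls(n, allowed, rng)
    return total / trials

dense, sparse = average_calls(0.9), average_calls(0.1)
# Dense filters stop well short of n calls; sparse ones approach n.
print(dense, sparse)
```

With 90% of values allowed the scan averages a bit over n/2 calls, while with 10% allowed it almost always exhausts the list, matching the n / (1 + allowed_frac) intuition.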

@DRMacIver
Member

However, I'd strongly prefer to build this in à la option three, for several reasons:

Yeah, I'm with Zac on this one. Option 3 is the one that commits us to the least publicly - we can always add one of the two options later if we feel their lack.

(I'm trying to avoid doing too much code at the weekend right now, so I haven't really looked at the code at all. Will try to get on top of my review backlog this coming week and I'll take a look at this then)

@Zac-HD
Member

Zac-HD commented Mar 10, 2019

I'm trying to avoid doing too much code at the weekend right now

Shoo! Go do something analogue, we'll see you next week 😁

@Zalathar
Contributor Author

I already have plenty of things to change based on feedback so far, so there's no rush for extra review of what's here already.

And option 3 seems like the way to go, so I'll drop the current user-visible stuff.

@Zalathar
Contributor Author

I thought I would be able to just override the filter method on SampledFromStrategy, but sadly things aren't that simple.

User-visible strategies get wrapped in a LazyStrategy, and it's that class's (inherited) filter implementation that gets called in practice, completely bypassing any sampling-specific filter implementation.

I'll have to find some reasonable way to resolve that situation.

@DRMacIver
Member

User-visible strategies get wrapped in a LazyStrategy, and it's that class's (inherited) filter implementation that gets called in practice, completely bypassing any sampling-specific filter implementation.

You could make it so that LazyStrategy delegates the implementation of its filter method to its wrapped strategy.

@Zac-HD
Member

Zac-HD commented Mar 11, 2019

Hmm, I would have tried teaching FilteredStrategy.do_draw about SampledFromStrategy. I guess there are lots of inelegant ways to plumb in special cases 😆

@Zalathar
Contributor Author

You could make it so that LazyStrategy delegates the implementation of its filter method to its wrapped strategy.

This would fix the immediate problem, but wouldn't it also undermine laziness, by forcing LazyStrategy to create its underlying strategy as soon as a filter is applied?

@DRMacIver
Member

This would fix the immediate problem, but wouldn't it also undermine laziness, by forcing LazyStrategy to create its underlying strategy as soon as a filter is applied?

Hmm. Yeah. I guess you could create a LazyFilteredStrategy class that on first draw forces the lazy strategy and then applies the filter to that? Would allow you to leave the repr unchanged so that it still shows the original nice filtered repr too.

(Our API has some really weird constraints)
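A minimal toy of the LazyFilteredStrategy idea might look like the sketch below. All of these classes are invented stand-ins for illustration, not Hypothesis's real SearchStrategy/LazyStrategy internals.

```python
import random

class Strategy:
    def filter(self, condition):
        return FilteredStrategy(self, condition)

class FilteredStrategy(Strategy):
    """Generic rejection-sampling filter."""
    def __init__(self, base, condition):
        self.base, self.condition = base, condition
    def do_draw(self, rng):
        while True:
            value = self.base.do_draw(rng)
            if self.condition(value):
                return value

class SampledFrom(Strategy):
    def __init__(self, values):
        self.values = list(values)
    def filter(self, condition):
        # Specialised override, standing in for this PR's efficient
        # filtered sampling.
        return SampledFrom(v for v in self.values if condition(v))
    def do_draw(self, rng):
        return rng.choice(self.values)

class LazyStrategy(Strategy):
    """Defers building the underlying strategy until it is needed."""
    def __init__(self, build):
        self.build, self._wrapped = build, None
    def force(self):
        if self._wrapped is None:
            self._wrapped = self.build()
        return self._wrapped
    def do_draw(self, rng):
        return self.force().do_draw(rng)

class LazyFilteredStrategy(Strategy):
    """Stays lazy until the first draw, then delegates to the wrapped
    strategy's (possibly specialised) filter implementation."""
    def __init__(self, lazy, condition):
        self.lazy, self.condition = lazy, condition
        self._filtered = None
    def do_draw(self, rng):
        if self._filtered is None:
            self._filtered = self.lazy.force().filter(self.condition)
        return self._filtered.do_draw(rng)

rng = random.Random(0)
strategy = LazyFilteredStrategy(
    LazyStrategy(lambda: SampledFrom(range(10))), lambda x: x % 2 == 0
)
first = strategy.do_draw(rng)
# The specialised SampledFrom.filter is reached despite the lazy wrapper:
print(type(strategy._filtered).__name__, first)
```

Since the repr is computed on the outer object, it could keep showing the original filtered form even after the wrapped strategy is forced, as suggested above.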

@Zac-HD Zac-HD added performance go faster! use less memory! and removed new-feature entirely novel capabilities or strategies labels Mar 18, 2019
@Zalathar
Contributor Author

Closing this for now because it requires some fiddly front-end work to deal with lazy wrappers, and that's currently not a priority for me.

If I decide to have another go at this, I'll re-open or re-file.
