
[WIP] Efficient strategy for filtered sampling #1862

Closed
wants to merge 4 commits

Conversation

Zalathar
Contributor

@Zalathar Zalathar commented Mar 9, 2019

As foreshadowed by #1857 (comment), this PR takes the shrinker-friendly filtered choice used for rule selection in stateful tests, and generalizes it into a user-visible strategy.

Given a list of values and a filter function, this strategy chooses one of those values that satisfies the filter. This is done in a way that is efficient regardless of whether the allowed values are dense or sparse, and that isn't over-sensitive to earlier draws that influence the filter.


The biggest open question is what the user-visible API should look like. I see three possibilities:

  1. Expose a separate user-visible strategy.
  2. Add a special filtering argument to sampled_from.
  3. Add an implicit behaviour to sampled_from(...).filter(...).

Currently I've gone with (1), partly because I'm leaning in that direction, and partly because it was the easiest to implement.

Since this is a public API, feedback on the strategy name and its arguments would also be good. I mostly wanted to choose something so that I could put this PR up for review.

(I haven't written up any documentation or release notes yet, because I want to wait until after the API is nailed down. I would also like to add some more tests, if I can come up with good ones.)

@Zalathar Zalathar added opinions-sought tell us what you think about these ones! new-feature entirely novel capabilities or strategies labels Mar 9, 2019
@Zalathar
Contributor Author

Minor open question: How should this feature interact (or not interact) with enum.Flag, which has special support in sampled_from?

@Zac-HD
Member

Zac-HD commented Mar 10, 2019

Nice! It looks pretty good, and I really like the idea of making filtered-choice accessible to users. However, I'd strongly prefer to build this in à la option three, for several reasons:

  • Adding a special optimised strategy means that the current and obvious way to do this, sampled_from(...).filter(...), would no longer be the right way to do it.
  • Smaller, composition-focused APIs are much easier for us to maintain, extend, and teach.
  • Adding a special strategy later is backwards-compatible; taking one out is much harder.

All-up, that would make this PR a patch release that does not affect the public API: e.g. "This patch improves the internal implementation of sampled_from(...).filter(...) strategies. Generation may be somewhat faster or slower depending on the specific inputs, but shrinking should be much faster."

@Zac-HD
Member

Zac-HD commented Mar 10, 2019

I also think there's a tactical optimisation worth making here:

# Compute every allowed index up front, excluding the known-bad index i.
allowed = [j for (j, v) in enumerate(values) if j != i and condition(v)]
if not allowed:
    data.mark_invalid()
# Choose uniformly among the allowed indices, then force the chosen
# index into the buffer so that shrinking can replay it directly.
j = choice(data, allowed)
data.stop_example()
data.draw_bits(block_length * 8, forced=j)
return values[j]

For large lists, or slow filter functions, calculating allowed could be very expensive. Maybe that's just life, but I think a three-stage approach would reduce our average number of filter calls at the cost of one additional integer_range:

# Optimistically choose a value, and hope that it passes the filter.  We make
# three attempts because it's still O(1) and we might have just been unlucky.
for _ in hrange(3):
    i = integer_range(data, 0, len(values) - 1)
    if condition(values[i]):
        data.stop_example()
        return values[i]

# If that didn't work, we speculatively choose an index into the (not yet
# computed) list of allowed values, then check at most that many candidates.
# If the speculative index is in range, we use the corresponding value
# directly; if not, we have by then built the full list of allowed indices
# and choose one of those instead.  Either way we write the chosen index to
# the buffer so we can "be lucky" when shrinking (see the strategy shrinking
# guide for details).
# Use `len(values) - 2` because we know at least one value is invalid,
# and sampling zero or one values is special-cased elsewhere.
allowed_i = integer_range(data, 0, len(values) - 2)
allowed = []
for j, value in enumerate(values):
    if condition(value):
        if len(allowed) >= allowed_i:
            # `value` is the allowed_i-th allowed value; keep this `j`.
            break
        allowed.append(j)
else:
    if not allowed:
        data.mark_invalid()
    j = choice(data, allowed)
data.stop_example()
data.draw_bits((len(values) - 1).bit_length(), forced=j)
return values[j]

Does that make sense? The sampling is still uniform because we're equally likely to choose any of the allowed indices in the first step, and if we don't get a valid index we're equally likely to get any of them in the second too.
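For what it's worth, the uniformity argument can be sanity-checked with a standalone simulation of the two-stage scheme. This is a sketch only: plain `random` stands in for Hypothesis's draw machinery, and `filtered_choice` is an invented name.

```python
import random
from collections import Counter

def filtered_choice(values, condition, rng):
    # Stage 1: up to three optimistic draws.
    for _ in range(3):
        v = values[rng.randrange(len(values))]
        if condition(v):
            return v
    # Stage 2: speculatively pick an index into the (not yet computed)
    # list of allowed values, scanning only as far as needed.
    allowed_i = rng.randrange(len(values) - 1)  # 0 .. len(values) - 2
    allowed = []
    for j, value in enumerate(values):
        if condition(value):
            if len(allowed) >= allowed_i:
                return value  # this is the allowed_i-th allowed value
            allowed.append(j)
    # Fewer than allowed_i + 1 values passed: choose among those that did.
    if not allowed:
        raise ValueError("no values satisfy the filter")
    return values[rng.choice(allowed)]

rng = random.Random(0)
counts = Counter(
    filtered_choice(range(10), lambda x: x % 2 == 0, rng) for _ in range(50_000)
)
# Each of the five even values should appear roughly 10,000 times.
print(sorted(counts.items()))
```

The per-value probability works out to 1/k for k allowed values: the speculative hit contributes 1/(n-1), and the fallback contributes (n-1-k)/((n-1)k), which sum to 1/k.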

@Zalathar
Contributor Author

Yeah, that sounds like a good idea.

After optimistic sampling fails, we would make n/2 predicate calls on average and n in the worst case, instead of always making n calls.

And I also like the suggestion of trying multiple optimistic samples, for parity with the usual behaviour of filter. That should be more efficient when acceptable values are dense.

@Zac-HD
Member

Zac-HD commented Mar 10, 2019

we would make n/2 predicate calls on average and n in the worst case, instead of always making n calls.

Sadly, our other tricks mean it's not quite this good: it's roughly n / (1 + allowed_frac) in the average case, approaching n/2 when almost everything is allowed, but approaching n when almost everything is disallowed.

IMO it's still worth doing though, as the impact in slow cases could still be very large and it's never worse by more than a small constant amount (from drawing allowed_i).
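That average-cost claim can be roughly checked with a standalone model of the stage-2 scan. Everything here is illustrative: `stage2_predicate_calls`, the densities, and the trial counts are invented, and plain `random` stands in for the real draw machinery.

```python
import random

def stage2_predicate_calls(n, allowed, rng):
    """Count condition calls made by the speculative scan over n values."""
    calls = 0
    allowed_i = rng.randrange(n - 1)  # speculative index, 0 .. n - 2
    found = 0
    for j in range(n):
        calls += 1  # one predicate call per value examined
        if j in allowed:
            if found >= allowed_i:
                return calls  # found the allowed_i-th allowed value early
            found += 1
    return calls  # scanned everything; would fall back to choice(allowed)

rng = random.Random(0)
n, trials = 100, 20_000

def average_calls(density):
    total = 0
    for _ in range(trials):
        allowed = set(rng.sample(range(n), int(n * density)))
        total += stage2_predicate_calls(n, allowed, rng)
    return total / trials

dense, sparse = average_calls(0.9), average_calls(0.1)
# Dense filters stop well short of n calls; sparse ones approach n.
print(dense, sparse)
```

With 90% of values allowed the scan averages a bit over n/2 calls, while with 10% allowed it almost always exhausts the list, matching the n / (1 + allowed_frac) intuition.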

@DRMacIver
Member

However, I'd strongly prefer to build this in à la option three, for several reasons:

Yeah, I'm with Zac on this one. Option 3 is the one that commits us to the least publicly - we can always add one of the two options later if we feel their lack.

(I'm trying to avoid doing too much code at the weekend right now, so I haven't really looked at the code at all. Will try to get on top of my review backlog this coming week and I'll take a look at this then)

@Zac-HD
Member

Zac-HD commented Mar 10, 2019

I'm trying to avoid doing too much code at the weekend right now

Shoo! Go do something analogue, we'll see you next week 😁

@Zalathar
Contributor Author

I already have plenty of things to change based on feedback so far, so there's no rush for extra review of what's here already.

And option 3 seems like the way to go, so I'll drop the current user-visible stuff.

@Zalathar
Contributor Author

I thought I would be able to just override the filter method on SampledFromStrategy, but sadly things aren't that simple.

User-visible strategies get wrapped in a LazyStrategy, and it's that class's (inherited) filter implementation that gets called in practice, completely bypassing any sampling-specific filter implementation.

I'll have to find some reasonable way to resolve that situation.

@DRMacIver
Member

User-visible strategies get wrapped in a LazyStrategy, and it's that class's (inherited) filter implementation that gets called in practice, completely bypassing any sampling-specific filter implementation.

You could make it so that LazyStrategy delegates the implementation of its filter method to its wrapped strategy.

@Zac-HD
Member

Zac-HD commented Mar 11, 2019

Hmm, I would have tried teaching FilteredStrategy.do_draw about SampledFromStrategy. I guess there are lots of inelegant ways to plumb in special cases 😆

@Zalathar
Contributor Author

You could make it so that LazyStrategy delegates the implementation of its filter method to its wrapped strategy.

This would fix the immediate problem, but wouldn't it also undermine laziness, by forcing LazyStrategy to create its underlying strategy as soon as a filter is applied?

@DRMacIver
Member

This would fix the immediate problem, but wouldn't it also undermine laziness, by forcing LazyStrategy to create its underlying strategy as soon as a filter is applied?

Hmm. Yeah. I guess you could create a LazyFilteredStrategy class that on first draw forces the lazy strategy and then applies the filter to that? Would allow you to leave the repr unchanged so that it still shows the original nice filtered repr too.

(Our API has some really weird constraints)
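A minimal toy of the LazyFilteredStrategy idea might look like the sketch below. All of these classes are invented stand-ins for illustration, not Hypothesis's real SearchStrategy/LazyStrategy internals.

```python
import random

class Strategy:
    def filter(self, condition):
        return FilteredStrategy(self, condition)

class FilteredStrategy(Strategy):
    """Generic rejection-sampling filter."""
    def __init__(self, base, condition):
        self.base, self.condition = base, condition
    def do_draw(self, rng):
        while True:
            value = self.base.do_draw(rng)
            if self.condition(value):
                return value

class SampledFrom(Strategy):
    def __init__(self, values):
        self.values = list(values)
    def filter(self, condition):
        # Specialised override, standing in for this PR's efficient
        # filtered sampling.
        return SampledFrom(v for v in self.values if condition(v))
    def do_draw(self, rng):
        return rng.choice(self.values)

class LazyStrategy(Strategy):
    """Defers building the underlying strategy until it is needed."""
    def __init__(self, build):
        self.build, self._wrapped = build, None
    def force(self):
        if self._wrapped is None:
            self._wrapped = self.build()
        return self._wrapped
    def do_draw(self, rng):
        return self.force().do_draw(rng)

class LazyFilteredStrategy(Strategy):
    """Stays lazy until the first draw, then delegates to the wrapped
    strategy's (possibly specialised) filter implementation."""
    def __init__(self, lazy, condition):
        self.lazy, self.condition = lazy, condition
        self._filtered = None
    def do_draw(self, rng):
        if self._filtered is None:
            self._filtered = self.lazy.force().filter(self.condition)
        return self._filtered.do_draw(rng)

rng = random.Random(0)
strategy = LazyFilteredStrategy(
    LazyStrategy(lambda: SampledFrom(range(10))), lambda x: x % 2 == 0
)
first = strategy.do_draw(rng)
# The specialised SampledFrom.filter is reached despite the lazy wrapper:
print(type(strategy._filtered).__name__, first)
```

Since the repr is computed on the outer object, it could keep showing the original filtered form even after the wrapped strategy is forced, as suggested above.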

@Zac-HD Zac-HD added performance go faster! use less memory! and removed new-feature entirely novel capabilities or strategies labels Mar 18, 2019
@Zalathar
Contributor Author

Closing this for now because it requires some fiddly front-end work to deal with lazy wrappers, and that's currently not a priority for me.

If I decide to have another go at this, I'll re-open or re-file.
