Add guide to designing strategies for good shrinking behaviour. #1790

Merged
merged 6 commits into from
Mar 13, 2019
258 changes: 258 additions & 0 deletions guides/strategies-that-shrink.rst
@@ -0,0 +1,258 @@
===================================
Designing strategies to shrink well
===================================

Reducing test cases to a minimal example is a great feature of Hypothesis,
the implementation of which depends on both the shrinking engine and the
structure of the strategy (or combination of strategies) which created the
example to reduce.

This document is organised into three parts:

1. How to tell if you need to think about shrinking (you probably don't!)
2. Designing for shrinking 'above' the Hypothesis public API
3. Implementation tricks used in our internals, for interested contributors

It is written for people implementing complex third-party strategies (such
as `hypothesis-networkx <https://pypi.org/project/hypothesis-networkx/>`__),
current or potential contributors to Hypothesis itself, and anyone interested
in how this works under the hood.


------------------------------------
Do you need to design for shrinking?
------------------------------------
You should only attempt to tune custom strategies for better shrinking
behaviour if more time would otherwise be spent reducing examples by hand
or debugging more complex examples. It *may* be worthwhile if:

- Your custom strategy will be used by many people, so that spending
  the same effort tuning the strategy has much larger benefits, or
- You have personally spent time debugging failures which better example
  shrinking could have avoided and think this might happen again.

If neither of these applies to you, relax!  Hypothesis' test-case reduction
is among the best in the world, and our built-in strategies are carefully
designed to work well with it as discussed below.


------------------------------------
Shrinking for third-party strategies
------------------------------------

This section covers strategies built out of other strategies, all the way
down to Hypothesis' public API.  These often, but not always, use
``@composite``.


Composition of shrinking
~~~~~~~~~~~~~~~~~~~~~~~~
The first and most important rule is that Hypothesis shrinks from the
'bottom up'. If any component of your strategy is replaced with a simpler
example, the end result should also become simpler. We usually try to define
"simpler" here to match a reasonable intuition about the strategy, and avoid
weird edge cases when it's combined with another strategy or predicate.

`Issue #1076 <https://github.com/HypothesisWorks/hypothesis/issues/1076>`_,
where magnitude constraints were added to the ``complex_numbers`` strategy,
makes a nice case study. We wanted to continue shrinking the real and
imaginary parts like ``builds(complex, floats(), floats())``.

In a worst-case scenario, the performance of filtering could be arbitrarily
bad, while a 'generate and scale' approach would mean that simple inputs
could lead to irrational outputs.  Instead, we choose an imaginary part
between +/- ``max_magnitude``, then calculate the resulting bounds on the real
part and draw it from a strategy that will always be valid. This ensures
that the imaginary part shrinks to zero first, as we think real-valued
complex numbers are simpler than imaginary-valued complex numbers.
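
As a rough illustration of that 'draw one part, then bound the other' idea,
here is a minimal sketch using only the public API.  The name
``bounded_complex`` is hypothetical and this is not the code used by
``complex_numbers``; it also assumes a modest ``max_magnitude`` (so that
squaring it does not overflow) and tolerates tiny floating-point rounding in
the final magnitude.

```python
import math

from hypothesis import strategies as st


@st.composite
def bounded_complex(draw, max_magnitude):
    """Hypothetical sketch: complex numbers with abs(z) <= max_magnitude."""
    # Draw the imaginary part first, from its full allowed range.
    imag = draw(st.floats(-max_magnitude, max_magnitude))
    # The real part must satisfy real**2 + imag**2 <= max_magnitude**2.
    real_bound = math.sqrt(max_magnitude**2 - imag**2)
    # Every value in this range is valid, so no filtering is needed and
    # each part can shrink towards zero like an ordinary bounded float.
    real = draw(st.floats(-real_bound, real_bound))
    return complex(real, imag)
```

Because each part is an ordinary ``floats()`` draw, replacing either draw
with a simpler one still gives a valid, simpler complex number.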


Let generation be lucky
~~~~~~~~~~~~~~~~~~~~~~~
Sometimes, it's worth searching for a particularly nasty value to try.
This trick should be used sparingly, and always behind a branch that the
shrinker can decide not to take, such as ``if draw(booleans()):``, but it is
occasionally worth trying.  Measure the results before you keep it!

`Issue #69 <https://github.com/HypothesisWorks/hypothesis/issues/69>`_ provides
a nice case study: when generating tz-aware datetimes, we would like to generate
instants that are skipped or repeated due to a daylight-savings transition more
often than by chance. Of course, there may or may not be any such moments
allowed by the bounds and tz strategy!

Eliding much of the detail, a key part is to find such a moment between two
endpoints, when we can only check whether one or more exists. The traditional
approach would be to use a binary search, but this would be relatively expensive
to shrink as we would pay the log-n cost on every attempted shrink.

Instead of choosing the midpoint, we draw a *random* point between our known
endpoints, and repeat this until we find a satisfactory moment. This allows
the shrinker to delete all the intermediate draws - and appear lucky enough
to find the moment we were looking for on the first guess!
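
The random-point search can be sketched in plain Python.  Everything here is
a toy under stated assumptions: ``ChoiceSource`` is a hypothetical stand-in
for ``draw`` that records its choices, ``offset`` is a made-up step function
playing the role of a UTC offset, and this is not the datetimes
implementation.

```python
import random


class ChoiceSource:
    """Toy stand-in for ``draw``: replays recorded choices, then goes random."""

    def __init__(self, seed=0, replay=()):
        self.rng = random.Random(seed)
        self.replay = list(replay)
        self.choices = []  # every value handed out, in order

    def draw_int(self, lo, hi):
        value = self.replay.pop(0) if self.replay else self.rng.randint(lo, hi)
        self.choices.append(value)
        return value


def find_transition(source, lo, hi, offset):
    """Find t in (lo, hi] with offset(t - 1) != offset(t).

    Assumes offset(lo) != offset(hi), so at least one transition exists.
    """
    while True:
        guess = source.draw_int(lo + 1, hi)  # a random point, not the midpoint
        if offset(guess - 1) != offset(guess):
            return guess  # this guess is itself a transition
        if offset(lo) != offset(guess):
            hi = guess - 1  # a transition lies in (lo, guess - 1]
        else:
            lo = guess  # a transition lies in (guess, hi]


def offset(t):
    # Hypothetical offset function with a single transition at t == 700.
    return 0 if t < 700 else 1


first = ChoiceSource(seed=0)
found = find_transition(first, 0, 1000, offset)

# Simulate the shrinker deleting every draw except the last one.
shrunk = ChoiceSource(replay=first.choices[-1:])
assert find_transition(shrunk, 0, 1000, offset) == found == 700
assert shrunk.choices == [700]
```

The last recorded draw is always the successful guess, and it is valid for
the original endpoints, so the replay with all earlier draws deleted succeeds
on its first guess.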

Review comment (Member): I'd actually forgotten this trick. It's a good trick.



Keep things local
~~~~~~~~~~~~~~~~~
Hypothesis' shrinking engine sees every example as a labelled tree of choices,
with possible reductions represented as operations on the tree. An attempted
shrink succeeds if the new tree can be converted into an example, and the
resulting example triggers the same bug in the test function.

The most common way we see users breaking data locality is by drawing a size,
then drawing a collection of that size. This is tempting because it's simple
and it *works*, but it's often much slower than the alternatives.

.. code:: python

    from hypothesis import strategies as st

    # Both of these strategies can generate exactly the same kind of examples,
    # but the second has better performance as well as style.
    st.integers(0, 10).flatmap(lambda n: st.lists(..., min_size=n, max_size=n))
    st.lists(..., min_size=0, max_size=10)

Another easy way to keep things local is to ensure that any ``.filter(...)``
or ``assume(...)`` calls you use are as close as possible to the relevant part
of the strategy. That way, Hypothesis can retry just the part that failed
instead of the entire strategy, which might be much slower.
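
A small sketch of the difference, using lists of even integers as a made-up
example (the variable names are hypothetical):

```python
from hypothesis import strategies as st


def is_even(n):
    return n % 2 == 0


# Filter far from the cause: one odd element rejects the whole list,
# so Hypothesis has to retry the entire collection.
even_lists_global = st.lists(st.integers()).filter(lambda xs: all(map(is_even, xs)))

# Filter next to the cause: only the offending integer is retried,
# and the rest of the list is left untouched.
even_lists_local = st.lists(st.integers().filter(is_even))
```

Both strategies describe the same set of lists; the second simply keeps the
rejection next to the draw that caused it.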

For efficient shrinking, local operations on the tree should correspond with
valid (and preferably local) shrinks to the final example. For example:

.. code:: python

    # This form of loop is hard to shrink, because we'd have to reduce `n` and
    # delete something in the loop simultaneously.  It's equivalent to the
    # `.flatmap` example above.  We _do_ shrink this, but much more slowly.
    n = draw(integers(0, 10))
    for _ in range(n):
        ...
        draw(...)
        ...

    # In this form, the shrinker can see a repeated structure of labels
    # and delete one loop iteration without touching anything else.
    # We use a variant of this trick to generate collections internally!
    while draw(integers(0, x)) > threshold:
        ...
        draw(...)
        ...

Similarly, it's better to draw all the attributes or inputs you need for an
object at the same time, again so they can be modified or deleted together.
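
Putting the loop form and the 'draw everything together' advice into one
runnable sketch: the ``users`` strategy and its ``(name, age)`` records are
hypothetical, and the continuation range ``integers(0, 9)`` is an arbitrary
choice for illustration.

```python
from hypothesis import strategies as st


@st.composite
def users(draw):
    """Hypothetical example: a list of (name, age) records."""
    result = []
    # Each iteration begins with its own "keep going?" draw, so deleting
    # one iteration's draws leaves a valid, shorter sequence of choices.
    while draw(st.integers(0, 9)) > 0:
        # Draw everything this record needs in one place, so the draws
        # sit next to each other and can be changed or deleted together.
        name = draw(st.text(min_size=1))
        age = draw(st.integers(0, 120))
        result.append((name, age))
    return result
```

In ordinary code ``st.lists(st.tuples(...))`` would be the simpler choice;
the explicit loop is only worthwhile when later draws depend on earlier ones.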

The exact behaviour of the shrinking is a topic of active research and
development, so if you are interested in the details we recommend reading
the "internals.rst" guide and the well-commented source code in
``hypothesis.internal.conjecture``. An earlier (mid-2018) version is
illustrated in David's draft paper *Test-Case Reduction for Free*,
along with an extensive evaluation. Contact him if you would like a copy.


-------------------------------------
Shrinking in the Hypothesis internals
-------------------------------------
The last section is for current or prospective Hypothesis contributors only.

These tricks rely on implementation details that are not available to
third-party libraries or users, **and can change in any patch release**.
Occasionally they are also indispensable to get good performance in underlying
primitives, so please contact us if the public API is not enough and we may
be able to work something out.


What do internals get you?
~~~~~~~~~~~~~~~~~~~~~~~~~~
Using the low-level, internal APIs complements, rather than changes, the
principles above. The bytestream-level view has some important advantages:

Because we operate at the level of bits, the relationship between a value and
the corresponding buffer is much more obvious. If we're careful, that means
we can calculate the value we want and then write the corresponding buffer
to recreate it when the test case is shrunk or replayed.

A small step up from bits, we can also see the spans that indicate a subset
of the buffer to consider for various transformations such as transposition
or deletion.

Sometimes these features are the only way to maintain acceptable performance
in very rare or even pathological cases - consider shrinking a complex number
with a single allowed magnitude - but it's almost certain that someone will
need the core strategies to do just that.
However, using low-level APIs also comes at a cost - they are verbose and
generally more difficult to use, and can violate key invariants of the engine
if misused.

Internally, our strategies mostly use the public API or something that looks
a lot like ``@composite``, so it's fairly easy to follow along. There are
just a few tricks enabled by those low-level advantages that we wanted to
name and document, so we can recognise them, discuss them, and invent more...


Make your own luck
~~~~~~~~~~~~~~~~~~
This is the simplest trick that uses our ability to write choices to the
buffer. We use it in stateful testing, where there may be many rules but only
a few of them allowed by their preconditions, and "lucky generation" would
work but be very inefficient.

1. Draw an index into the unfiltered list of rules.  Return the corresponding
   rule if it's allowed - we got lucky! (or someone set us up...)
2. Create a list of allowed rules, and choose one from that shortlist instead.
3. Find the index of the chosen rule *in the unfiltered list*, and write that
   index to the buffer.  Finally, return the chosen rule.

When the shrinker tries to delete the first two draws, the resulting buffer
will lead to the same rule being chosen at step *one* instead. We've made
our own luck!
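
The three steps can be simulated with a toy buffer.  ``ToyData``, its
``draw_index`` and ``write`` methods, and the rule names are all hypothetical
stand-ins written for this sketch; they are not Hypothesis' internal API,
which may look quite different.  The sketch also assumes at least one rule is
allowed.

```python
import random


class ToyData:
    """Toy stand-in for the choice buffer; not Hypothesis' real internals."""

    def __init__(self, prefix=(), seed=0):
        self.prefix = list(prefix)  # choices to replay, e.g. a shrunk buffer
        self.rng = random.Random(seed)
        self.buffer = []  # choices recorded during this run

    def draw_index(self, n):
        raw = self.prefix.pop(0) if self.prefix else self.rng.randrange(n)
        value = raw % n  # keep replayed values in range
        self.buffer.append(value)
        return value

    def write(self, value):
        # Force a value into the buffer.  During replay this consumes
        # (and ignores) one replayed slot, keeping the positions aligned.
        if self.prefix:
            self.prefix.pop(0)
        self.buffer.append(value)


def choose_rule(data, rules, is_allowed):
    # Step 1: try our luck with the unfiltered list.
    rule = rules[data.draw_index(len(rules))]
    if is_allowed(rule):
        return rule
    # Step 2: fall back to choosing from the allowed shortlist.
    shortlist = [r for r in rules if is_allowed(r)]
    chosen = shortlist[data.draw_index(len(shortlist))]
    # Step 3: record the index that *would* have been lucky in step 1.
    data.write(rules.index(chosen))
    return chosen


rules = ["push", "pop", "peek", "clear"]


def is_allowed(rule):
    return rule == "peek"


unlucky = ToyData(prefix=[0, 0])
assert choose_rule(unlucky, rules, is_allowed) == "peek"
assert unlucky.buffer == [0, 0, 2]

# Simulate the shrinker deleting the first two draws.
lucky = ToyData(prefix=unlucky.buffer[2:])
assert choose_rule(lucky, rules, is_allowed) == "peek"
assert lucky.buffer == [2]
```

The shortened buffer replays to the same rule through step one alone, which
is exactly the 'luck' that the written index manufactures.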

This trick is especially useful when we want to avoid rejection sampling
(the ``.filter`` method, ``assume``) for performance reasons, but also
need to give the shrinker the same low-level representation for each instance
of a repeated choice.


Flags "shrink open"
~~~~~~~~~~~~~~~~~~~
An important insight from `Swarm Testing (PDF) <https://www.cs.utah.edu/~regehr/papers/swarm12.pdf>`__
is that randomly disabling some features can actually reduce the expected time
before finding a bug, because some bugs may be suppressed by otherwise common
features or attributes of the data.

As discussed on `issue #1401 <https://github.com/HypothesisWorks/hypothesis/issues/1401>`__,
there are a few points to keep in mind when implementing shrinkable swarm testing:

- You need swarm flags to "shrink open" so that once the shrinker has run to
  completion, all flags are enabled.  For example, you could do this by
  generating a set of banned flags.
- You need to use rejection sampling rather than anything more clever, or at
  least look like it to the shrinker.  (See e.g. *Make your own luck*, above.)

Taking Unicode as an example, we'd like to use our knowledge of Unicode
categories to generate more complex examples, but shrink the generated string
without reference to categories. While we haven't actually implemented this
yet - it's pretty hairy - the simple version of the idea goes like this:

1. Generate a set of banned categories.
2. Use ``characters().filter(category_is_not_banned)``

When shrinking, we start by removing categories from the banned set, after
which characters in the string can be reduced as usual. In a serious version,
the make-your-own-luck approach would be essential to make the filter
reasonably efficient, but that's not a problem internally.
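
The simple version can be written with the public API alone.  This is only a
sketch of the two steps above: ``swarm_text`` and the short ``CATEGORIES``
list are hypothetical, and the plain ``.filter`` is the naive, potentially
slow form rather than the make-your-own-luck variant.

```python
import unicodedata

from hypothesis import strategies as st

# A small, arbitrary selection of Unicode general categories to toggle.
CATEGORIES = ("Lu", "Ll", "Nd", "Po", "Zs")


@st.composite
def swarm_text(draw):
    """Hypothetical sketch: text with a random set of banned categories."""
    # Sets shrink towards the empty set, so the flags "shrink open":
    # a fully shrunk example has nothing banned.
    banned = draw(st.sets(st.sampled_from(CATEGORIES)))

    def category_is_not_banned(char):
        return unicodedata.category(char) not in banned

    return draw(st.text(st.characters().filter(category_is_not_banned)))
```

Generating the *banned* set, rather than the allowed one, is the design
choice that matters here: the simplest set then corresponds to every
category being available.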

In more complicated structures, it would be nice to generate the flags on first
use rather than up front before we know if we need them. The trick there is
to write each flag to the buffer every time we check it, in such a way that if
we delete the first use the second turns into an initialisation.


Explicit example boundaries
~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is almost always handled implicitly, e.g. by ``cu.many``, but *sometimes*
it can be useful to explicitly insert boundaries around draws that should be
deleted simultaneously using ``data.start_example``. This is used to group
the value and sign of floating-point numbers, for example, which we split up
in order to provide a more natural shrinking order.

Explicit example management can also be useful to delineate variably-sized
draws, such as our internal helper ``cu.biased_coin``, which makes eliminating
dead bytes much cheaper. Finally, labelling otherwise indistinguishable draws
means the shrinker can attempt to swap only the like values.