
WIP: PoC of meta strategy, which draws others by a given weight distribution #1734

Closed

Conversation

@kxepal (Member) commented Jan 3, 2019

Happy New Year! 🎅

IIRC, there was an issue about a similar problem, but I couldn't find it to reference here.

Long story short: a long time ago there was a feature request to generate data with a given distribution. I wanted this feature as well and finally found some time to figure something out. Instead of generating data with some distribution, I decided to make things simpler and just call specific strategies with a given weight distribution, letting them handle the generation work.
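
Conceptually, the idea is roughly equivalent to the sketch below. This is only an illustration built from public APIs - not how this commit actually implements it (the real implementation uses Hypothesis internals):

from hypothesis import strategies as st

def distributed(strategies, weights):
    # Conceptual sketch only: let a Hypothesis-provided Random pick a
    # sub-strategy according to the weights, then draw a value from it.
    return st.randoms().flatmap(
        lambda r: r.choices(strategies, weights=weights)[0]
    )

# Roughly 1/6, 2/6 and 3/6 of the drawn values come from each sub-strategy.
example = distributed([st.just("a"), st.just("b"), st.just("c")], weights=[1, 2, 3])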

However, I'm not sure about a lot of things; here is the list:

  1. Did I reinvent the wheel?
  2. The current implementation feels too naive and simple. I'm probably missing something obvious, but I'm not sure what (apart from the next point).
  3. How do we write good tests for such strategies? E.g. how can we verify that the distribution is correct without generating huge amounts of data just to reach a statistically significant result?
  4. I noticed that conjecture.Sampler produces notably different results from the builtin random.choices (3.6+) function. Maybe statistically everything is fine, but I often saw elements with a lower weight drawn more often than elements with a higher one (in other words, it notably often behaves as if weight 0.1 > 0.5) - this statement is partially incorrect, see my comment below. Is there something to worry about? Or maybe it's just not suitable for my goals?
  5. Finally, do we need this kind of strategy out of the box?

So opening this PR for discussion and feedback.

@Zac-HD added the opinions-sought label on Jan 4, 2019
@Zac-HD requested a review from DRMacIver on January 4, 2019
@kxepal (Member Author) commented Jan 7, 2019

Regarding the distribution issue, I made a quick test to see the difference between random and Sampler. The test code is:

import random
from collections import Counter, defaultdict

from hypothesis import given, settings, strategies as st

max_iterations = 1000
max_examples = 1000
values = list(range(5))
weights = [0.1, 0.3, 0.5, 0.7, 1.0]

# Per-value counts collected across iterations, for both generators.
hypothesis_results = defaultdict(list)
random_results = defaultdict(list)

# Draw values via the st.distributed() strategy proposed in this PR.
@settings(max_examples=max_examples)
@given(st.distributed(list(map(st.just, values)), weights=weights))
def hypothesis_test(thing):
    hypothesis_counter[thing] += 1

# Draw values via the stdlib random.choices() with the same weights.
def random_test():
    for _ in range(max_examples):
        thing = random.choices(values, weights=weights)[0]
        random_counter[thing] += 1

for _ in range(max_iterations):
    hypothesis_counter = Counter()
    random_counter = Counter()

    hypothesis_test()
    random_test()

    for key, value in hypothesis_counter.items():
        hypothesis_results[key].append(value)

    for key, value in random_counter.items():
        random_results[key].append(value)

Which gives the following histograms:

Notice the distribution of values 3 and 4 on the second chart - it seems that the value with the lower weight (3 has weight 0.7) occurs more often than the value with the higher one (4 has weight 1.0). There are also interesting differences between the Sampler and random.choices distributions for value 2 (weight 0.3).

@Zac-HD (Member) commented Jan 8, 2019

Finally, do we need this kind of strategy out of the box?

IMO no, we don't - and providing it will tend to mislead users who already think that Hypothesis examples have some distribution. Per this note in the docs, they don't; instead the engine is trying to cause an exception and we'll do whatever seems to work for that.

It is better to think about the data Hypothesis generates as being arbitrary, rather than random. We deliberately generate any valid data that seems likely to cause errors, so you shouldn’t rely on any expected distribution of or relationships between generated data.

@kxepal (Member Author) commented Jan 8, 2019

All good points, and that's why I have so many doubts - especially about how to implement a good shrinker. Obviously, in my case it would be a shrink over the shrinks of all the provided strategies, but which one to pick in the end? Still at random? Bad idea.

My only use case right now is data science: to test my algorithms I need data that is expected to have certain proportions - things with a particular distribution - for them to work correctly. Say, if I train my model on data where the ratio of clicks to shows is 1 to 10, the resulting score should be no less than 0.75. I still don't completely understand this magic, but so it is.

@Zac-HD (Member) commented Jan 8, 2019

What size of data? 10 values, 1000, more? If you're looking at statistical rather than absolute properties of medium-to-large datasets I'd go for traditional random inputs over Hypothesis for performance reasons.

If generating with Hypothesis, I'd use a distribution-free strategy and pretend it's uniform-random. Then map a transform into my desired distribution and filter by checking that the distribution really is OK. The main trick is to ensure that as the input data is shrunk (smaller collection, numbers to zero, etc), the transformed output stays valid and shrinks along a similar and sensible gradient.
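
For example, something like this sketch - the exponential transform and the bound in the filter are arbitrary choices just to illustrate the shape of the approach:

import math
from hypothesis import strategies as st

def to_exponential(xs):
    # Pretend the draws are uniform-random and push them through the
    # inverse CDF of an exponential distribution.
    return [-math.log(1.0 - x) for x in xs]

dataset = (
    st.lists(st.floats(min_value=0.0, max_value=0.999), min_size=10, max_size=1000)
    .map(to_exponential)
    .filter(lambda xs: sum(xs) / len(xs) < 5.0)  # crude "distribution looks OK" check
)

As the input floats shrink towards 0.0, the transformed values also shrink towards 0.0, so the output keeps shrinking along a sensible gradient.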

@kxepal (Member Author) commented Jan 8, 2019

What size of data? 10 values, 1000, more?

Or more, maybe even 10K+; otherwise it makes no sense. The alternative is to store some predefined CSV dump files which hold the right data with the right values, where no one is allowed to change those files without a great reason and the blessing of the data chief - which all looks horrible. There is an edge case where data with a notable distribution property has to be tested against a result with certain properties - it looks like a case for a property-testing framework like Hypothesis, but there are good questions about whether it really is. All those doubts again.

@Zac-HD (Member) commented Jan 8, 2019

For 10K values, Hypothesis will be slower to generate than the stdlib random module, and unless the data size is variable or most values can be zeroed it probably won't shrink well either 😕

Without knowing any detail at all, I'd write a function that takes a seed and generates a valid dataset for you. Then you can use pytest.mark.parametrize to supply however many seeds you like, and just increase the number to test more thoroughly. e.g.

import random

def make_example_data(seed=0):
    assert isinstance(seed, int)
    r = random.Random(seed)  # seeded so each dataset is reproducible
    data = [r.random() for _ in range(10 ** 4)]  # or whatever distribution...
    return data

You can even use this with Hypothesis: just combine integers().map(make_example_data) with whatever other inputs you have.
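
For instance (sketch only - my_model_score below is a stand-in for whatever property you actually check):

import pytest
from hypothesis import given, strategies as st

# Plain pytest: bump the range to test more seeds.
@pytest.mark.parametrize("seed", range(20))
def test_score_parametrized(seed):
    data = make_example_data(seed)
    assert my_model_score(data) >= 0.75

# Or let Hypothesis pick (and shrink) the seed.
@given(st.integers(min_value=0).map(make_example_data))
def test_score_hypothesis(data):
    assert my_model_score(data) >= 0.75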

@kxepal (Member Author) commented Jan 8, 2019

@Zac-HD
Hm, I didn't think about that. Thanks for the idea! I'll try it.

@Zac-HD (Member) commented Jan 10, 2019

I'll try it.

Is it working? The suspense is killing me here 😁

Either way, I've been thinking about it and I don't think we can support both explicit distributions (as public API) and also do the kind of exploration proposed in #1637, which is all about trying to tune up the frequency of edge cases. I'm therefore inclined to close this PR 😞

@kxepal (Member Author) commented Jan 10, 2019

Is it working? The suspense is killing me here

Sorry, I haven't tried it yet, but you gave me a few ideas and I think this can be closed. I was probably too focused on solving the problem from the wrong side (:

@kxepal closed this on Jan 10, 2019
@kxepal (Member Author) commented Jan 10, 2019

@Zac-HD
The only remaining question is about the Sampler distribution results. They look suspicious to me, since IIRC this class is used in the integer strategies. Anything to worry about?
