WIP: PoC of meta strategy, which draws others by a given weight distribution #1734
Conversation
IMO no, we don't - and providing it will tend to mislead users who already think that Hypothesis examples have some distribution. Per this note in the docs, they don't; instead the engine is trying to cause an exception and we'll do whatever seems to work for that.
All good points, and that's why I'm so in doubt. Especially about how to implement a good shrinker. Obviously, in my case it would be a shrink across all the provided strategies, but which one to pick in the end? Still at random? Bad idea. My only use case right now is that I'm doing data science, and to test my algorithms I need to run them against data which is expected to have certain proportions, i.e. some special distribution, to work correctly. Say, while I train my model on data where the clicks-to-shows distribution is 1 to 10, the resulting score should be no less than 0.75. I still don't completely understand this magic, but so it is.
What size of data? 10 values, 1000, more? If you're looking at statistical rather than absolute properties of medium-to-large datasets I'd go for traditional random inputs over Hypothesis for performance reasons. If generating with Hypothesis, I'd use a distribution-free strategy and pretend it's uniform-random. Then map a transform into my desired distribution and filter by checking the distribution really is OK. The main trick is to ensure that as the input data is shrunk (smaller collection, numbers to zero, etc), the transformed output should stay valid and shrink along a similar and sensible gradient.
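The map-then-filter recipe above can be sketched with the stdlib alone. The exponential transform here is an arbitrary choice of inverse-CDF map, picked only to illustrate the shrinking argument: a uniform input near zero maps to an output near zero, so shrinking the input shrinks the output along a sensible gradient.

```python
import math
import random

def to_exponential(u, rate=1.0):
    """Inverse-CDF transform: uniform u in [0, 1) becomes Exponential(rate).
    u == 0 maps to 0, so shrinking inputs toward zero shrinks outputs too."""
    return -math.log(1.0 - u) / rate

def distribution_ok(xs, rate=1.0, tol=0.1):
    """Crude filter: accept only samples whose mean is near the expected 1/rate."""
    return bool(xs) and abs(sum(xs) / len(xs) - 1.0 / rate) < tol

rng = random.Random(42)
sample = [to_exponential(rng.random()) for _ in range(10_000)]
```

With Hypothesis, the same transform and predicate would go into a strategy's `.map(...)` and `.filter(...)` respectively.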
Or more, maybe even 10K+, otherwise they make no sense. The alternative is to store some predefined CSV dump files which hold the right data with the right values, where no one is allowed to change those files without a great reason and the blessing of a data chief - which all looks horrible. There is an edge case where data with a notable distribution property has to be tested against a result with certain properties - it looks like a case for a property-testing framework like Hypothesis, but there are good questions around whether it really is. All those doubts again.
For 10K values, Hypothesis will be slower to generate than the stdlib. Without knowing any detail at all, I'd write a function that takes a seed and generates a valid dataset for you:

```python
import random

def make_example_data(seed=0):
    assert isinstance(seed, int)
    r = random.Random(seed)
    data = [r.random() for _ in range(10 ** 4)]  # or whatever distribution...
    return data
```

You can even use this with Hypothesis: just combine …
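The cut-off suggestion above presumably means mapping a strategy for seeds through the helper. A minimal sketch, assuming the `hypothesis` package is installed (the dataset is cut to 100 values only to keep the demo fast):

```python
import random

from hypothesis import given, settings, strategies as st

def make_example_data(seed=0):
    assert isinstance(seed, int)
    r = random.Random(seed)
    return [r.random() for _ in range(100)]  # 10 ** 4 in the real case

# Hypothesis explores seeds; the helper does the actual generation,
# so shrinking happens on one integer seed rather than on 10K floats.
datasets = st.integers(min_value=0).map(make_example_data)

@given(datasets)
@settings(max_examples=10)
def test_statistical_property(data):
    mean = sum(data) / len(data)
    assert 0.0 <= mean <= 1.0  # placeholder for a real score check
```

The trade-off: shrinking a seed cannot produce a minimal dataset, only a different one, which is why this pattern suits statistical rather than absolute properties.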
@Zac-HD
Is it working? The suspense is killing me here 😁 Either way, I've been thinking about it and I don't think we can support both explicit distributions (as public API) and also do the kind of exploration proposed in #1637, which is all about trying to tune up the frequency of edge cases. I'm therefore inclined to close this PR 😞
Sorry, I haven't tried it yet, but you gave me a few ideas and I think this can be closed. I probably focused too much on solving the problem from the wrong side (:
@Zac-HD
Happy New Year! 🎅
IIRC, there was an issue about a similar problem, but I failed to find it to make a reference.
Long story short: a long time ago there was a feature request to generate data with a given distribution. I wanted this feature as well and finally found some time to figure something out. Instead of generating data with some distribution, I decided to make things simpler and just call specific strategies with a given distribution, letting them handle the generation work.
However, I'm not sure about a lot of things, and here is the list of them:

- The weighted draw relies on the `random.choices` (3.6+) function. Maybe statistically everything is fine, but I often noticed that elements with lesser weight may be drawn more often than elements with a higher one (in other words, it happens notably often when weight 0.1 > 0.5) - this statement is partially incorrect, see my commentary below. Is there something to worry about? Or maybe it's not suitable for my goals?

So I'm opening this PR for discussion and feedback.
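The core idea described above - a meta strategy that picks one of several sub-generators according to a weight distribution via `random.choices` - can be sketched with the stdlib alone. The generators and weights below are illustrative assumptions, not the PR's actual API:

```python
import random

def weighted_meta_draw(rng, generators, weights):
    """Pick one generator according to `weights`, then delegate to it.
    The meta strategy does no generation itself; it only chooses
    which sub-generator runs for this draw."""
    (chosen,) = rng.choices(generators, weights=weights, k=1)
    return chosen(rng)

rng = random.Random(0)
gens = [
    lambda r: r.randint(0, 9),       # a "small integers" generator
    lambda r: r.uniform(0.0, 1.0),   # a "unit floats" generator
]
values = [weighted_meta_draw(rng, gens, [0.9, 0.1]) for _ in range(1000)]
int_share = sum(isinstance(v, int) for v in values) / len(values)
```

Over many draws the observed share of each sub-generator should track its weight closely; large deviations (like the 0.1 > 0.5 effect mentioned above) would only be expected over small sample sizes.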