Experimentation MVP scope - 1.31.0 2/2 [Core analytics] #7418
Phrasing it differently, is the MVP we want to build something like: we have a user who wants to run experiments over a funnel only?
Judging from the above problems, it seems like UX/UI isn't going to be a big deal, so we can target building the functionality quickly. I feel that if we start from here, i.e. the subset of problems (the core) the MVP should solve, we can probably come up with a better/smaller implementation that targets just these problems. I personally don't know enough about how users are going to interact with experimentation to solidify an approach right now, hence I'd target this MVP at getting tangible real-world experimentation feedback. I'm spending a bit more time this morning looking into background material and what other companies do, to get a better idea of what to build. Thanks for the rec, Marcus! I'd also be curious to hear how you & Paolo have been using tools like these in the past, to better refine the problems we want to solve.

Edit: I've obviously glossed over a lot of the implementation complexity here (and short-circuited some of it, like manually choosing people for the experiment), but that's to come right after this^.
Offhand note inspired by checking out Statsig: when it comes to existing A/B testing platforms, we can be fundamentally different in how we approach things, because we already have the context needed to make decisions. For example, every stand-alone A/B testing platform has a step in the process where you define a target metric and choose how users are sorted into pass or fail based on some property (and then sync this with your implementation, so those users see the thing you expect them to see, outside of this platform). I think we can skip this completely, because our target metrics naturally come out of the insights users create, and how users are segregated comes naturally out of the defined FF. (This also has the added fun factor that users can see current values over time of exactly what they're going to optimise, because it's an insight graph.)

This implies that trying to copy most existing platforms is less useful for us than taking inspiration from the problems they've solved well and integrating that into our context (which has some other problems solved automagically). As Paolo mentioned elsewhere, Blaze does this well: they take the best of the platform they already have, and just introduce the few extra steps necessary to run an A/B test.
+1 to this consideration. We're going to need to consolidate or clarify what experiment flags are vs. regular feature flags, and whether we show them together or not; otherwise we'll end up with multiple places where you can do the same thing.
The most powerful moves here would be leveraging everything we have that A/B-testing-specific companies might not.
Thanks everyone for the feedback. There's a lot going on and I don't want this to become hard to follow, so I'll only tackle the most important points here. Please also see the updated initial description.
It took me a while to gather all the required information, but with regard to the actual test calculations, I think taking a frequentist approach (which has been the standard from roughly 2000-2016(?)) is the wrong way to go. Reasons why:
Supporting Information

So you don't have to fumble through everything the world knows about A/B testing for 3 days, here are some links & choice quotes. Probably the most famous: How not to run an A/B test (a quick simulation of the repeated-significance-testing problem it describes is sketched below).
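To make that concrete, here's a rough, self-contained simulation of the "peeking" problem that article describes. All numbers are made up for illustration: both variants share a true 5% conversion rate, and we check a two-proportion z-test every 500 visitors, stopping at the first p < 0.05. Despite there being no real difference, the experiment gets declared "significant" far more often than the nominal 5%.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

def peeking_experiment(true_rate=0.05, peek_every=500, max_n=10_000):
    """Simulate an A/A test and return True if any peek (wrongly) reaches p < 0.05."""
    a = rng.random(max_n) < true_rate  # variant A outcomes (converted or not)
    b = rng.random(max_n) < true_rate  # variant B outcomes, identical true rate
    for n in range(peek_every, max_n + 1, peek_every):
        if z_test_p_value(a[:n].sum(), n, b[:n].sum(), n) < 0.05:
            return True  # stopped early and declared a winner that doesn't exist
    return False

runs = 1_000
false_positives = sum(peeking_experiment() for _ in range(runs))
print(f"False-positive rate with peeking: {false_positives / runs:.1%}")  # well above the nominal 5%
```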
What to do then?

Go Bayesian. This is a very similar approach to what we did with correlations (likelihood odds over correlation score / confusion matrix). One of the most succinct explanations I've found: Frequentist vs Bayesian Approach to A/B Testing
Some big cutting-edge companies have moved to a Bayesian approach: VWO, Dynamic Yield. Effectively, this doesn't require us to control sample sizes: the longer the experiment runs, the better the results. And we get one number that literally everyone can understand: what's the probability that A has a higher conversion rate than B?

The actual math behind it is related to Beta distributions, and there are a few ways to go about it: a closed form, Monte Carlo simulations (Bayesian Statistics the Fun Way, ch. 15), and a threshold for caring. I plan on implementing these to see the difference between them (going to be more of a UX thing, the results should be almost the same). With the Monte Carlo approach, we can also get a precise answer to how much better B is than A. Something like: (which basically means B is 1.5 times better than A most of the time). And we can pair this with the 3rd approach if we want users to be able to customise a threshold for caring.

cc: @paolodamico @marcushyett-ph @kpthatsme @samwinslow - keen to hear the Growth team's thoughts on this as well :)

More supporting information for doing your own research:
https://github.com/gregberns/ABTesting - has a couple of good links to things to read
https://www.dynamicyield.com/lesson/introduction-to-ab-testing/ - this is a full-blown course
Against bayesian A/B testing - basically saying "I'm a frequentist, and I can't set confidence intervals / choose statistical power - because those are the things I'm familiar with". This is a valid criticism if we want to target people well versed in these A/B testing parameters, who've lived their lives avoiding the gotchas present here. Going the Bayesian route basically means learning a new thing, which can be off-putting.
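To make this a bit more tangible, here's a minimal sketch of what the Monte Carlo version could look like. Everything numeric here is hypothetical: the conversion counts, the 5% "threshold for caring", and the uniform Beta(1, 1) prior are illustrative assumptions, not decisions we've made.

```python
import numpy as np
from scipy.stats import beta

samples = 200_000

# hypothetical funnel counts per feature-flag variant
control_conv, control_n = 120, 1_000
test_conv, test_n = 140, 1_000

# draw from each variant's Beta posterior (uniform Beta(1, 1) prior assumed)
control = beta(1 + control_conv, 1 + control_n - control_conv).rvs(samples, random_state=1)
test = beta(1 + test_conv, 1 + test_n - test_conv).rvs(samples, random_state=2)

p_test_wins = (test > control).mean()
uplift = test / control  # "how much better is B over A" in each simulated world

# "threshold for caring": probability the improvement is at least 5% relative
caring_threshold = 0.05
p_meaningful_win = (uplift > 1 + caring_threshold).mean()

print(f"P(test has a higher conversion rate than control): {p_test_wins:.1%}")
print(f"Median relative uplift: {np.median(uplift):.2f}x "
      f"(95% credible interval {np.percentile(uplift, [2.5, 97.5]).round(2)})")
print(f"P(test is at least {caring_threshold:.0%} better): {p_meaningful_win:.1%}")
```

The closed-form approach gives the same "probability B beats A" without sampling; the Monte Carlo version is just easier to extend to questions like the relative uplift and the threshold for caring shown above.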
Awesome summary! I haven't read every link yet, but the Bayesian approach does seem better suited for decision-making with approximate rather than perfect information. What are you imagining the process for defining an experiment would be? Would you just determine the variants and let it run until the threshold is crossed?
This has been done partly offline & partly online - see #7462. For defining an experiment, you select a FF & create a funnel insight, which becomes the metric you're optimising for (for a code explanation, check out the test in #7492) - and there's a mockup by Paolo that's slightly out of date here: #7462 (comment). The clever bit is that the experiment results = the funnel broken down by the FF (a rough sketch of the idea is below).

The experiment ends whenever a user wants it to end, I'd say. We could do it based on a threshold as well - I haven't given much thought yet to experiment end conditions.
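To illustrate the "experiment results = breakdown of funnel based on FF" idea, here's a hypothetical, heavily simplified sketch (not the actual PostHog query). It assumes the flag variant arrives as a `$feature/<flag-key>` event property, and it ignores step ordering and conversion windows.

```python
from collections import defaultdict
from typing import Iterable

def funnel_by_flag_variant(events: Iterable[dict], steps: list[str], flag_key: str) -> dict:
    """Count people who entered and completed the funnel, grouped by feature-flag variant."""
    people = defaultdict(lambda: {"variant": None, "steps_seen": set()})
    for event in events:
        person = people[event["person_id"]]
        # remember which variant this person saw (assumed to be sent as an event property)
        variant = event["properties"].get(f"$feature/{flag_key}")
        if variant is not None:
            person["variant"] = variant
        if event["event"] in steps:
            person["steps_seen"].add(event["event"])

    results = defaultdict(lambda: {"entered": 0, "converted": 0})
    for person in people.values():
        if steps[0] in person["steps_seen"]:
            results[person["variant"]]["entered"] += 1
            if all(step in person["steps_seen"] for step in steps):
                results[person["variant"]]["converted"] += 1
    return dict(results)

# usage with made-up events:
events = [
    {"person_id": 1, "event": "signup", "properties": {"$feature/new-onboarding": "test"}},
    {"person_id": 1, "event": "activated", "properties": {"$feature/new-onboarding": "test"}},
    {"person_id": 2, "event": "signup", "properties": {"$feature/new-onboarding": "control"}},
]
print(funnel_by_flag_variant(events, ["signup", "activated"], "new-onboarding"))
# {'test': {'entered': 1, 'converted': 1}, 'control': {'entered': 1, 'converted': 0}}
```

The per-variant entered/converted counts are exactly what would feed the Bayesian comparison sketched earlier.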
I like the Bayesian approach. Thank you for collecting all these resources @neilkakkar, and for deeply comprehending the math 😅 After reading the "Issues with Current Bayesian Approaches to A/B Testing" doc, many of the author's points seem to favor rigor over pragmatism. For example:
The above may be true, but among software startups, A/B testing is probably employed because the decision-maker's prior belief in the success of a particular product change is roughly 50/50; if it were considerably far from 50/50, they would just say #biasforaction and execute the change without the overhead of running a test.

I'll flip the conversation around, then, to ask who we're building this feature for. If it's most software startups / SaaS businesses, Bayes would work well despite a small learning curve, because, like Eric said, it's built to accommodate incomplete information. But we may want to do some more digging into what life-sciences or fintech customers expect. Folks in those industries are generally quite particular and opinionated about the subtleties of the statistical measures they use. The preference for confidence intervals and such is especially strong in life sciences, and I don't know whether that attitude extends to their digital product teams as well.

No matter what, documentation and a user guide will be crucial, since this feature will certainly prompt questions from our users. I could see a "10 Misconceptions About A/B Testing" listicle-type blog post being popular as well.
@neilkakkar thanks for all the thought here. I'm always drawn in by doing something radically different, but we should make sure we validate it with our customers before going too far.
I would advocate for the go-big / fail-fast approach here, i.e. let's get some working version of Bayesian experimentation out as quickly as possible and test it with customers (I can think of a few who might love this and a few who might be wedded to a more traditional approach).
I suppose the scope of the MVP has been figured out, as it's built already!
As discussed in our standup today, here's a proposal for what the goal of Experimentation can be for this sprint. Please refer to the main RFC for details & context: https://github.com/PostHog/product-internal/blob/main/requests-for-comments/2021-10-22-collaboration.md
Goal
Moonshot: A/B testing suite MVP.
Problems to tackle
These are partial problems from the original document, scoped for a 2-week MVP.
Scope
Additional considerations