Experimentation MVP scope - 1.31.0 2/2 [Core analytics] #7418

Closed
paolodamico opened this issue Nov 30, 2021 · 14 comments
@paolodamico
Contributor

paolodamico commented Nov 30, 2021

As discussed in our standup today, here's a proposal for what the goal for Experimentation can be for this sprint. Please refer to the main RFC for details & context, https://github.com/PostHog/product-internal/blob/main/requests-for-comments/2021-10-22-collaboration.md

Goal

Moonshot: A/B testing suite MVP.

Problems to tackle

These are partial problems from the original document, scoped for a 2-week MVP.

  • P1. Defining experiments with a clear goal. Answering what are the results of the experiment.
  • P2. Experiments should run quickly. Answering how long should the experiment run for, given parameters.
  • P3. Experiments must be statistically significant. Answering whether the experiment produced statistically significant results.

Scope

  • Basic experiment planning
    • Users can create an experiment with a target metric (only simple trends metric or conversion rates for now) and assign participants, understand the timescale required (P1).
    • We can predict the approximate user base size & time required to reach conclusive and statistically significant results (P2).
  • Measure experiment results
    • Users can automatically track, in an experiment dashboard, a single target number for the experiment and control groups (as well as the delta between the two) (P1).
    • Normalize experiment results (i.e. if control/experiment groups have different # of users and we're measuring a volume metric, we account for the size of each group) (P1).
  • Statistical significance
    • Users can see a binary validation that experiment is statistically significant based on a heuristic (P3).
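To make the "normalize experiment results" point concrete, here is a minimal sketch of comparing a volume metric per user rather than in absolute terms. The `GroupResult` type and `compare_groups` function are hypothetical names for illustration only, not anything that exists in PostHog.

```python
from dataclasses import dataclass


@dataclass
class GroupResult:
    """Hypothetical container for one experiment group's raw numbers."""
    users: int           # participants assigned to the group
    metric_total: float  # raw volume of the target metric (e.g. event count)

    @property
    def per_user(self) -> float:
        # Normalizing by group size makes unequal groups comparable.
        return self.metric_total / self.users if self.users else 0.0


def compare_groups(control: GroupResult, test: GroupResult) -> dict:
    """Return the normalized values and the deltas between the two groups."""
    delta = test.per_user - control.per_user
    relative = delta / control.per_user if control.per_user else None
    return {
        "control_per_user": control.per_user,
        "test_per_user": test.per_user,
        "absolute_delta": delta,
        "relative_delta": relative,
    }
```

For example, a control group of 1,000 users with 500 events and a test group of 500 users with 300 events look like a loss for the test group in raw volume (300 vs 500), but per user the test group is ahead (0.6 vs 0.5).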

Additional considerations

  • Support only A/B testing for now (i.e. no multivariate).
  • Problems P1-P2 can be treated as roofshot goals & P3, moonshot.
  • A full roadmap for what this milestone could look like after this is coming up.
@macobo
Contributor

macobo commented Nov 30, 2021

Some questions:

  • Experiments will be special 'kinds' of feature flags right? Or how will users integrate/how will we show them?
  • Setting target metrics: We're only selecting what metric we're targeting right, not the target number we want to reach right?
  • Experiment dashboard: Will this be a normal dashboard, or will we expose this on the experiment creation/editing page?
  • Can you edit experiments as they're ongoing? What happens to measurements that have already been done?
  • Regarding time required + statistics - is there a particular method we'll be using? Will we have all the relevant pieces of data?
  • Is there any prior art we should pay attention to (e.g. other solutions from other products), etc?

@marcushyett-ph
Contributor

Feedback

  • "Set a target goal metric" seems like the fundamental premise for an experiment - why is this a P3?
  • Only completing the P1s and P2s (from scope) feels quite unambitious and not much more valuable than FFs today - essentially renaming a feature flag and calculating how long you need it to run for. What are the really critical things we need to solve to make this work end-to-end?
  • I feel the scope is a little prescriptive and detailed in terms of what we should build rather than what a good outcome would look like (e.g. instead of all the bullets for "basic experiment planning" we could say something like "Users can create an experiment with a target metric, assign participants, understand the timescale required") - we'll need to go into these details at some point - but this feels too detailed for the scope.

Questions

  • What are the major things we're leaving out of the MVP that we expect in the "full version" - apart from multi-variate?
  • To keep us thinking one step ahead, what do we think the next goal might look like after this one?

Nits:

  • "Full E2E MVP for an A/B testing suite" could be simplified to "A/B testing suite MVP"
  • "Experiments must be statistically significant." -> "Experiment results must be statistically significant"

@marcushyett-ph
Contributor

@macobo

Can you edit experiments as they're ongoing? What happens to measurements that have already been done?

  • From experience, making changes to an experiment while it's running can easily end in disaster. We need to be super careful that only the people in the test group are ever exposed, and if the test group changes we have to factor that into how we measure the results. I'd advocate for making experiments immutable, but deletable, for now.

Regarding time required + statistics - is there a particular method we'll be using? Will we have all the relevant pieces of data?

  • I would use standard power calculations - we'll need to know 3 pieces of info:
    • Statistical significance threshold - we can default to 95%
    • Sample size - we should be able to estimate this from how many unique users have triggered the target event over the past week or so,
    • Magnitude of effect - [This also answers your question about the target for the metric] - i.e. if you plan to 10x the metric, you can measure it with a smaller sample size
      • I guess we could pre-fill sample size and stat-sig threshold and allow them to adjust the "target level" which will make the required time to run the experiment shorter or longer.
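As a rough illustration of the power calculation described above, here is a sketch for a conversion-rate metric using the textbook two-proportion z-test sample size formula. The function names are hypothetical, and the 80% power default is my assumption (the thread only mentions the 95% significance threshold). Using weekly unique users divided by 7 as daily traffic is also just an approximation.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_lift,
                            significance=0.95, power=0.80):
    """Approximate users needed per group to detect the given relative lift.

    baseline_rate: current conversion rate, e.g. 0.10
    relative_lift: magnitude of effect to detect, e.g. 0.20 for +20%
    power:         assumed default, not taken from the discussion
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    alpha = 1 - significance
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def days_to_run(users_per_variant, weekly_unique_users, variants=2):
    """Estimate the run time from recent traffic to the target event."""
    daily_users = weekly_unique_users / 7
    return ceil(users_per_variant * variants / daily_users)
```

This shows the trade-off mentioned above: a larger expected effect (a bigger `relative_lift`) needs fewer users and therefore a shorter experiment, while a small effect can require a very long run.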

Is there any prior art we should pay attention to (e.g. other solutions from other products), etc?

  • I would recommend checking out what https://statsig.com do - it's essentially a replica of what Facebook uses internally - it's probably a bit overkill in terms of what we're trying to build with the MVP but gives a good idea of what this stuff looks like at the top end.

@neilkakkar
Contributor

neilkakkar commented Nov 30, 2021

Phrasing it differently, is the MVP we want to build something like:

We have a user who wants to run experiments over a funnel only or a single-entity trend graph. The goal is either improving the conversion rate, or the trend count. The variant is FF = true, and control is FF = false. This is what exists. Now, the problem is making experimentation easy for this user. That is, making it easy to solve the problems:

  1. Answering how long should the experiment run for, given parameters
  2. Answering whether the experiment produced statistically significant results
  3. (Are there any other questions users want answered?)

Judging from the above problems, it seems like UX/UI isn't going to be a big deal, so we can target building the functionality quickly.

I feel if we start from here, i.e. the subset of problems (the core) the MVP should solve, we can probably come up with a better/smaller implementation that targets just these problems. I personally don't know enough about how users are going to interact with experimentation to solidify an approach right now, hence would target this MVP for getting tangible real-world experimentation feedback.


I'm spending a bit more time looking into background material / what other companies do - this morning, to get a better idea of what to build. Thanks for the rec, Marcus! Would also be curious to hear how you & Paolo have been using tools like these in the past, to better refine the problems we want to solve?


Edit: I obviously have glossed over a lot of the implementation complexity here (and short-circuited some complexity, like manually choosing people for the experiment), but that's to come right after this^.

@neilkakkar
Contributor

neilkakkar commented Nov 30, 2021

Offhand note inspired by checking out Statsig: When it comes to existing A/B testing platforms, we can be fundamentally different in how we approach things, because we already have the context needed to make decisions.

For example, every stand-alone A/B testing platform has a step in the process where you define a target metric / choose how users are sorted into pass or fail based on some property (and then sync this with your implementation, so those users see the thing you expect them to see, outside of this platform).

I think we can skip this completely, because our target metrics naturally come out of the insights users create. And how users are segregated comes naturally out of the defined FF. (This also has the added fun factor where users can see current values over time of exactly what they're going to optimise, because it's an insight graph)

This implies trying to copy most existing platforms is less useful for us, than taking inspiration from the problems they've solved well, and integrating that into our context (which has some other problems solved automagically).

As Paolo mentioned elsewhere, Blaze does this well. They take the best of the platform they already have, and just introduce the few extra steps necessary to run an A/B test.

@EDsCODE
Member

EDsCODE commented Nov 30, 2021

Experiments will be special 'kinds' of feature flags right? Or how will users integrate/how will we show them?

+1 to this consideration. We're going to need to consolidate or clarify what experiment flags are vs regular feature flags, and whether we show them together or not. Otherwise we'll end up with multiple places where you can do almost the same thing.

To keep us thinking one step ahead, what do we think the next goal might look like after this one?

Most powerful moves here would be leveraging everything that we have that A/B testing specific companies might not.

  • If the user experimenting has already been using PostHog besides experimentation, we can search for other signals that the experiment might be affecting. I'm not very familiar with how extensive A/B testing platform capabilities are, but I'm guessing we're often going to have way more data than our competitors' single-purpose platforms, so if we can digest that information and look for downstream effects that someone's experiment is causing, this could prove incredibly valuable beyond the target metric that a user "thinks" is important.
  • We have user histories to work with. There could be some interesting work around how users are selected based on their activity across the app, e.g. power users of X or Y and how they convert in this new area.

@paolodamico
Contributor Author

Thanks everyone for the feedback. There are many things going on and I don't want this to become hard to follow, so I'll only tackle the most important points here. Please also see the updated initial description.

  1. I'll prepare a roadmap for what the rest of the milestone could look like. There will still be a lot of problems to tackle even after this (e.g. reach a conclusion based on target goals, secondary metrics / check for regressions, mutually exclusive experiments, ...). @marcushyett-ph
  2. Based on the standup conversation yesterday, I believe the point of this issue is to discuss the scope of both the problem and the solution (hence why it starts out more prescriptive). Even with that, @liyiy mentioned today that this still feels ambiguous. In any case, I updated the main description to better outline outcomes. @marcushyett-ph
  3. I don't think P1 & P2 are unambitious. We've had feedback from users that experiment planning is key, and there are a lot of considerations in getting it right. Proper automated tracking can also have a lot of nuances. In addition, only 2 engineers are working on this, not the entire team. @marcushyett-ph
  4. Re @neilkakkar, on questions the user wants answered: please see updated description.
  5. Experiments will be special 'kinds' of feature flags right? Or how will users integrate/how will we show them?

  6. Setting target metrics: We're only selecting what metric we're targeting right, not the target number we want to reach right?

    • Yes for this initial stage, but we'll probably want to come up with a way for users to define this so we can automatically tell them the ternary experiment outcome (success, failure, inconclusive).
  7. @neilkakkar aligned with the approach! I think we can start building this functionality and figure out the UX as we go along.

@neilkakkar
Contributor

neilkakkar commented Dec 2, 2021

Took me a while to gather all required information, but in regards to the actual test calculations, I think taking a frequentist approach (which has been the standard for 2000-2016(?)) is the wrong way to go.

Reasons why

  1. Most people aren't familiar with statistical tools & what precisely they mean, which leads to a lot of problems in interpreting results, and lots of heuristics like: "Don't peek at the results before the experiment ends", "Decide sample size in advance"
  2. Most people at PostHog who haven't taken a math course (probably everyone except @marcushyett-ph & moi) would have a hard time keeping the above heuristics in mind while trying to interpret results. We should make things easier in the product vs offloading all of this to users. That was the whole point of it: make A/B testing easy.

Supporting Information

So you don't have to fumble through everything the world knows about A/B testing for 3 days, here's some links & choice quotes:

Probably the most famous: How not to run an A/B test

Although they seem powerful and convenient, dashboard views of ongoing A/B experiments invite misuse. Any time they are used in conjunction with a manual or automatic “stopping rule,” the resulting significance tests are simply invalid.

[...] anyone running web experiments should only run experiments where the sample size has been fixed in advance, and stick to that sample size with near-religious discipline.
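The peeking problem in the quotes above can be checked with a small simulation. This is only a sketch under my own assumptions: an A/A test (both groups share the same true conversion rate, so every "significant" result is a false positive), a pooled two-proportion z-test at the 95% level, and stopping at the first significant peek. All function names and defaults are made up for the illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 95% threshold


def is_significant(conv_a, conv_b, n):
    """Pooled two-proportion z-test for two groups of n users each."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled <= 0 or pooled >= 1:
        return False  # no variance yet, nothing to test
    standard_error = sqrt(2 * pooled * (1 - pooled) / n)
    z = abs(conv_a - conv_b) / n / standard_error
    return z > Z_CRIT


def false_positive_rates(simulations=300, users=1000, peeks=10,
                         rate=0.1, seed=0):
    """Return (fixed_sample_rate, peeking_rate) of false positives."""
    rng = random.Random(seed)
    checkpoints = {users * k // peeks for k in range(1, peeks + 1)}
    fixed_hits = peeking_hits = 0
    for _ in range(simulations):
        conv_a = conv_b = 0
        stopped_early = False
        for n in range(1, users + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            # "Peeking": declare a winner the first time a check is significant.
            if n in checkpoints and is_significant(conv_a, conv_b, n):
                stopped_early = True
        fixed_hits += is_significant(conv_a, conv_b, users)
        peeking_hits += stopped_early
    return fixed_hits / simulations, peeking_hits / simulations
```

Because the final check is itself one of the peeks, the peeking rate can never be lower than the fixed-sample rate in this setup, and in practice it tends to come out well above the nominal 5%.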

What to do then?

Go Bayesian. This is a very similar approach to what we did with correlations (likelihood odds over correlation score / confusion Matrix).

One of the most succinct explanations I've found: Frequentist vs Bayesian Approach to A/B Testing

The industry is moving toward the Bayesian framework as it is a simpler, less restrictive, more reliable, and more intuitive approach to A/B testing.

Some big cutting edge companies that have moved to a Bayesian approach: VWO, Dynamic Yield.

Effectively, this doesn't require us to control sample sizes: the longer the experiment runs, the better the results.

And, we get one number that literally everyone can understand: What's the probability that A has higher conversion rate than B?

(The actual math behind it is related to Beta Distributions: Closed form, Monte-carlo simulations (Bayesian Statistics the fun way, ch 15), and a threshold for caring.)

I plan on implementing these to see the difference between them (going to be more of a UX thing, the results should be almost the same).
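To give a feel for how the closed-form and Monte Carlo routes compare, here is a sketch that computes the probability that B has a higher conversion rate than A both ways. It assumes a uniform Beta(1, 1) prior on each rate (my assumption, not something decided in this thread) and uses a commonly cited closed-form sum that only works for integer Beta parameters. Function names are hypothetical.

```python
import random
from math import exp, lgamma, log


def log_beta(a, b):
    """Log of the Beta function, computed via log-gamma for stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def prob_b_beats_a_closed_form(a_conv, a_fail, b_conv, b_fail):
    """P(rate_B > rate_A) for Beta posteriors with a uniform prior."""
    alpha_a, beta_a = a_conv + 1, a_fail + 1
    alpha_b, beta_b = b_conv + 1, b_fail + 1
    total = 0.0
    for i in range(alpha_b):
        total += exp(
            log_beta(alpha_a + i, beta_a + beta_b)
            - log(beta_b + i)
            - log_beta(1 + i, beta_b)
            - log_beta(alpha_a, beta_a)
        )
    return total


def prob_b_beats_a_monte_carlo(a_conv, a_fail, b_conv, b_fail,
                               draws=20_000, seed=0):
    """Same quantity, estimated by sampling from both posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p_a = rng.betavariate(a_conv + 1, a_fail + 1)
        p_b = rng.betavariate(b_conv + 1, b_fail + 1)
        wins += p_b > p_a
    return wins / draws
```

Calling both with the same counts, e.g. `(100, 900, 130, 870)` for 100/1000 vs 130/1000 conversions, should give nearly the same probability, which matches the expectation that the choice between them is mostly about UX and what extra outputs we want.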

With the monte-carlo approach, we can also have a precise answer over how much better is B over A. Something like:

image

(which basically means B is 1.5 times better than A, most of the time)

And we can pair this with the 3rd approach if we want users to be able to customise a threshold for caring.
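A sketch of how the "how much better is B" answer and the threshold for caring could fit together: sample both posteriors, keep the ratio of the sampled rates, then ask how often the uplift clears a user-chosen threshold. The uniform Beta(1, 1) prior, the function names, and the example threshold are assumptions for illustration.

```python
import random
from statistics import median


def uplift_samples(a_conv, a_fail, b_conv, b_fail, draws=20_000, seed=0):
    """Sample the ratio rate_B / rate_A from Beta posteriors (uniform prior)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        p_a = rng.betavariate(a_conv + 1, a_fail + 1)
        p_b = rng.betavariate(b_conv + 1, b_fail + 1)
        samples.append(p_b / p_a)
    return samples


def prob_uplift_exceeds(samples, threshold):
    """Share of samples where B beats A by more than `threshold` (0.1 = +10%)."""
    return sum(s > 1 + threshold for s in samples) / len(samples)


# Hypothetical data: 100/1000 conversions for A, 150/1000 for B.
samples = uplift_samples(100, 900, 150, 850)
typical_uplift = median(samples)           # "B is about this many times better"
p_any_win = prob_uplift_exceeds(samples, 0.0)
p_worth_it = prob_uplift_exceeds(samples, 0.2)  # user only cares about +20%
```

A histogram of `samples` would be the kind of chart shown above, and the threshold turns it into a single number users can act on.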


cc: @paolodamico @marcushyett-ph @kpthatsme @samwinslow - keen to hear thoughts of the Growth team on this as well :)

More supporting information for doing your own research:

https://github.com/gregberns/ABTesting - has a couple of good links to things to read

https://www.dynamicyield.com/lesson/introduction-to-ab-testing/ - this is a full blown course

Against bayesian A/B testing - (Basically saying "I'm a frequentist, and I can't set confidence intervals / choose statistical power - because those are the things I'm familiar with"). This is a valid criticism if we want to target people well versed with these A/B testing parameters, who've lived their life avoiding the gotchas present here. Going the bayesian route basically means learning a new thing, which can be off-putting.

VWO Whitepaper

@EDsCODE
Member

EDsCODE commented Dec 2, 2021

Awesome summary!

I haven't read every link yet but the bayesian approach does seem better suited for decision making with approximate information rather than perfect. What are you imagining would be the process for defining an experiment? Would you just determine the variants and let it run until the threshold is crossed?

@neilkakkar
Contributor

This has been done partly offline & online - #7462

For defining an experiment, you select a FF & create a funnel insight (which becomes the metric you're optimising for. For a code explanation - #7492 - check out the test) - and there's a mockup by Paolo that's slightly out of date here: #7462 (comment)

The clever bit is that the experiment results = Breakdown of funnel based on FF.
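To illustrate the idea (this is not the actual implementation, which lives in the linked PRs), here is a toy version of "funnel broken down by feature flag". The event shape `(distinct_id, event_name, flag_variant)` and the function name are hypothetical.

```python
from collections import defaultdict


def funnel_breakdown(events, steps):
    """Count funnel entries and conversions per feature flag variant.

    events: iterable of (distinct_id, event_name, flag_variant), in time order
    steps:  ordered list of event names that make up the funnel
    """
    progress = {}  # (variant, user) -> number of funnel steps completed
    for user, name, variant in events:
        key = (variant, user)
        done = progress.get(key, 0)
        # Only advance when the event matches the user's next expected step.
        if done < len(steps) and name == steps[done]:
            progress[key] = done + 1

    results = defaultdict(lambda: {"entered": 0, "converted": 0})
    for (variant, _user), done in progress.items():
        results[variant]["entered"] += 1
        results[variant]["converted"] += done == len(steps)
    return dict(results)
```

The per-variant `entered` and `converted` counts are exactly the inputs the significance calculation needs, which is why the experiment result falls out of an ordinary funnel breakdown.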

The experiment ends whenever a user wants it to end, I'd say. We could do it based on a threshold as well - haven't given much thought yet to experiment end conditions.

@samwinslow
Contributor

I like the Bayesian approach. Thank you for collecting all these resources @neilkakkar, and for deeply comprehending the math 😅

After reading the "Issues with Current Bayesian Approaches to A/B Testing" doc, many of the author's points seem to favor rigor over pragmatism. For example:

it is not the same thing to claim that “I do not know anything about those hypothesis, aside model assumptions” and “the probability of each of those hypothesis being true is equal”.

The above may be true, but among software startups, A/B testing is probably employed because the decision-maker's prior belief for the success of a particular product change is roughly 50/50 — if it was considerably far away from 50/50 they would just say #biasforaction and execute the change without the overhead of running a test.

I will flip the conversation around, then, to ask who we're building this feature for. Because if it's most software startups / SaaS businesses, Bayes would work well, despite a small learning curve, because like Eric said it is built to accommodate incomplete information. But we may want to do some more digging into what life sciences or fintech customers expect.

Folks in those industries are generally quite particular and opinionated about the subtleties of the statistical measures they use. The preference for confidence intervals and such is especially strong in life sciences, and I don't know if that attitude extends into their digital product teams as well.

No matter what, documentation and a user guide will be crucial to provide answers to a feature which will certainly cause our users to ask some questions. I could see a "10 Misconceptions About A/B Testing" listicle-type blog post being popular as well.

@marcushyett-ph
Contributor

marcushyett-ph commented Dec 3, 2021

@neilkakkar thanks for all the thought here, I'm always drawn in by doing something radically different, but we should make sure we validate it with our customers before going too far.

  • For people who are new to A/B testing: Bayesian is likely to be an easier place for them to start (I think it's likely that many large enterprises won't have done A/B product testing before - as well as smaller customers).
  • For people who are experienced with A/B testing: Bayesian on its own is likely going to be a hard pill to swallow if they're very used to their existing approach

I would advocate for the go big / fail fast approach here, i.e. let's get some working version of the Bayesian experimentation out as quickly as possible and test it with customers (I can think of a few who might love this and a few who might be wedded to a more traditional approach)

@paolodamico
Contributor Author

  • To answer @samwinslow, indeed that is the key point here. We're not really building (at least for now) for users who have these strong rigor/precision requirements (more details here)
  • 👍 to very solid guides, both for a non-technical person to understand what's going on and what to focus on, and for a technical person to understand exactly what's going on under the hood.
  • Re @marcushyett-ph, that's the approach we landed on in our last meeting. We'll do the Bayes approach and put it in front of potential users of experimentation.
  • In addition to the above, we discussed putting the Monte Carlo simulation and subsequent histogram report on hold for this sprint. Once we have the final mockup designs (hopefully next sprint), we'll put them in front of users and see if the report a) provides value and b) is clear enough.

@Twixes
Member

Twixes commented Feb 16, 2022

I suppose the scope of the MVP has been figured out as it's built already!

@Twixes Twixes closed this as completed Feb 16, 2022