How do we implement resilient feature flags #13601

Closed
6 of 9 tasks
neilkakkar opened this issue Jan 9, 2023 · 7 comments

Comments

@neilkakkar
Contributor

neilkakkar commented Jan 9, 2023

The what & why: PostHog/meta#74

The how: this issue. Going deep into exactly how we'll implement this and reduce the risk of everything blowing up. This touches a very sensitive code path, so making sure there's zero downtime is important.

This issue seeks to clarify for everyone how we'll get there (and for me to think through how to do it).

Broadly, the things we need to do are:

  1. Introduce caching on decide for two things: (1) project token to team, and (2) teamID to feature flag definitions. PR: feat(flags): Enable caching for resilient responses #13708

  2. Figure out how to update caches & when to invalidate. Open question: How do we ensure caches are always populated? It's going to be annoying if cache isn't populated and postgres goes down, leaving us to die.

  3. Figure out the code paths: do we always default to cache first, or keep the cache just as a backup? Depends partially on the above & the guarantees we have on the cache.

  4. Figure out the semantics of 'best-effort flag calculation': given that postgres is down, which flags do we want to calculate, and how will this work?

  5. Update client libraries to use the new decide response: update only the flags sent by decide and keep the old ones as is, unless there were no errors during computation, in which case replace all flags. PR: feat(feature-flags): Allow upserting decide v3 responses posthog-js-lite#53

  6. And don't obliterate flags on decide 500 responses. PR: feat(feature-flags): Allow upserting decide v3 responses posthog-js-lite#53
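
A minimal sketch of the client-side behaviour described in items 5 and 6, written in Python for illustration only (the linked client PR is not shown here). The response shape, with a `flags` dict and a `had_errors` boolean, is an assumption made for this example, as is the function name `merge_flags`.

```python
from typing import Dict, Optional, Union

FlagValue = Union[bool, str]


def merge_flags(
    current: Dict[str, FlagValue],
    status_code: int,
    response: Optional[dict],
) -> Dict[str, FlagValue]:
    """Return the flag set a client should hold after a decide call.

    `flags` and `had_errors` are hypothetical field names for this sketch.
    """
    # A failed request (e.g. a 500) must not wipe out the flags we already know.
    if status_code != 200 or response is None:
        return dict(current)

    incoming = response.get("flags", {})
    if response.get("had_errors", False):
        # Partial result: upsert what the server sent, keep everything else.
        return {**current, **incoming}

    # Clean result: the server's answer is complete, so replace all flags.
    return dict(incoming)
```

The three branches map onto the three cases in the list: a failed request keeps the old flags, a partial response upserts, and an error-free response replaces everything.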


To reduce risk, it makes sense to break the server changes down into discrete, independent parts:

  1. project API key -> teamID caching
  2. teamID -> flag definitions caching
  3. best effort flag evaluation
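
The semantics of best-effort evaluation are still an open question in this issue, so the following is only one possible shape, not the chosen design. It assumes each cached flag definition carries a hypothetical `needs_db` marker (for example, a flag whose conditions depend on stored person properties) and a `rollout_percentage`. Flags that need the database are skipped when it is unavailable, and the caller is told the result is partial.

```python
import hashlib
from typing import Dict, List, Tuple


def rollout_bucket(flag_key: str, distinct_id: str) -> float:
    """Map a (flag, user) pair to a stable number in [0, 1)."""
    digest = hashlib.sha1(f"{flag_key}.{distinct_id}".encode()).hexdigest()
    return int(digest[:15], 16) / 16**15


def evaluate_best_effort(
    flags: List[dict], distinct_id: str, db_available: bool
) -> Tuple[Dict[str, bool], bool]:
    """Evaluate every flag we can; report whether any had to be skipped."""
    results: Dict[str, bool] = {}
    had_errors = False
    for flag in flags:
        if flag.get("needs_db", False) and not db_available:
            # Cannot compute this flag without the database: leave it out
            # of the response rather than guessing a value.
            had_errors = True
            continue
        # Simplification: only the rollout percentage is checked here.
        # A real evaluator would also match property conditions.
        threshold = flag["rollout_percentage"] / 100
        results[flag["key"]] = rollout_bucket(flag["key"], distinct_id) < threshold
    return results, had_errors
```

Leaving skipped flags out of the response, instead of returning them as `False`, is what lets a client keep its previously known value for them.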
@neilkakkar
Contributor Author

neilkakkar commented Jan 9, 2023

re: caching, I'm thinking of introducing post-save/update hooks, which ensure that when a flag is updated, the cache is updated as well.

These can sometimes fail, which leaves the cache out of date, which isn't great. We could introduce a TTL for this, but that creates a new problem: if the TTL is too low (say, 5 minutes), then the DB going down means we go down anyway, since the cached information expires and is lost. If it's too high, chances are things will stay stale for longer 🤔.

It's probably better to have this in the update/create flow itself, guaranteeing that the request is a success only when the cache is updated too. Yep, this seems better. We can have longer ttls with this as well.
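
A rough sketch of that "success only when the cache is updated too" flow. The dict-backed `db`, the `DictCache` stand-in, the `CacheUnavailable` exception, and the `team_flags:<id>` key format are all made up for this example. A real version would use the actual database and Redis clients, with a proper transaction instead of the manual rollback.

```python
from typing import Dict, List


class CacheUnavailable(Exception):
    """Raised by the stand-in cache when a write cannot be completed."""


class DictCache:
    """In-memory stand-in for the real cache, with a switch to simulate failure."""

    def __init__(self) -> None:
        self.data: Dict[str, List[dict]] = {}
        self.fail = False

    def set(self, key: str, value: List[dict]) -> None:
        if self.fail:
            raise CacheUnavailable(key)
        self.data[key] = value


def save_flag(db: Dict[int, List[dict]], cache: DictCache, team_id: int, flag: dict) -> None:
    """Write the flag to the db and the cache; fail the request if either fails."""
    previous = list(db.get(team_id, []))
    updated = [f for f in previous if f["key"] != flag["key"]] + [flag]
    db[team_id] = updated
    try:
        # The cache always holds the team's full list of flag definitions.
        cache.set(f"team_flags:{team_id}", updated)
    except CacheUnavailable:
        # Roll the db write back so db and cache never disagree, then
        # surface the error so the update request is reported as failed.
        db[team_id] = previous
        raise
```

The trade-off is that a cache outage now blocks flag edits, in exchange for the cache never silently drifting from the database.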

Making some constraints explicit:

  1. We can't really go for a cache-aside strategy, where we read from the cache and fall back to the db on a miss (well, not by default, and not all the time at least), since the point is to defend against the db going down sporadically because of too many connections, etc.
  2. Given the above, we necessarily want to populate the cache on startup.
     1. We can possibly subvert this by relaxing the constraint above & treating this like a regular cache. But that is not a worthy trade-off imo, as it destroys the guarantee we were looking for in the first place.

Do we need TTLs at all then? Not really, since there's no big risk of things going out of sync.
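
To make the constraints concrete, here is a small sketch of warming both caches at startup and then serving reads from the cache alone, with no TTL and no database fallback on a miss. The key formats, the `api_token` field name, and the plain dict standing in for the cache are assumptions for illustration.

```python
from typing import Dict, List, Optional


def warm_cache(
    teams: List[dict],
    flags_by_team: Dict[int, List[dict]],
    cache: Dict[str, object],
) -> None:
    """Populate both lookups at startup; entries are stored without expiry."""
    for team in teams:
        cache[f"team_token:{team['api_token']}"] = team["id"]
        cache[f"team_flags:{team['id']}"] = flags_by_team.get(team["id"], [])


def flags_for_token(token: str, cache: Dict[str, object]) -> Optional[List[dict]]:
    """Resolve token -> team -> flag definitions using the cache only."""
    team_id = cache.get(f"team_token:{token}")
    if team_id is None:
        # Unknown token. Deliberately no db fallback here (not cache-aside).
        return None
    return cache.get(f"team_flags:{team_id}", [])
```

Because nothing expires, the read path keeps working for as long as the cache itself is up, regardless of the database's state.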

@neilkakkar
Contributor Author

regarding size limits, the current feature flag table (which will effectively be cached) is less than 3 MB in size:

SELECT pg_size_pretty(pg_total_relation_size('posthog_featureflag'))

so, we're good here size-wise for a long time.

project token to teamID is even smaller, at O(number of teams)

@ellie
Contributor

ellie commented Jan 13, 2023

Have we considered caching outside of our main app deployment? I.e., in the case that our app is totally down (lb failure/misconfig, dodgy deploy, uncaught logic problem or otherwise), customers can still resolve flags for the last state they were in.

There's definitely a bunch we could do within AWS that could make this incredibly resilient

@neilkakkar
Contributor Author

Great idea! Haven't yet, but I expect it to be fairly plug-and-play: a matter of changing redis servers once the basic code is in place (correct me if I'm wrong!)

At that stage, would love some support from infra in making this more robust.

@neilkakkar
Contributor Author

Ah, wait, no, if the entire app deployment is down, the /decide API endpoint is down too, so the above doesn't help 🤔.

Isn't this then effectively having a second app deployment? We can't/don't want to cache responses, only the flag definitions, so something still has to evaluate them.

@neilkakkar neilkakkar changed the title WIP: How do we implement resilient feature flags How do we implement resilient feature flags Jan 19, 2023
@ayr-ton

ayr-ton commented Feb 24, 2023

Do you think updating the Flutter SDK to support normal flags and the new decide can be part of the tasks?
There is an open issue about it: #12222

@neilkakkar
Contributor Author

A PR is already out for that, should be going in soon 👀


3 participants