RFC: Resilient feature flags #74
Merged

# Request for comments: Resilient feature flags

## Problem statement
*Who are we building for, what are their needs (include example cases), why is this important?*

We've recently realised a few limitations with feature flags when PostHog goes down. This isn't great for permanent feature flags, as it causes things to break in unexpected ways & forces clients to think really hard about defaults & write code defensively.

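To make the pain concrete, here's a minimal sketch of the defensive pattern this forces on users today (the flag name and the fallback are hypothetical):

```ts
import posthog from 'posthog-js'

// Every flag read needs a hand-rolled default in case PostHog is
// unreachable and flags never loaded. 'checkout-v2' is a made-up flag.
function isCheckoutV2Enabled(): boolean {
  // isFeatureEnabled returns undefined when flags aren't available, so
  // the caller has to pick a fallback at every call site.
  return posthog.isFeatureEnabled('checkout-v2') ?? false
}
```
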
Ideally, our client libraries and APIs should 'just work', even if there's no error handling on the user's side.

So, how do we make this happen?

## Context

The (simplified) way things currently work is:

Whenever a user shows up on a page, we send a request to `/decide` with their randomly generated anonymous ID to evaluate feature flags. Usually these people don't have properties attached to them, so only flags that don't depend on user properties resolve to `true` (assuming they're within the rollout bound).

Things change when we call `identify()` on a user: this updates their ID and fires another request to `/decide`, which gets new flags based not only on current user properties, but also on any past properties that were set for this user.

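As a rough sketch of this flow (the exact request shape below is an assumption for illustration; posthog-js handles all of this internally):

```ts
// Hypothetical sketch of the current client-side flow.
async function fetchFlags(distinctId: string): Promise<Record<string, boolean | string>> {
  const res = await fetch('https://app.posthog.com/decide?v=2', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ api_key: '<project API key>', distinct_id: distinctId }),
  })
  // Today, a non-2xx response means the SDK ends up with no flags at all.
  if (!res.ok) throw new Error(`/decide failed with ${res.status}`)
  return (await res.json()).featureFlags
}

// 1. Page load: a random anonymous ID, with no person properties stored yet.
let flags = await fetchFlags(crypto.randomUUID())
// 2. After identify(): the same call with the real user's distinct ID; flags
//    can now be evaluated against current and previously stored properties.
flags = await fetchFlags('distinct-id-of-identified-user')
```
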
We need the database to (sketched below):
(a) Validate project tokens and find the relevant team
(b) Get feature flag definitions for the given team
(c) Evaluate the above flags using the person's properties stored in the DB

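Each of those steps is a separate DB round trip, so each is a separate point of failure when Postgres is unreachable. In pseudocode, with `db`, `evaluateFlags`, and all helper names being hypothetical stand-ins for the real internals:

```ts
interface FlagDefinition { key: string /* rollout conditions, variants, etc. */ }

declare const db: {
  teams: { findByApiKey(key: string): Promise<{ id: number } | null> }
  featureFlags: { forTeam(teamId: number): Promise<FlagDefinition[]> }
  persons: { propertiesFor(teamId: number, distinctId: string): Promise<Record<string, unknown>> }
}
declare function evaluateFlags(
  definitions: FlagDefinition[],
  properties: Record<string, unknown>
): Record<string, boolean | string>

async function decide(projectApiKey: string, distinctId: string) {
  // (a) Validate the project token and find the relevant team.
  const team = await db.teams.findByApiKey(projectApiKey)
  if (!team) throw new Error('invalid project API key')
  // (b) Load the team's feature flag definitions.
  const definitions = await db.featureFlags.forTeam(team.id)
  // (c) Evaluate the flags against person properties stored in the DB.
  const properties = await db.persons.propertiesFor(team.id, distinctId)
  return evaluateFlags(definitions, properties)
}
```
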
Note that `decide` can run on any arbitrary anonymous ID.

Caching decide responses doesn't make much sense, since it's usually called only once per distinct ID per session. Every anonymous user has a new ID, so except for users coming back to the app again & again on the same day, we don't gain resiliency: we can't cache for users we've never seen before, nor can we assume their properties haven't changed. Meanwhile, it introduces new problems like cache invalidation.

### Possible failure modes

1. The database is down/unreachable: `decide` evaluation fails and returns 500.
2. The servers are down/unreachable: requests from client libraries time out / error out.

(2) seems hard to defend against without a distributed app distribution (and then a resolver to go to the 'correct' app), but (1) is possible.

## Solution

Introducing: decide v3.

This is a good opportunity to fix other customer frustrations around this endpoint, and pave the way for breaking changes we've wanted to introduce for a while.

In short (a sketch of the possible response shape follows this list):

1. Decide v3 returns all flags, with a `true` or `false` based on whether they're enabled or not, vs. implicitly just not returning the false flags.
2. Decide v3 does a best-effort calculation of flags when PostHog DBs are down, and tells the client that it faced issues computing all flags.
3. Decide v3 can return a JSON payload with the flag variant (sneaking this into the major version API update).

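One possible shape for the v3 response; the field names here are assumptions for illustration, not a settled contract:

```ts
// Hypothetical decide v3 response shape.
interface DecideV3Response {
  // (1) Every flag appears, including disabled ones; multivariate flags
  // return their variant key instead of `true`.
  featureFlags: Record<string, boolean | string>
  // (2) Set when the DB was unreachable and only a best-effort subset
  // of flags could be computed.
  errorsWhileComputingFlags: boolean
  // (3) Optional JSON payload per flag.
  featureFlagPayloads?: Record<string, unknown>
}
```
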
On the SDK side (see the merge sketch after this list):

1. We stop obliterating all flags if decide returns a 500, and based on the parameter in (2), update only flags that were computed again, while keeping old values of other flags.
2. We add timeouts to flushes & shutdowns.

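A minimal sketch of that merge, using the hypothetical response type sketched above:

```ts
type Flags = Record<string, boolean | string>

// Point (1): on a partial (best-effort) response, only overwrite flags
// that were actually recomputed, instead of wiping everything.
function mergeFlags(cached: Flags, response: DecideV3Response): Flags {
  if (response.errorsWhileComputingFlags) {
    // Layer fresh values over the cached ones; stale values survive.
    return { ...cached, ...response.featureFlags }
  }
  // Full response: safe to replace wholesale, so flags deleted on the
  // server also disappear locally.
  return response.featureFlags
}
```
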
To make this possible, we cache flag definitions (and not the decide responses) on our side, so these are available even if the DB is down. We need to do the same for determining the team given a project API key.

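A rough sketch of that fallback (the names, cache choice, and key scheme are assumptions; `db` and `FlagDefinition` are the stand-ins from the earlier sketch):

```ts
declare const cache: {
  get(key: string): Promise<FlagDefinition[] | null>
}

// Serve flag definitions from the DB when healthy, and from a cache
// (refreshed on every flag write) when the DB errors out.
async function loadFlagDefinitions(
  projectApiKey: string
): Promise<{ flags: FlagDefinition[]; partial: boolean }> {
  try {
    const team = await db.teams.findByApiKey(projectApiKey)
    if (!team) return { flags: [], partial: false } // genuinely invalid key
    return { flags: await db.featureFlags.forTeam(team.id), partial: false }
  } catch {
    // DB unreachable: fall back to cached definitions keyed by the API
    // key, which also covers the token→team lookup. Flags depending on
    // DB-stored person properties still can't be evaluated, so the
    // caller marks the response with errorsWhileComputingFlags.
    const cached = await cache.get(`flag-definitions:${projectApiKey}`)
    return { flags: cached ?? [], partial: true }
  }
}
```
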
It makes sense to include (3) with (1) for decide because the shape of the response is changing anyway, and it's much easier to handle both at the same time in all client libraries than to do these one by one.

We don't have to build the rest of it out right now, but having this functionality here means that when we do build it, it will be much faster to do.

## Is it worth doing this?

If PostHog never goes down, none of this gives us an advantage (except maybe speeding up `/decide` responses if we decide to put the cache in front).

However, for the times we do go down, doing this ensures that our customers don't have a terrible experience & don't have their apps break. It's a trust-building activity that ensures customers stick around.

It's an extremely big churn risk otherwise: if PostHog going down makes their app go down, we become an extra dependency for them being up.

## Context
*What are our competitors doing, what are the technical constraints, what are customers asking for, what does the data tell us, are there external motivations (e.g. launch week, enterprise contract)?*

## Design
*What are the key user experience and technical design decisions / trade-offs?*

## Sprints
*How do we break this into discrete and valuable chunks for the folks shipping it? How do we ensure it's high quality and fast?*

what is a distributed app distribution and what is a 'correct' app?
Talk about terrible wording 😅. I basically mean multiple server deployments where one going down doesn't affect the other, like an edge server; and a load balancer that can appropriately route to the closest healthy server.
I don't think we should build an SDK that makes assumptions about the underlying infrastructure.
To build resilient software, use defensive programming: always hope for the best (100% uptime) but always prepare for the worst (everything is down). AKA: a multi-geo deployment might help you, but you should not rely on it.
I'm not assuming that at all; I'm just listing the ways things can go wrong here.
Even with defensive programming, the issue still remains that when servers go down, you stop getting flag information.
(A point below addresses this.)