-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
6ba6792
commit 9e51c06
Showing
1 changed file
with
19 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
|
||
Internal post-mortem: https://github.com/PostHog/incidents-analysis/pull/58 | ||
|
||
On 28th February, PostHog Feature Flags went down for about 3 hours, from 15:30 to 18:10 UTC. | ||
|
||
Our database started stalling, which caused a small blip in service. Unfortunately, we then proceeded to DDOS ourselves as all clients tried to retry feature flags. This problem never showed up before because we had sufficient capacity to handle 5x the peak load. As flags have grown, this unfortunately isn’t true anymore. To further exacerbate this problem, our failover healthchecks for the database all returned healthy, so we kept slowly trying to service all requests. | ||
|
||
As a result, some users had their flags fallback to empty responses, and some had their applications crash because of exceedingly long timeouts. However, users using local evaluation were not affected. This is again our bad as SDKs were missing timeout configurations specific to feature flags - this ought to generally be much lower than regular capture requests. | ||
|
||
We spent way too much time trying to figure out what was wrong, which is why this incident lasted so long - this is completely unacceptable and we’ll be making several changes to make sure this doesn’t happen again. | ||
|
||
To start with, all SDKs will stop retrying feature flag requests, and have customisable timeouts specific for flags. Further, the default here will be 3 seconds, compared to the 10 second default for all other requests. This is already released for python(v3.5.0) and node(v.4.0), rest of the SDKs to follow. This should make sure that even if something unexpected goes wrong with our servers in the future, your application never goes down. | ||
|
||
Secondly, we’ll add a manual override to our db healthchecks, so we can much more quickly respond to issues like these, and bring flags back up. | ||
|
||
As a more long term solution (Q2 2024), we’ll move out feature flag requests from our Django monolith as we hit these scaling limits, and move to a more scalable system focused just on feature flag reliability. | ||
|
||
|
||
|