Squash NaN bugs and prevent them from coming back. (#6375)
Addresses some rare (but damaging) issues in production that appear to be due to NaNs in aggregator responses, very likely due to sending 0-rps updates at unlucky times and/or update-ticks running more quickly than 1s (which may occur if an update takes >2s, as tickers queue up a "tick" and their time continues elapsing whether it is read or not).

These NaNs do two very problematic things:
- NaNs beget more NaNs, so they tend to "stick" for large periods of time
- NaNs turn into `INT_MIN` when you `int(NaN)`, which related code tends to treat as `0` (e.g. the underlying ratelimiter does this for negative limits) - and this rejects all requests

The fix is essentially two changes:
- fix the math to not produce NaNs/etc - pretty obviously needed
- detect NaNs/etc arriving over RPC and reject them up-front, so bad data like this now switches to using fallbacks rather than misbehaving (a rough sketch of this check is at the end of this description)
  - this implies we are distrusting our own service, but... that's probably correct when deploys roll out. It's better to be preventative when the blast radius is large and spreads rapidly.

And the rest of the changes are devoted to preventing this from coming back in the future:
- probably-excessive "detect unexpected values and log them" checks scattered everywhere
  - these would likely be fine to trim to just the final "assign or return" check; I don't feel strongly about keeping them everywhere like this.
- a relatively fancy fuzz test that runs multiple rounds of updates and checks all data for "bad" values at every step (also sketched at the end of this description)
  - I started out with a simpler single-round one... but multiple rounds found another flaw, and multi-round naturally covers single-round too, so it didn't seem worth keeping.
  - I had originally thought that this probably needed fuzzing... and yep, it did.
- some additional tests to cover more error cases and the new branches

---

While I believe this change is good enough to fix the problem, and worth hot-fixing ASAP, it all deserves a later change to add some "odd behavior detected, log details" and "log all data about key X" code, to improve observability when things act weird. If we had this logging earlier, we probably could've spotted this in less than an hour from the first instance, rather than weeks later.

It would also have been helpful for explaining some strong load-imbalance events; I suspect they are "valid" and reasonable behavior (weights heavily skewed because requests were heavily skewed), but it's hard to prove or disprove that without fine-grained data, and that also means it's hard to help our customers understand things.

---

For fuzz testing specifically: I've included some of the failed runs from while I was developing it, to seed the fuzzer, and these are run as part of normal `go test` runs so they also help ensure coverage.

Some of the scenarios took ~20 minutes of fuzzing to catch though, so as is normal with fuzz testing: if you make changes and you are not totally confident in them, fuzzing for *hours* is sorta the baseline expectation. Though anything is better than nothing. And please commit any failed runs, if they're not super (unnecessarily) large!
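For context on the "reject up-front" change, here is a minimal sketch of the shape of that check. The names and signature are hypothetical (the real change lives in the aggregator/limiter code touched by this commit); only the `math.IsNaN`/`math.IsInf` style of validation mirrors the actual fix:

```go
package ratelimiter

import (
	"fmt"
	"math"
)

// sanitizeRPS rejects NaN / Inf / negative values arriving over RPC.
// Callers are expected to switch to their fallback limit when err != nil,
// rather than letting the bad value propagate into later math.
//
// (Illustrative helper only - not the exact code in this commit.)
func sanitizeRPS(rps float64) (float64, error) {
	switch {
	case math.IsNaN(rps):
		return 0, fmt.Errorf("invalid rps: NaN")
	case math.IsInf(rps, 0):
		return 0, fmt.Errorf("invalid rps: %v", rps)
	case rps < 0:
		return 0, fmt.Errorf("invalid rps: negative (%v)", rps)
	}
	return rps, nil
}
```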
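And the multi-round fuzz test is shaped roughly like this. Everything here is a simplified stand-in (the real test drives the actual aggregator and checks every stored value each round), but it shows the "several rounds of arbitrary updates, assert no bad values ever appear" structure:

```go
package ratelimiter

import (
	"math"
	"testing"
)

// FuzzMultiRoundUpdates is a sketch of the multi-round fuzz approach:
// feed arbitrary update values over several rounds and assert that no
// NaN/Inf ever shows up in the resulting state. Names and the update
// math are illustrative, not the real test in this commit.
func FuzzMultiRoundUpdates(f *testing.F) {
	// Seed with a known-tricky input; committed failed runs act as seeds
	// too, and they run as part of normal `go test`.
	f.Add(0.0, 0.0, 0.0)
	f.Fuzz(func(t *testing.T, a, b, c float64) {
		state := 1.0
		for _, rps := range []float64{a, b, c} { // multiple "rounds" of updates
			if math.IsNaN(rps) || math.IsInf(rps, 0) || rps < 0 {
				continue // the fix: reject bad inputs up-front
			}
			// stand-in for the real weighted-update math
			state = (state + rps) / 2
			if math.IsNaN(state) || math.IsInf(state, 0) {
				t.Fatalf("bad value after update with %v: %v", rps, state)
			}
		}
	})
}
```

Run it with something like `go test -fuzz=FuzzMultiRoundUpdates -fuzztime=1h`; failing inputs land in `testdata/fuzz/` and can be committed so they become permanent regression seeds.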