Return nil for strange FT 500 errors #12948
Merged
Summary
On occasion, the `v0/feature_toggles` endpoint returns 500 errors in large groups. See this example of 368 errors. The errors occur at a random cadence that doesn't match any specific pattern (we've checked RDS connectivity issues, networking, timeouts, etc.).
I added some logging around the issue. It shows that `session[:token]` is ALWAYS empty when the error occurs, and when we do the `Session.find(session[:token])`, it returns 1 for some reason. It's possible this is a bug in the underlying Redis Namespace gem, etc.
The conditional added here checks whether the expected response is not a hash AND `session[:token]` is empty; if both are true, it returns nil. This covers the case that we are experiencing.
After some investigation, this has been occurring for 7+ months according to Sentry (see comment in ticket here), so it doesn't appear to be an issue that was introduced during the EKS rollout.
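For readers who want to see the shape of the guard, here is a minimal sketch in plain Ruby. It is not the diff from this PR: the method name `guard_session_lookup` and its two arguments are hypothetical stand-ins for `session[:token]` and whatever the session lookup returned, and the real change lives wherever the lookup result is parsed.

```ruby
# Hypothetical helper, for illustration only (not the code merged in this PR).
# token         - stand-in for session[:token]
# lookup_result - stand-in for the value the session lookup produced
def guard_session_lookup(token, lookup_result)
  # Treat nil, "" and whitespace-only strings as an empty token.
  token_empty = token.nil? || token.to_s.strip.empty?

  # The case described above: a non-hash result (e.g. 1) combined with an
  # empty token. Return nil so the caller sees "no session" instead of a 500.
  return nil if !lookup_result.is_a?(Hash) && token_empty

  # Every other combination is passed through unchanged.
  lookup_result
end
```

Both conditions are required on purpose: a non-hash result with a non-empty token would be a different failure, and this sketch leaves it visible rather than swallowing it.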
Related issue(s)
https://app.zenhub.com/workspaces/platform-tech-team-1-633af4074573d06c3cda142a/issues/gh/department-of-veterans-affairs/va.gov-team/57316
Research
It's also possible that this is somehow related to the session reset (maybe a race condition where the session is being reset while the feature_toggles endpoint is hit at the same time?). During this example where FT errors are seen, there are MANY reset_session logs.