Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle "FatalError" PUSH errors #1184

Closed
vladsud opened this issue Feb 6, 2020 · 4 comments · Fixed by #1231
Closed

Handle "FatalError" PUSH errors #1184

vladsud opened this issue Feb 6, 2020 · 4 comments · Fixed by #1231
Assignees

Comments

@vladsud
Copy link
Contributor

vladsud commented Feb 6, 2020

New class of disconnection errors exists that we are not taking proper care - shows up as "Disconnect: Error: FatalError" reason for fluid:telemetry:Container:ConnectionStateChange_Disconnected

Per Gary, this happens when PUSH can't persist ops to storage.
While this is likely broad category, right now all issues like that happened due to SPO losing ops.
So the right approach is

  1. Make this critical error in SPO driver (flip canRetry to true where we treat "server_disconnect")
  2. Add new type of error (i.e. make it typed - see IError & specific types like IThrottlingError), and work with Scriptor to handle it properly.

@GaryWilber - FYI

@vladsud
Copy link
Contributor Author

vladsud commented Feb 6, 2020

@GaryWilber , is there any way to have unique code for that issue? I think when I looked at telemetry I did not see any, but I might be wrong. Currently, we do not make decisions based on strings, only based on error code, it would be great to keep that.

@GaryWilber
Copy link
Contributor

The code we return for FatalError is 500. You should use that to make decisions.

@vladsud
Copy link
Contributor Author

vladsud commented Feb 6, 2020

Is this the only cause for 500? i.e. does 500 always mean FatalError?
Wes - it might be worth to build a "validation" layer, where we have both error code & string and validate (telemetry only) that our expectations are matching, but use status code for all decision making

@GaryWilber
Copy link
Contributor

Yea it's the only cause for 500.

This was referenced Nov 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants