Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[docs] Add guidance on ENVOY_BUG in STYLE.md #14575

Merged
merged 8 commits into from
Jan 21, 2021
Merged
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
101 changes: 72 additions & 29 deletions STYLE.md
Original file line number Diff line number Diff line change
Expand Up @@ -106,23 +106,31 @@ A few general notes on our error handling philosophy:

* All error code returns should be checked.
* At a very high level, our philosophy is that errors should be handled gracefully when caused by:
- Untrusted network traffic OR
- Untrusted network traffic (from downstream, upstream, or extensions like filters)
- Raised by the Envoy process environment and are *likely* to happen
- Third party dependency return codes
* Examples of likely environnmental errors include any type of network error, disk IO error, bad
data returned by an API call, bad data read from runtime files, etc. Errors in the Envoy
environment that are *unlikely* to happen after process initialization, should lead to process
death, under the assumption that the additional burden of defensive coding and testing is not an
effective use of time for an error that should not happen given proper system setup. Examples of
these types of errors include not being able to open the shared memory region, system calls that
should not fail assuming correct parameters (which should be validated via tests), etc. Examples
of system calls that should not fail when passed valid parameters include the kernel returning a
valid `sockaddr` after a successful call to `accept()`, `pthread_create()`, `pthread_join()`, etc.
* OOM events (both memory and FDs) are considered fatal crashing errors. An OOM error should never
silently be ignored and should crash the process either via the C++ allocation error exception, an
explicit `RELEASE_ASSERT` following a third party library call, or an obvious crash on a subsequent
line via null pointer dereference. This rule is again based on the philosophy that the engineering
costs of properly handling these cases are not worth it. Time is better spent designing proper system
controls that shed load if resource usage becomes too high, etc.
data returned by an API call, bad data read from runtime files, etc. This includes configuration
updates after startup.
* Third party dependency return codes should be checked and gracefully handled. Examples include
HTTP/2 or JSON parsers. Some return codes may be handled by continuing, for example, in case of an
out of process RPC failure.
* Errors in the Envoy environment that are *unlikely* to happen after process initialization, should
lead to process death, under the assumption that the additional burden of defensive coding and
testing is not an effective use of time for an error that should not happen given proper system
setup. Examples of these types of errors include not being able to open the shared memory region,
system calls that should not fail assuming correct parameters (which should be validated via
tests), etc. Examples of system calls that should not fail when passed valid parameters include
the kernel returning a valid `sockaddr` after a successful call to `accept()`, `pthread_create()`,
`pthread_join()`, etc. However, system calls that require permissions may cause likely errors in
some deployments and need graceful error handling.
* OOM events (both memory and FDs) or ENOMON errors are considered fatal crashing errors. An OOM
asraa marked this conversation as resolved.
Show resolved Hide resolved
error should never silently be ignored and should crash the process either via the C++ allocation
error exception, an explicit `RELEASE_ASSERT` following a third party library call, or an obvious
crash on a subsequent line via null pointer dereference. This rule is again based on the
philosophy that the engineering costs of properly handling these cases are not worth it. Time is
better spent designing proper system controls that shed load if resource usage becomes too high,
etc.
* The "less is more" error handling philosophy described in the previous two points is primarily
asraa marked this conversation as resolved.
Show resolved Hide resolved
based on the fact that restarts are designed to be fast, reliable and cheap.
* Although we strongly recommend that any type of startup error leads to a fatal error, since this
Expand All @@ -148,20 +156,55 @@ A few general notes on our error handling philosophy:
continue seems ridiculous because *"this should never happen"*, it's a very good indication that
the appropriate behavior is to terminate the process and not handle the error. When in doubt,
please discuss.
* Per above it's acceptable to turn failures into crash semantics
via `RELEASE_ASSERT(condition)` or `PANIC(message)` if there is no other sensible behavior, e.g.
in OOM (memory/FD) scenarios. Only `RELEASE_ASSERT(condition)` should be used to validate
conditions that might be imposed by the external environment. `ASSERT(condition)` should be used
to document (and check in debug-only builds) program invariants. Use `ASSERT` liberally, but do
not use it for things that will crash in an obvious way in a subsequent line. E.g., do not do
`ASSERT(foo != nullptr); foo->doSomething();`. Note that there is a gray line between external
environment failures and program invariant violations. For example, memory corruption due to a
security issue (a bug, deliberate buffer overflow etc.) might manifest as a violation of program
invariants or as a detectable condition in the external environment (e.g. some library returning a
highly unexpected error code or buffer contents). Unfortunately no rule can cleanly cover when to
use `RELEASE_ASSERT` vs. `ASSERT`. In general we view `ASSERT` as the common case and
`RELEASE_ASSERT` as the uncommon case, but experience and judgment may dictate a particular approach
depending on the situation.

# Macro Usage

* The following macros are available:
- `RELEASE_ASSERT`: fatal check.
- `ASSERT`: fatal check in debug-only builds. These should be used to document (and check in
debug-only builds) program invariants.
- `ENVOY_BUG`: logs and increments a stat in release mode, fatal check in debug builds. These
should be used where it may be useful to detect if an efficient condition is violated in
production (and check in debug-only builds).
asraa marked this conversation as resolved.
Show resolved Hide resolved

* Sub-macros alias the macros above and can be used to annotate specific situations:
- `ENVOY_BUG_ALPHA` (alias `ENVOY_BUG`): Used for alpha or rapidly changing protocols that need
detectability on probable or changing invariants.
asraa marked this conversation as resolved.
Show resolved Hide resolved

* Per above it's acceptable to turn failures into crash semantics via `RELEASE_ASSERT(condition)` or
`PANIC(message)` if there is no other sensible behavior, e.g. in OOM (memory/FD) scenarios.
* Only `RELEASE_ASSERT(condition)` should be used to validate conditions that might be imposed by
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think making this more precise would make sense, since the goal of this PR is to avoid any ambiguity around things like "external environment" (which is kind of clarified above but lost to the reader by this point).

the external environment.
* Annotate macro usage with comments on belief (probable or provable condition).
asraa marked this conversation as resolved.
Show resolved Hide resolved
* Use `ASSERT` and `ENVOY_BUG` liberally, but do not use it for things that will crash in an obvious
asraa marked this conversation as resolved.
Show resolved Hide resolved
way in a subsequent line. E.g., do not do `ASSERT(foo != nullptr); foo->doSomething();`.
* `ENVOY_BUG`s provide detectability and more confidence than an `ASSERT`. They are useful for
non-trivial conditions, those with complex control flow, and rapidly changing protocols. Testing
should be added to ensure that Envoy can continue to operate even if an `ENVOY_BUG` condition is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can enforce this in CI. I.e. in a coverage-ish job, ensure that every ENVOY_BUG fires at least once across all tests. I think this is the only way we will get reliable enforcement of this requirement. One thing to keep in mind is that this is going to force gymnastics in testing; e.g. we will need things like testing peers to be able to force invariant violations, as ENVOY_BUG covers things that are usually beneath the intentionally exposed API surface.

Copy link
Contributor Author

@asraa asraa Jan 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I see -- these places are going to be difficult to test by design.
My other concern is that if we can't use llvm-cov, it may be difficult to design a job that does it. My only idea is that we grep for the error string in an ENVOY_BUG and check that it appears in a test file as a log/crash message.
#14766

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, fine to make this a followup. I think if we don't have some enforcement here, we are going to reach an end state in which some reasonable fraction of ENVOY_BUG that should have tests are missing this.

violated.
htuch marked this conversation as resolved.
Show resolved Hide resolved
* `ASSERT`s should be understandable to a reader. Add comments if not. They should be robust to
future changes.
* Note that there is a gray line between external environment failures and program invariant
violations. For example, memory corruption due to a security issue (a bug, deliberate buffer
overflow etc.) might manifest as a violation of program invariants or as a detectable condition in
the external environment (e.g. some library returning a highly unexpected error code or buffer
contents). Unfortunately no rule can cleanly cover when to use `RELEASE_ASSERT` vs. `ASSERT`. In
general we view `ASSERT` as the common case and `RELEASE_ASSERT` as the uncommon case, but
experience and judgment may dictate a particular approach depending on the situation.

Below is a guideline for macro usage. The left side of the table has invariants and the right side
has error conditions that can be triggered and should be gracefully handled. `ENVOY_BUG` represents
a middle ground that can be used for uncertain conditions that need detectability. `ENVOY_BUG`s can
also be added for errors if they warrant detection.

| `ASSERT`/`RELEASE_ASSERT` | `ENVOY_BUG` | Error handling and Testing |
| --- | --- | --- |
| Low level invariants in data structures | | |
| Simple, provable internal class invariants | Complex, uncertain internal class invariants (e.g. need detectability if violated) | |
| Provable (pre/post)-conditions | Complicated but likely (pre-/post-) conditions that are low-risk (Envoy can continue safely) | Triggerable or uncertain conditions, may be based on untrusted data plane traffic or an extensions’ contract. |
| | Conditions in alpha or changing extensions that need detectability. (`ENVOY_BUG_ALPHA`) | |
| Unlikely environment errors after process initialization that would otherwise crash | | Likely environment errors, e.g. return codes from untrusted extensions, dependencies or system calls, network error, bad data read, permission based errors, etc. |
| Fatal crashing events. e.g. OOMs, deadlocks, no process recovery possible | | |

# Hermetic and deterministic tests

Expand Down