We think about software and its testing in terms of "proximity to the consumer": a function of how likely it is that the consumer will see a given behavior. The closest and most visible layer to the consumer is the literal API definition. Beneath this surface are behaviors, which often cannot be enumerated in full. So, we work backwards from Intended Patterns, running through a checklist of potential permutations and interactions:
- Intended Patterns
- Best Practices
- Plausible Usage
- Common Edge Cases
- Uncommon Edge Cases
- Exhaustive Surface Area
- Internal Unit Behavior (theoretically hidden/encapsulated behavior)
At the end of this exercise, because the space is never fully enumerable, we end up with tests that fall into two broad categories:
- Defined Behavior Tests
- Internal Unit Tests
We treat Defined Behavior Tests as the library's behavioral contract. They document, demonstrate, and stabilize library behaviors that are explicitly intended and expected to be used by consumers. Barring critical oversights, security issues, or other extenuating circumstances, these tests MUST NOT change or be removed within a major version. If changed or removed, a new major version (aka. "major version bump") is generally REQUIRED.
Tests outside of the "Defined Behavior" boundary can be considered "Internal". These tests will be copious and treated far less stringently. They can generally be thought of as "development aids": they're produced during development to sanity check internal units and "loosely clamp" down on internal library behaviors. They can signal when you might have broken something, but they are ultimately recyclable. Their failures DO NOT strictly indicate a broken state, and they SHOULD NOT be used to reverse engineer defined behavior.
We will accumulate all Defined Behavior Testing in the integration-tests package. Testing directories within integration-tests will be structured and treated with a level of rigor commensurate with consumer proximity. (Normal unit testing can still occur locally in each package.)
- __tests__
  - defined-behavior — Everything in this folder falls into the "Defined Behavior" category. Changing or removing tests in this folder generally requires a major version bump and must be approved by a PM or admin.
    - 1-patterns — Tests showing examples of the core coding patterns and "best practices" we would recommend in documentation and GitHub issues. Mostly hand authored, with the exception of API documentation generation/extraction. Protected by PM, SDM, and SDE Admin approval. Treated like design. Tests look as much like customer app code as possible. In general, seek abundant approval for changes here. When possible, submit tests as failing, skipped tests first. Implementation PRs that then un-skip these tests can be less rigorous. These tests should be optimized above all else to represent code a customer would write (a sketch follows below).
    - 2-expected-use — Tests showing expected usage variations, edge cases, and error handling that doesn't fall into normal error handling "patterns". Hand-authored in general. Should ideally look like customer app code. Snippets from these tests may be used in GitHub issue responses or referred to directly when answering questions around defined behavioral edge cases. Protected by PM + SDE Admin approval. These tests can cover edges, but should still be optimized to look like copy/paste-able customer code.
    - 3-exhaustive — The extrapolation of all defined behavioral space within reason. Could be hand authored. But, in general, these tests could and should be programmatically generated according to input dimensions and combinations thereof where cross-dimensional interactions are expected. Protected by SDE Admin. These tests are optimized for high coverage; resemblance to customer code is completely optional.
  - internal — Tests used primarily for development and debugging. Default PR ownership and validation. Not trash, but recyclable. ♻️
  - Everything else — Leftovers from before we implemented this testing structure. We will extract defined behavior and sort the remainder into the internal folder over time.
(Folders are prefixed numerically to keep folder sorting consistent with "proximity to the consumer".)
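To make the "looks like customer app code" bar in 1-patterns concrete, here is a rough sketch of what such a test might look like. Everything named here is a hypothetical stand-in (the package, createClient, and listItems are not our actual API); the point is the shape: a plain Jest test whose body reads like documentation, checked in as a skipped test until the implementation PR un-skips it.

```ts
// Hypothetical names throughout: 'my-library', createClient, and listItems
// are stand-ins, not the real package surface.
import { createClient } from 'my-library';

// Checked in as `describe.skip` alongside the design review;
// the implementation PR later flips it to `describe`.
describe.skip('patterns: listing items', () => {
  test('a consumer can list items with default options', async () => {
    // The body should read like code we would paste into docs or a GitHub issue.
    const client = createClient({ region: 'us-east-1' });

    const items = await client.listItems();

    expect(Array.isArray(items)).toBe(true);
  });
});
```

If a test like this needs to reach past the public surface to make its assertions, that's a signal it belongs under internal instead.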
When decomposing a defined behavior at any level, here's the checklist of input dimensions you should be prepared to claim you went through. 😅
- Argument permutations
- Module State (App, library, class, closure...)
- Temporal Coupling
- Dependencies (imports)
- Repetition / Idempotency
- Network conditions
- File inputs
- CLI inputs
- Environment variables
- AWS Region
- Timing
- Platform
  - OS
    - Android
    - iOS
    - Windows
    - Mac
    - Linux (+flavor)
    - ChromiumOS
  - Web (FF, Safari, Chrome, Edge, etc.)
  - Runtime (e.g., node 12, 14, 16, 18 ...)
- Field type
  - string
  - id
  - integer
  - float
  - enum
  - custom type
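To illustrate how 3-exhaustive tests might be generated from dimensions like these, here is a minimal sketch using Jest's test.each over a crossed subset of dimensions. Everything named here (the package, saveField, and the specific dimension values) is a hypothetical placeholder, not defined behavior.

```ts
// Hypothetical sketch: 'my-library' and saveField are stand-ins for the operation under test.
import { saveField } from 'my-library';

const fieldTypes = ['string', 'id', 'integer', 'float', 'enum', 'custom'] as const;
const repetitions = [1, 2] as const;                 // repetition / idempotency dimension
const regions = ['us-east-1', 'eu-west-1'] as const; // AWS Region dimension

// Cross only the dimensions where cross-dimensional interactions are expected.
const cases = fieldTypes.flatMap((fieldType) =>
  repetitions.flatMap((times) =>
    regions.map((region) => ({ fieldType, times, region })),
  ),
);

describe('exhaustive: saveField across field type x repetition x region', () => {
  test.each(cases)(
    '$fieldType saved $times time(s) in $region',
    async ({ fieldType, times, region }) => {
      let result;
      for (let i = 0; i < times; i++) {
        result = await saveField({ type: fieldType, region });
      }
      // Placeholder assertion; generated tests would assert (or snapshot)
      // the defined behavior for each combination.
      expect(result).toBeDefined();
    },
  );
});
```

Whether these live as generated source files or as runtime loops like this is a judgment call; the key is that the dimension list, not hand enumeration, drives coverage.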
For each dimension on the list, assess what behaviors the dimension exposes.
- Are the exposed behaviors likely?
- If so:
  - Make a testing plan.
  - Execute the testing plan.
  - Explain why the tests are sufficient.
- If not:
  - Try to test it anyway.
  - Explain why it's unlikely and/or why it doesn't need tests.
Finally, if you think of any dimensions missing from the list, add them. If you can make a case for removing a dimension because it "never affects behavior", remove it from the list with a justification.