
[RFC 0119] Formalize testing for nixpkgs packages #119

Merged Jun 30, 2022 (14 commits)
rfcs/0119-testing-conventions.md (105 additions, 0 deletions)
---
feature: Defined conventions around testing of official Nixpkgs packages.
start-date: 2021-12-29
author: Jonathan Ringer
co-authors:
shepherd-team: @Mic92 @Artturin @kevincox
shepherd-leader: @Mic92
related-issues:
- [RFC 0088 - Nixpkgs Breaking Change Policy](https://github.com/NixOS/rfcs/pull/88)
---

# Summary
[summary]: #summary

When updating or modifying packages, several conventions for testing regressions
have been adopted. However, these practices are not standardized, and it is generally
not well defined how each testing method should be implemented. It would be beneficial to have
an unambiguous way to say that a given package, and all of its downstream dependencies, have
had as many automated tests run as possible. This gives a high degree of confidence that
a given change is unlikely to introduce regressions once it lands on a release
channel.

Another goal of this RFC is to give the various review tools
(e.g. ofborg, hydra, nixpkgs-review) a standard way to determine whether a
package has additional tests which can help verify its correctness.

# Motivation
[motivation]: #motivation

Breakages are a constant pain point for nixpkgs. It is a very poor user experience to
have a configuration broken because one or more packages fail to build. Often when
these breakages occur, it is because the change had a large impact on the entirety
of nixpkgs; and unless there's a dedicated hydra jobset for the pull request, it's
infeasible to expect pull request authors to verify every package affected
by a change they are proposing. However, it is feasible to specify packages that
are very likely to be affected by changes in another package, and use this information
to help mitigate regressions from appearing in release channels.

# Detailed design
[design]: #detailed-design
Member

Maybe expand on what the intended changes to the status quo are?

passthru.tests is a name documented in the manual; however, nixosTests are also recommended to be put there.

(also, if sorting by resource consumption, maybe this split is not needed?)

Are we encouraged to compromise on something in the name of more test coverage?

Contributor Author

Well, I wasn't sure what the status quo should be. My current thoughts are, "here is some additional metadata you can add to ensure that people know how your package may break. Or add your package to the tests of other packages to ensure it's not broken."


Member
@Artturin Apr 4, 2022

Recently people have started adding sensitive downstream dependencies to passthru.tests.
Example: https://github.com/NixOS/nixpkgs/pull/167092/files

Cc @risicle

I propose adding a new section called testReverseDeps or so

Contributor

Think this technique was @jonringer's idea in the first place 😉

The main thing I'm slightly uncomfortable with is adding the reverse dependencies as arguments - I imagine it causing weirdness for people who do a lot of overriding. But I can't think of a sensible way it could be avoided.

Contributor Author

Ideally the benefit is for nixpkgs testing. For the overriding case, an override should not trigger a downstream dependency to influence the build, as passthru gets pruned before instantiation.

I agree it's odd that a package is aware of its downstream dependencies, but I'm not sure of another way of having a package be aware of them.
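
To illustrate the pattern under discussion, here is a minimal, hypothetical sketch (the names `somelib` and `someConsumer` are made up and not existing nixpkgs attributes) of a library taking one sensitive reverse dependency as an argument purely to re-expose it under `passthru.tests`:

```nix
{ lib, stdenv, fetchurl, someConsumer }:

stdenv.mkDerivation rec {
  pname = "somelib";
  version = "1.0";

  src = fetchurl {
    url = "https://example.org/somelib-${version}.tar.gz";
    sha256 = lib.fakeSha256; # placeholder
  };

  # passthru is pruned before instantiation, so overriding somelib does not
  # pull someConsumer into somelib's own build; the consumer is only built
  # when somebody asks for somelib.tests.
  passthru.tests = {
    inherit someConsumer;
  };
}
```

With this wiring, `nix-build -A somelib.tests` rebuilds the listed consumer against the changed library, which is the signal a reviewer wants.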

Member

I imagine it causing weirdness for people who do a lot of overriding.

A potential mitigation is to source these from a pkgs argument, so they're all bundled up. Not sure if there's a problem in the first place though.

Contributor Author

Usage of pkgs is frowned upon in nixpkgs from a hygiene perspective; I would rather not involve it.

Member

nixpkgs-review builds all reverse dependencies. Could we just use this mechanism in a testers.testBuildReverseDeps? But there can be thousands 😬

cc @Mic92

Standardize `tests.<name>` (commonly set with the `passthru.tests.<name>` argument to `mkDerivation`) as a mechanism for
more expensive but automatic testing in nixpkgs, and encourage the usage of
`checkPhase` or `installCheckPhase` when packaging within nixpkgs.
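
As a rough sketch of that convention (the package `mypkg` and its NixOS test are hypothetical names; the wiring mirrors how existing packages combine `testers.testVersion` and `nixosTests`):

```nix
{ lib, stdenv, fetchurl, nixosTests, testers, mypkg }:

stdenv.mkDerivation rec {
  pname = "mypkg";
  version = "1.2.3";

  src = fetchurl {
    url = "https://example.org/mypkg-${version}.tar.gz";
    sha256 = lib.fakeSha256; # placeholder
  };

  passthru.tests = {
    # cheap smoke test: checks that `mypkg --version` reports the right version
    version = testers.testVersion { package = mypkg; };
    # heavier integration test, built only when the tests are requested
    nixos = nixosTests.mypkg;
  };
}
```

A reviewer can then run `nix-build -A mypkg.tests` and build exactly what the maintainer considers the relevant regression surface.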

Criteria for `tests.<name>`:
Member

tests.<name> suggests the type attrsOf package, whereas in practice we make use of nix-build's attrset traversal capability to build NixOS tests. They're approximately attrset trees with derivations at the leaves; elaborated in the GitHub suggestion below.

Nested attribute sets of packages have already been implied earlier in the discussion, but haven't made it explicitly into the document yet.
Noteworthy advantages of nested attribute sets

  - easier to write than merges (//)
  - not susceptible to name collisions
  - allows the selection of groups of derivations by path
  - lazier. // forces the computation of all the attribute names, which may be slow when an expression library is involved

If you don't want to go into the technical details, I guess you could roughly summarize this as "tools should align with nix-build's behavior", but it seems like some elaboration of that would be expected for "Formalize testing for nixpkgs packages".
The following is not a complete description of nix-build and allows some divergence (in the mathematical sense as well), as I believe defaulting recurseForDerivations to false is counter-intuitive. Furthermore, nix-build's behavior can be exploited to limit the default set of tests that a user can be expected to run, whereas an asynchronous tool can run all tests.

Suggested change
Criteria for `tests.<name>`:
Criteria for `tests.<name>`:
- Each `test.<name>` should be a derivation, or an attribute tree with derivations at the leaves.
- `nix-build -A pkg.tests` only recurses into attrsets below `tests.<name>` that also have the attribute `recurseForDerivations = true`, as set by `recurseIntoAttrs`.
- Other tools may recurse into attrsets that do not define `recurseForDerivations`, but must adhere to `recurseForDerivations = false;`.
- Tools may allow the user to specify an attribute, in which case the traversal starts at that attribute, regardless of any `recurseForDerivations` in the specified attribute or its ancestry.
- Function-valued attributes should be ignored.
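
To make these rules concrete, a small sketch extending the hypothetical `mypkg` example above with one nested group (names are illustrative; `lib.recurseIntoAttrs` simply sets `recurseForDerivations = true` on the set):

```nix
passthru.tests = {
  # a derivation directly below tests.<name> is always picked up
  version = testers.testVersion { package = mypkg; };

  # nix-build -A mypkg.tests descends into this group because it is
  # marked with recurseForDerivations = true
  nixos = lib.recurseIntoAttrs {
    inherit (nixosTests) mypkg;
  };
};
```

A user can also start the traversal inside the group explicitly, e.g. `nix-build -A mypkg.tests.nixos`, regardless of the recurse flags in its ancestry.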

Contributor Author
@jonringer Jun 16, 2022

Nested attribute sets of packages have already been implied earlier in the discussion, but haven't made it explicitly into the document yet.
Noteworthy advantages of nested attribute sets

easier to write than merges (//)
not susceptible to name collisions
allows the selection of groups of derivations by path
lazier. // forces the computation of all the attribute names, which may be slow when an expression library is involved

I don't think it's going to be super common for the need to override tests in one package from another, which would require (//). The goal of this is to have a compromise between building everything, and building what's most likely to fail.

For example, if there's an update to openssl, then I would expect to run the downstream packages which have the most rebuilds (e.g. 1000+ rebuilds) and maybe some NixOS tests. But this will all be defined within openssl's expression; I can't think of a use case where I would want to (//) more attrs into the tests attrset, I would just add them directly to openssl's passthru.tests.

From a PR reviewer's standpoint, if I see that someone has a PR updating openssl, I just want to be able to do nix build -f . openssl.tests and have enough faith that it will cover the most likely regressions. If there was a regression found later, then we can add that package to tests, and improve it over time.

My other issue with nested tests is that these tests are meant to act more like a CI/CD. If any one of them fails, then that failure should be addressed. It's not going to be a "well the simple tests pass, but this one failed, I guess it's fine". A test shouldn't exist if its failure state isn't going to indicate that the changes cause regressions.

Contributor Author
@jonringer Jun 16, 2022

To be clear, I'm not opposed to nested attrsets from a technical perspective. More of a "if we get to the point where we feel it's necessary to start organizing the tests with nested attrsets", then we are probably adding a little too much complexity to the tests, and should keep the list of tests short enough that a PR reviewer can handle it with average desktop compute resources.

Similarly, I think usage of (//) is also in this domain, and there's likely too much complexity being added.

Member

I don't think it's going to be super common for the need to override tests in one package from another, which would require (//).

This is kind of what we're doing with nixosTests though. We're taking tests from one place and using them in a package.

> allows the selection of groups of derivations by path

"if we get to the point where we feel it's necessary to start organizing the tests with nested attrsets"

This was a distraction. Neither of us care about organizing the tests into clever hierarchies.

I just want to throw whatever expression in there and expect all of it to run in nix-build, ofborg, etc.

By throwing out nested attrsets we need to start caring about the hierarchy, because we will have to flatten it.
If I change a test in nixosTests from a single test to a test matrix, that should have no ill effect on the packages that reference it. If we disallow nesting, it will.

My other issue with nested tests,

Well, this is about test selection again, viewing nested attrsets as a means only to that end, which it is not.

It's really about making NixOS test linking easy and robust.


I guess I could change NixOS/nixpkgs#176557 to always return attrsOf derivation, also when no matrix is defined: { test = drv; }, and flatten the attrsets when there are multiple axes. That would make the PR a breaking change rather than a refactor with added functionality, so I'll scope that out of the PR itself. I also don't want to be associated with such a needless breaking change.

Honestly though, I think it's a bad choice, as we're painting ourselves into a corner with this unnecessary restriction. Our tooling has no problem dealing with nested attrsets...

lazier. // forces the computation of all the attribute names, which may be slow when an expression library is involved

I've seen a developer question his life choices when iterating on expressions that had a false dependency on an expensive attrset's keys. The expression they iterated on had no dependency on those keys, but they had to be computed because they //-ed it into their packages attrset. This added 30 seconds to their cycle time, making unnecessary strictness take up the majority of their cycle time. They'd been passionate about Nix and they had plenty of experience, but weren't familiar enough with the evaluator's internals to figure out that // was ruining their workflow.

Please reconsider flattening.

Contributor Author

If you were just concerned about the long-term maintenance of a package, what would you want the "API" to be for its test suite?

I'm not advocating that nested attrsets should be disallowed; I'm saying they are a "code smell". Since nested attrsets already exist in nixosTests, I think it's fine for that to carry over. And having a large test suite in nixosTests should probably not be frowned upon if the tests provide value.

I am saying that (//) and large test suites should be avoided, because the desire of this RFC is to make PR acceptance more automated, and large testing requirements will get in the way of that.

- Running tests which include downstream dependencies.
  - This avoids cyclic dependency issues for test suites.
- Running lengthy or more resource expensive tests.
  - There should be a priority on making package builds as short as possible.
Member

I hear you! 😄

Contributor Author

I think I originally wrote this about sage, and sageWithTests, which took even longer :)

Member

(looks sadly at NixOS/nix#3600 being unmerged) although for a given situation with a package, speeding up separate tests on their own would also not hurt

    - This reduces the amount of compute required for everyone reviewing, building, or iterating on packages.
- Referencing downstream dependencies which are most likely to experience regressions.
  - Most applicable to [RFC 0088 - Nixpkgs Breaking Change Policy](https://github.com/NixOS/rfcs/pull/88),
    as this will help define what breakages a pull request author should take ownership of.
Member

Is this the only place corresponding to «tests have to pass»?

(BTW what about a test removal procedure? Explicit maintainer approval?)

Contributor Author

In the future work section, there are mentions of adding it to the PR template, and it's already part of ofborg.

Member

I think all of this is vague enough that it remains unclear whether the RFC establishes a norm that tests should be run completely, and have to pass (hmmm, what about platform discrepancies…)

Contributor Author

This is more meant to set the expectation. There's already somewhat of a convention, and I would like for that to be more explicit and expected.

nixpkgs-review goes a long way to finding regressions, but the problem is totality. For some changes, there may only be a few packages which may realistically be affected by the change, so I don't want to build 500+ packages to review the build implications of a change; I just want to build the relevant package and the few direct downstream dependencies. Even better if there's a test case for the package itself.

Member
@7c6f434c May 2, 2022

So basically you do not want to implement something with the full extent of the point from the meeting notes?

Tests of the updated package should become mandatory for acceptance of a PR

(Here my question can be treated as procedural: for the text «updated according to the meeting notes» I am fine both with the text that clearly claims what the point in the meeting notes claims, or with an explicit comment in the discussion that the meeting note point as written is too far-reaching, maybe because it is too brief of a summary of an agreed position; I just want clarity what interpretation of RFC gets covered by the eventual decision)

Contributor Author

So basically you do not want to implement something with the full extent of the point from the meeting notes?

ofborg already runs passthru.tests if the commit message is formatted correctly. I think I'm misunderstanding what you're trying to say. My opinion is that the existing process already enforces the opt-in testing behavior. The main issue right now is that the usage of passthru.tests is the exception, not the norm.

BTW what about a test removal procedure?

This is a non-goal for the RFC. Personally, I believe this should be left up to the maintainer(s) to decide. The additional tests should be providing value to the review process; if they don't, they should probably be removed, but this can be decided by the maintainers.

Explicit maintainer approval?

Also a non-goal. Just concerned with the ability to test a PR, which may aid in the decision-making process to merge a PR.

Member

Well, do we ask that tests are run or that tests pass fully at least on one platform (maybe not the first try etc.…)?

Contributor Author
@jonringer May 2, 2022

Well, do we ask that tests are run or that tests pass fully at least on one platform (maybe not the first try etc.…)?

I think the tests should be run; and if any fail, it should be an indicator that the package is unhealthy in the context of the PR

Member

I agree with the position, and I think it could be made less vague in the RFC. Maybe put that statement that «tests should be run and a failing test is an indicator that the package is unhealthy in the context of the PR.» as a top-level statement in the detailed design?

- Running integration tests (E.g. nixosTests)
- Tests which have heavy usage or platform requirements should add the appropriate systemFeature
  - E.g. `nixos-test` `kvm` `big-parallel`
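
A hedged sketch of that last point, reusing the hypothetical `mypkg` from above and assuming `runCommand` is taken from the package's arguments (the `--self-test` flag is made up; the relevant part is `requiredSystemFeatures`, which lets schedulers route the test to capable builders, while NixOS tests already set features such as `kvm` through the test framework):

```nix
passthru.tests.stress = runCommand "mypkg-stress-test" {
  nativeBuildInputs = [ mypkg ];
  # only schedule on builders that advertise this system feature
  requiredSystemFeatures = [ "big-parallel" ];
} ''
  # hypothetical long-running self check
  mypkg --self-test --jobs "$NIX_BUILD_CORES"
  touch $out
'';
```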

Usage for mkDerivation's `checkPhase`:
- Quick "cheap" tests, which run unit tests and maybe some additional scenarios.
Member

Not all test suites are quick or cheap, but running them should be a priority over quickness. If we can make running them in a separate derivation easy, that's worth considering, but it seems that the human overhead would not be worth it in the general case.
A lot could factor into this, so I think we should make this less prescriptive.

Suggested change
- Quick "cheap" tests, which run unit tests and maybe some additional scenarios.
- Preferably quick "cheap" tests, which run unit tests and maybe some additional scenarios.

- Since this contributes to the "build time" of a package, there should be some
emphasis on ensuring this phase isn't bloated.
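
As a sketch of the intent (a fragment of `mkDerivation` arguments; the `make check` target is an assumption about a hypothetical upstream test suite):

```nix
# inside the mkDerivation argument set
doCheck = true;

checkPhase = ''
  runHook preCheck
  # keep this phase cheap: upstream unit tests only, no network, no heavy fixtures
  make check
  runHook postCheck
'';
```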

Usage for mkDerivation's `installCheckPhase`:
- A quick trivial example (e.g. `<command> --help`) to demonstrate that one (or more)
of the programs were linked correctly.
Comment on lines +65 to +66
Member

There is also testVersion introduced in NixOS/nixpkgs#121896.


- Assert behavior post installation (e.g. python's native extensions only get installed
Contributor

The "only" in this sentence is confusing me.

Contributor Author

Native extensions only get installed. However, most test suites will consume the code in the build directory. So tests will fail because the compiled extensions will not be present.

I'm not sure how to word this better.

Member

Maybe start with the phrase about the build directory?

and are not present in a build directory)
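
A minimal sketch of such a post-installation smoke test (again a fragment of `mkDerivation` arguments; `mypkg` is a hypothetical binary name):

```nix
# inside the mkDerivation argument set
doInstallCheck = true;

installCheckPhase = ''
  runHook preInstallCheck
  # exercise the installed output in $out, not the build tree
  $out/bin/mypkg --help > /dev/null
  runHook postInstallCheck
'';
```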

Member

Some programs are hard to test automatically, so how about creating a new meta attribute like testingInstructions?

Contributor
@kevincox Apr 16, 2022

I strongly oppose this idea.

IIUC the goal of this RFC is to make it easier for changes to upstream packages to be tested. The end goal is that we have automatic tooling that can test packages, notify maintainers of breakages, and eventually mark packages as broken if the maintainers are unable to fix them in time. Adding required manual testing puts unacceptable levels of burden on core package maintainers (that are depended on by hundreds or thousands of other packages).

I think a testingInstructions attribute may be an interesting and useful idea, but I think it would serve a different purpose than the formalized testing specified by this RFC. If you want to create a different RFC for an informational attribute I would support it.

TL;DR I don't want to require people to manually test loads of packages; if you want your package not to break due to changes in dependencies, you need automated tests.

Member

I guess testingInstructions would fit here if it came with «replacing this with a test that looks robust will be accepted even if the test implementation is ugly», but I do not believe such a commitment would be accepted into the attribute semantics…

Member

Perhaps we could record merge conditions instead?
Some packages don't need manual testing, so all that's needed for a merge is a review of the changelog, but this information has not been recorded yet.

Member

I would like to have a CONTRIBUTING.md in a package's directory with documentation for package maintainers and contributors, like how to test, how to update...
In addition to that, a README.md with documentation for users.

# Drawbacks
[drawbacks]: #drawbacks

None? This is opt-in behavior for package maintainers.

# Alternatives
[alternatives]: #alternatives
Contributor

One alternative is that we consider all dependent packages as tests. We can have dependent packages that are just tests, for example a testFoobar package to test foobar.

Then a PR author would be responsible for ensuring that all dependents build (aka pass) or are marked broken.

The obvious issue here is that for packages with lots of dependents it becomes infeasible for the average author to run a tool that builds everything and marks failures as broken. I think it is worth mentioning this alternative because this RFC demonstrates a clean way to define an appropriate sample size. Then it is expected that nixpkgs-provided build resources can be used for the full build + mark as broken.

Contributor Author

Yes, and this is the compromise with passthru.tests listing downstream dependencies. The idea is to list packages which are "likely to break with breaking changes".

For example, some packages may make use of many of systemd's features, however, other packages only really use libudev, which is much more stable. We could probably forego the libudev packages, and just list the packages which are using systemd's more advanced features.


Continue to use current ad-hoc conventions.

# Unresolved questions
[unresolved]: #unresolved-questions

How far should testing go?
- What constitutes "enough testing" of a package before a change is merged?

Hydra: How would this look for hydra adoption and hydraChecks?

# Future work
[future]: #future-work

One problem with onboarding more tests to the current nixpkgs CI and processes is the increased
need for compute, storage, and RAM resources. Therefore, future work should
take into consideration how much testing is feasible for a given change.

Onboarding of CI tools to support testing paradigms:
- nixpkgs-review
  - Run `<package>.tests` on affected packages
  - Allow for filtering based upon requiredSystemFeatures
- ofborg
  - Testing of `<package>.tests` is already done.

Nixpkgs:
- Add existing nixosTests to related packages
- Update testing clause on PR template
- Update contributing documentation