adds upstream testing development document #404
base: main
Conversation
@JamesKunstle can you fix the markdown linting? Can run …
Force-pushed from fc72782 to f8c0843
docs/upstream-testing-plan.md (outdated)

```python
for variant in ["nvidia", "amd", "intel", "cpu", "mps"]:  # parallel at runner level
```
This is going to be a real challenge with runner availability to cover these on every PR.
Yeah, I figure that we'll scale out a little at a time, starting w/ CPU and nvidia, and adding runners to the matrix as possible.
comment here: there are no MPS runners to my knowledge. We would need to self-host runners for most of these scenarios outside of CPU and CUDA.
I see O(n^3) configs here. This is indeed not scalable. In some projects with similar problems that I worked on, this dilemma was solved by crafting a set of "verified configurations", defined in such a way that they cover the highest number of unique paths through the multi-dimensional feature space without blowing up the number of permutations too much.
For example, one could have:
- NVIDIA + FSDP + 8 cards
- AMD + DEEPSPEED + 2 cards
- INTEL + DEEPSPEED + 1 card
- CPU + "NONE" + No cards
- MPS + "NONE" + No cards (or should it be 1 card if we can find a runner with MPS properly working?)
The list can then be expanded as needed if we feel we want better coverage for some feature paths (e.g. test NVIDIA with both frameworks and in two card configurations, 1 vs 8).
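For illustration, a minimal sketch of how such a "verified configurations" list could be expressed as pytest parameters (the config names and the placeholder test body are assumptions, not part of the proposal):

```python
# Hypothetical sketch: a curated list of "verified configurations" used to
# parametrize smoke tests, instead of exploding the full feature matrix.
import pytest

VERIFIED_CONFIGS = [
    # (accelerator, distributed backend, number of cards)
    ("nvidia", "fsdp", 8),
    ("amd", "deepspeed", 2),
    ("intel", "deepspeed", 1),
    ("cpu", None, 0),
    ("mps", None, 0),
]

@pytest.mark.parametrize("accelerator,backend,num_cards", VERIFIED_CONFIGS)
def test_training_smoke(accelerator, backend, num_cards):
    # Placeholder body: a real test would launch a short training run for
    # this configuration and assert it completes without error.
    assert num_cards >= 0
```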
Yeah agreed, testing the entire space is intractable.
I think we can probably restrict the space to:
[variants] x [fsdp or deepspeed] x [single or multi-card]
We can assume that the topology is >=1 accelerator and >=1 node. We shouldn't be concerned with CPU or MPS right now.
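As a rough illustration of that restricted space (a sketch only; the variant names come from the pseudocode above, everything else is assumed):

```python
# Sketch of the restricted test space described above:
# [variants] x [fsdp or deepspeed] x [single or multi-card].
from itertools import product

variants = ["nvidia", "amd", "intel"]       # CPU and MPS deferred for now
backends = ["fsdp", "deepspeed"]
topologies = ["single-card", "multi-card"]  # >= 1 accelerator, >= 1 node assumed

restricted_matrix = list(product(variants, backends, topologies))
print(len(restricted_matrix))  # 3 * 2 * 2 = 12 configurations
```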
Force-pushed from f8c0843 to c405770
The current CI failure seems to be because the workflow can't access …
I am looking at the difference between https://github.com/instructlab/training/blob/main/.github/workflows/unit-tests.yaml#L12 and … Were there reasons for the unit tests to be different?
No, there wasn't a reason for the two to be different apart from just writing the unit-test workflow from scratch. The two should have the same behavior, I'll amend that. I'm confused about the …
`pull_request` runs from the context of your merge commit and `pull_request_target` runs from the context of instructlab/training. So when running from the context of your branch, it can't see the var from the instructlab/training repo.
That's super interesting, I totally missed that distinction in the docs. I'll look more closely at that. I opened a PR #411 that mirrors the …
From the docs: … My new understanding is that we have to use `pull_request_target` since we need the repo's secure resources. If we weren't using these secure resources we could use the `pull_request` trigger.
Force-pushed from c405770 to 68f34ee
That matches my understanding as well. I had originally made sure you had permissions set to `{}`, but I missed the `pull_request` vs `pull_request_target` at the top.
Here's the most informative doc IMO: …
this looks good, few comments
### Responsibilities of: Smoke Tests

The origin of the term "smoke test" comes from plumbing. Plumbers would pipe smoke, instead of water, through a completed pipe system to check that there were no leaks.
today I learned
### How often should we be running benchmark tests?

There are many options for this depending on the observed risk that new features or dependency bumps seem to bring to the project. Likely, a comprehensive benchmark test should be done as a baseline and then the test should only be repeated when new model architectures are supported, major versions of ilab SDG are released, etc.
we might want to do this more often, like on medium-to-large library changes, or upon a weekly release to ensure we didn't merge anything in the last week that broke stuff
docs/upstream-testing-plan.md (outdated)

# Testing in 'instructlab/instructlab'

We would like for this library to be a lightweight, efficient, and hackable toolkit for training "small" LLMs.
Currently, this library is used heavily by `github.com/instructlab/instructlab` and the `ilab` tool, and is tested via that project's end-to-end tests.
> `github.com/instructlab/instructlab` and the `ilab` tool

Are these not the same?
You're right, fixed.
docs/upstream-testing-plan.md (outdated)

This is one of the most common patterns in software testing and requires little introduction.
For our purposes, Unit Tests will be tests that do not require accelerated hardware (e.g. GPUs).
For us, the objective of a Unit Test is to check that an isolateable sub-system functions correctly.
A unit test, by definition, targets a unit. A (sub)system is, almost by definition, a collection of units (with their relationships).
What I'm saying is that we should have a separation of these levels, e.g.:
(1) Unit tests (2) "(Sub)System" tests (3) Integration (multi-system) tests (4) Benchmarks.
In my opinion, the instructlab project has so far implemented most of its testing at level 2+ or even 3+ only, and we have very few pure unit tests in the CLI repo, so perhaps this fact of life makes the distinction a bit fuzzy here.
(An example of a "not-really-a-unit" test in the CLI repo would be any test that calls functions of the CLI through click.Runner and observes the return code / output. These are not testing a unit (a class, a function, ...) but the whole machinery of the CLI, and should not have been called "unit", IMO.)
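To make the distinction concrete, here is a small illustrative sketch (the `greet` command and `format_greeting` helper are hypothetical, used only to contrast the two styles):

```python
# Hypothetical example contrasting a pure unit test with a CLI-level test.
import click
from click.testing import CliRunner


def format_greeting(name: str) -> str:
    return f"Hello, {name}!"


@click.command()
@click.option("--name", default="world")
def greet(name: str) -> None:
    click.echo(format_greeting(name))


def test_format_greeting_unit():
    # Pure unit test: exercises a single function with no CLI machinery.
    assert format_greeting("InstructLab") == "Hello, InstructLab!"


def test_greet_cli():
    # CLI-level test: drives the whole command through click's runner, so it
    # is closer to a (sub)system or integration test than a unit test.
    result = CliRunner().invoke(greet, ["--name", "InstructLab"])
    assert result.exit_code == 0
    assert "Hello, InstructLab!" in result.output
```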
gotcha, I incorporated that distinction
docs/upstream-testing-plan.md (outdated)

To compensate for this challenge, we propose the following three-tier testing methodology:

1. Unit tests: verification that subsystems work correctly
- Not strictly "testing", but a lot of real code issues, especially in the absence of more rigorous test coverage at other levels, are caught by linters / type checkers. I'd include them in the testing methodology too.
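For instance, a toy illustration (not from the document) of the kind of bug a type checker such as mypy flags without any test being written:

```python
# Toy example: mypy/pyright would flag this statically, with no test needed.
def tokens_per_gpu(total_tokens: int, num_gpus: int) -> int:
    # "/" returns a float, but the annotation promises an int.
    return total_tokens / num_gpus


print(tokens_per_gpu(1024, 8))
```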
added that
docs/upstream-testing-plan.md (outdated)

The challenge, however, is that "basic" testing in this context probably still requires accelerated hardware. We can't check that checkpointing a LoRA model while using CPU Offloading doesn't break without spinning up that exact testing scenario.

"Smoke tests" in this repo, therefore, will be defined as tests that demonstrate the functionality, though not the correctness, of features that we want to ship. This is ...
> demonstrate the functionality, though not the correctness
I'm a bit confused by the distinction. I think smoke tests are checking for correctness, just for a more limited surface / scope. They are usually almost identical to the next step of the testing ladder, just using a smaller environment / running a lower number of test cases for the most basic scenarios.
Yeah this could be worded better.
Essentially what I'm describing is that we'd be looking for "loss went down" in the test, rather than training until convergence and benchmarking.
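A minimal sketch of what a "loss went down" smoke check could look like (the `train_for_steps` helper is hypothetical; a real test would call into the training library's own entry point):

```python
# Hypothetical smoke check: run a handful of steps and assert the loss trends
# down, without training to convergence or running any benchmark evaluation.
def train_for_steps(num_steps: int) -> list[float]:
    # Stand-in for a short real training run that returns per-step loss values.
    return [2.0 - 0.1 * step for step in range(num_steps)]


def test_loss_goes_down_smoke():
    losses = train_for_steps(num_steps=10)
    first, last = losses[0], losses[-1]
    # Loose check: we only care that training is functional, not optimal.
    assert last < first, f"loss did not decrease: {first:.3f} -> {last:.3f}"
```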
Benchmark testing might look like the following:

1. Tune a few models on well-understood dataset using one or two permutations of features, unique to benchmark run.
2. Benchmark model on battery of evaluation metrics that give us the strongest signal re: model performance.
One question here is whether the instructlab/training library should depend on the instructlab/eval library for its own benchmarking, or whether a more generic library (lm-eval) should be used instead. Does the eval library bring anything unique for testing purposes?
I don't have a clear answer on this upfront. I feel like we'd want to dogfood our own lib but be flexible
### Responsibilities of: Unit Tests

This is one of the most common patterns in software testing and requires little introduction.
I'd include automated coverage targets / reports here. I know coverage targets are a bit controversial (Goodhart's law), but assuming we understand the risks, coverage reports and enforcement should help us not take our eye off the ball.
That's a good idea. We can make an issue to investigate this in the future
This library attempts to provide training support for multiple variants of accelerators, numbers of accelerators, two distributed backends, and many features. Therefore, the completeness matrix is roughly explained by the following pseudocode:

```python
...
```
I'd recommend explicitly stating that tests are written in Python using the pytest framework. I'd rather not perpetuate the "functional tests" bash shell pattern found elsewhere in this repo, if possible.
Perhaps CI folks would like to chime in here @courtneypacheco @ktdreyer
Signed-off-by: James Kunstle <jkunstle@redhat.com>
Force-pushed from 68f34ee to 46456a1
Thank you @JamesKunstle for providing this document, having an automated way of validating incoming code will be very valuable to have in this library.
After reviewing this document, I'm requesting these changes:
- Benchmarks: Measuring correctness through benchmarks will be unreliable for a training library and will require us to run many benchmarks to get a trustworthy result. For example, a model training incorrectly on AMD might score well on MMLU but still recommend that users drink and drive. Additionally, it will be very computationally expensive to run these, so the ROI here would be very low.
- Plotting loss curves: An easy way to test correctness for something complex like a training library would be to create a system in our testing where we can export and store loss data from all of the training runs we do, and then have a simple way to compare it across the matrix of configurations we've included here and against past runs we've done historically (see the sketch after this list).
This way we can make testing viable in the training library without greatly increasing the friction of adding contributions.
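A rough sketch of what such a loss-curve regression check might look like (the stored-reference file layout, helper names, and tolerance are all assumptions for illustration):

```python
# Hypothetical loss-curve regression check: compare a new run's loss history
# against a stored reference curve for the same configuration.
import json
from pathlib import Path


def load_reference_losses(config_name: str, ref_dir: Path) -> list[float]:
    # Reference curves are assumed to be stored as JSON, one file per config.
    return json.loads((ref_dir / f"{config_name}.json").read_text())


def losses_match(new: list[float], reference: list[float], tolerance: float = 0.05) -> bool:
    # Require the same number of logged steps and per-step losses within tolerance.
    if len(new) != len(reference):
        return False
    return all(abs(a - b) <= tolerance for a, b in zip(new, reference))


if __name__ == "__main__":
    reference = [2.0, 1.6, 1.3, 1.1]    # stand-in for a stored historical curve
    new_run = [2.01, 1.58, 1.33, 1.12]  # stand-in for a freshly exported curve
    print("within tolerance:", losses_match(new_run, reference))
```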
This puts the onus on the Unit Tests (and the test writer) to verify code correctness as much as they possibly can without using hardware. If something requires many invocations of smoke tests to debug, it probably wasn't sufficiently debugged during development, or is insufficiently unit tested.

Smoke tests will inevitably require a high-spec development machine to run locally. It shouldn't be acceptable for smoke tests to run for >60 minutes- we should aim for them to run in <30 minutes.
Suggested change:
- Smoke tests will inevitably require a high-spec development machine to run locally. It shouldn't be acceptable for smoke tests to run for >60 minutes- we should aim for them to run in <30 minutes.
+ Smoke tests will inevitably require a high-spec development machine to run locally. It shouldn't be acceptable for smoke tests to run for >60 minutes- we should aim for them to run in <5 minutes.
I'd like to keep the upper bound at 30 min. That's the hard cutoff; faster is always better.
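If we wanted to enforce that bound mechanically, one option (assuming the pytest-timeout plugin were adopted; it is not part of the proposal) is a per-test timeout marker:

```python
# Assumes the pytest-timeout plugin; 1800 s enforces the 30-minute hard cutoff.
import pytest


@pytest.mark.timeout(1800)
def test_smoke_training_run():
    # Placeholder: a real smoke test would launch a short training job here.
    assert True
```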
When we have to add a new accelerator variant or feature, we can plug them into their respective tier in the smoke testing hierarchy.

### Responsibilities of: Benchmark Tests
Benchmark tests are going to be far more computationally intensive than testing a training job, and are also unlikely to provide meaningful signals that a given model is learning correctly. We should not rely on these for testing training.
Yeah likely true. I can remove this / deprioritize it.
@@ -0,0 +1,114 @@

# Testing in 'instructlab/instructlab'

We would like for this library to be a lightweight, efficient, and hackable toolkit for training LLMs with SFT and RLHF.
What is meant by RLHF in this case?
"Reinforcement Learning from Human Feedback"
Adds a dev doc describing our testing strategy for this repo.