
adds upstream testing development document #404


Open: JamesKunstle wants to merge 1 commit into main from upstream-testing-proposal

Conversation

JamesKunstle (Contributor):

Adds a dev doc describing our testing strategy for this repo.

@JamesKunstle JamesKunstle added documentation Improvements or additions to documentation CI/CD Affects CI/CD configuration labels Jan 21, 2025
@JamesKunstle JamesKunstle self-assigned this Jan 21, 2025
@JamesKunstle JamesKunstle marked this pull request as ready for review January 21, 2025 00:41
@mergify mergify bot added the ci-failure label Jan 21, 2025
@nathan-weinberg
Copy link
Member

@JamesKunstle can you fix the markdown linting? You can run `make md-lint` to debug locally.

@JamesKunstle JamesKunstle force-pushed the upstream-testing-proposal branch from fc72782 to f8c0843 Compare January 21, 2025 21:40

```python
for variant in ["nvidia", "amd", "intel", "cpu", "mps"]:  # parallel at runner level
    ...  # nested loops over distributed backend and card count elided in this excerpt
```
Member:

Covering all of these on every PR is going to be a real challenge with runner availability.

Contributor Author:

Yeah, I figure we'll scale out a little at a time, starting with CPU and NVIDIA, and adding runners to the matrix as it becomes possible.

Contributor:

Comment here: there are no MPS runners to my knowledge. We would need to self-host runners for most of these scenarios outside of CPU and CUDA.

Contributor:

I see O(n^3) configs here. This is indeed not scalable. In some projects I worked on with similar problems, this dilemma was solved by crafting a set of "verified configurations", defined so that they cover the highest number of unique paths through the multi-dimensional feature space without blowing up the number of permutations.

For example, one could have:

  1. NVIDIA + FSDP + 8 cards
  2. AMD + DEEPSPEED + 2 cards
  3. INTEL + DEEPSPEED + 1 card
  4. CPU + "NONE" + No cards
  5. MPS + "NONE" + No cards (or should it be 1 card if we can find a runner with MPS properly working?)

The list can then be expanded as needed if we feel we want better coverage for some feature paths (e.g. test NVIDIA with both frameworks and in two card configurations, 1 vs 8).
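
To make that concrete, a curated list like this could be encoded as pytest parameters; a rough sketch (all names are hypothetical, not existing tests in this repo):

```python
import pytest

# Hypothetical curated "verified configurations": (accelerator, distributed backend, card count).
# The set touches every accelerator and backend at least once without enumerating all permutations.
VERIFIED_CONFIGS = [
    pytest.param("nvidia", "fsdp", 8, id="nvidia-fsdp-8"),
    pytest.param("amd", "deepspeed", 2, id="amd-deepspeed-2"),
    pytest.param("intel", "deepspeed", 1, id="intel-deepspeed-1"),
    pytest.param("cpu", "none", 0, id="cpu-none-0"),
    pytest.param("mps", "none", 0, id="mps-none-0"),
]


@pytest.mark.parametrize("accelerator,backend,num_cards", VERIFIED_CONFIGS)
def test_training_smoke(accelerator, backend, num_cards):
    """Placeholder body: each runner would execute only the IDs its hardware supports."""
    ...
```

Expanding coverage for a hot path (e.g. NVIDIA with both backends at 1 and 8 cards) then becomes a one-line addition rather than a new matrix dimension.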

Contributor Author:

Yeah agreed, testing the entire space is intractable.

I think we can probably restrict the space to:
[variants] x [fsdp or deepspeed] x [single or multi-card]

We can assume that the topology is >=1 accelerator and >=1 node. We shouldn't be concerned with CPU or MPS right now.
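
For a sense of scale, that restricted space stays small; a quick illustrative sketch (dimension names are assumptions, not the real CI matrix):

```python
from itertools import product

# Restricted matrix proposed above: accelerator variant x distributed backend x card topology.
# CPU and MPS are deferred for now, so only accelerated variants are enumerated.
variants = ["nvidia", "amd", "intel"]
backends = ["fsdp", "deepspeed"]
topologies = ["single-card", "multi-card"]

configs = list(product(variants, backends, topologies))
print(len(configs))  # 3 * 2 * 2 = 12 combinations before any further pruning
```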

@JamesKunstle JamesKunstle force-pushed the upstream-testing-proposal branch from f8c0843 to c405770 Compare January 24, 2025 21:39
@JamesKunstle JamesKunstle requested a review from danmcp January 24, 2025 21:39
@JamesKunstle (Contributor Author):

The current CI failure seems to be because the workflow can't access vars.AWS_REGION, even though the workflow itself hasn't changed in this PR and the branch is rebased onto main, for which the unit test works. I'm not sure why that's happening.
@danmcp @courtneypacheco do y'all have any guidance? I was assuming that the protected variables would be exposed to the workflows once they were merged.

@danmcp (Member) commented Jan 24, 2025:

The current CI failure seems to be because the workflow can't access vars.AWS_REGION, even though the workflow itself hasn't changed in this PR and the branch is rebased onto main, for which the unit test works. I'm not sure why that's happening. @danmcp @courtneypacheco do y'all have any guidance? I was assuming that the protected variables would be exposed to the workflows once they were merged.

I am looking at the difference between:

https://github.com/instructlab/training/blob/main/.github/workflows/unit-tests.yaml#L12

and

https://github.com/instructlab/training/blob/main/.github/workflows/e2e-nvidia-l4-x1.yml#L5

Were there reasons for the unit tests to be different?

See: https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows#pull_request_target

@JamesKunstle (Contributor Author):

No, there wasn't a reason for the two to be different, apart from my having written the unit-test workflow from scratch. The two should have the same behavior; I'll amend that.

I'm confused about the vars context availability though because this CI workflow has merged to the main branch and doesn't seem to be able to access that context.

@danmcp (Member) commented Jan 24, 2025:

No, there wasn't a reason for the two to be different, apart from my having written the unit-test workflow from scratch. The two should have the same behavior; I'll amend that.

I'm confused about the vars context availability though because this CI workflow has merged to the main branch and doesn't seem to be able to access that context.

pull_request runs from the context of your merge commit and pull_request_target runs from the context of instructlab/training. So when running from the context of your branch, it can't see the var from the instructlab/training repo.

@JamesKunstle (Contributor Author):

That's super interesting, I totally missed that distinction in the docs. I'll look more closely at that.

I opened PR #411, which mirrors the `on` and `permissions` usage of the e2e workflow in the unit-tests workflow.

@JamesKunstle (Contributor Author):

From the docs:
"For workflows that are triggered by the pull_request_target event, the GITHUB_TOKEN is granted read/write repository permission unless the permissions key is specified and the workflow can access secrets, even when it is triggered from a fork. Although the workflow runs in the context of the base of the pull request, you should make sure that you do not check out, build, or run untrusted code from the pull request with this event."

My new understanding is that we have to use pull_request_target because we are setting up a self-hosted ec2 runner and need secrets / vars context to do this. We then limit the permissions of the job that runs the untrusted code.

If we weren't using these secure resources we could use the pull_request invocation option instead.

@mergify mergify bot added ci-failure and removed ci-failure labels Jan 24, 2025
@JamesKunstle JamesKunstle force-pushed the upstream-testing-proposal branch from c405770 to 68f34ee Compare January 24, 2025 22:50
@mergify mergify bot removed the ci-failure label Jan 24, 2025
@danmcp (Member) commented Jan 24, 2025:

My new understanding is that we have to use pull_request_target because we are setting up a self-hosted ec2 runner and need secrets / vars context to do this. We then limit the permissions of the job that runs the untrusted code.

If we weren't using these secure resources we could use the pull_request invocation option instead.

That matches my understanding as well. I had originally made sure you had permissions set to {}, but I missed the pull_request vs pull_request_target at the top.

@cdoern (Contributor) left a comment:

this looks good, few comments


### Responsibilities of: Smoke Tests

The origin of the term "smoke test" comes from plumbing. Plumbers would pipe smoke, instead of water, through a completed pipe system to check that there were no leaks.
Contributor:

today I learned



### How often should we be running benchmark tests?

There are many options for this, depending on the observed risk that new features or dependency bumps bring to the project. Likely, a comprehensive benchmark test should be done once as a baseline and then repeated only when new model architectures are supported, major versions of ilab SDG are released, etc.
Contributor:

We might want to do this more often, e.g. on medium-to-large library changes, or upon a weekly release to ensure we didn't merge anything in the last week that broke things.

@mergify mergify bot added the one-approval label Apr 8, 2025
# Testing in 'instructlab/instructlab'

We would like for this library to be a lightweight, efficient, and hackable toolkit for training "small" LLMs.
Currently, this library is used heavily by `github.com/instructlab/instructlab` and the `ilab` tool, and is tested via that project's end-to-end tests.
Contributor:

github.com/instructlab/instructlab and the ilab tool

Are these not the same?

Contributor Author:

You're right, fixed.


This is one of the most common patterns in software testing and requires little introduction.
For our purposes, Unit Tests will be tests that do not require accelerated hardware (e.g. GPUs).
For us, the objective of a Unit Test is to check that an isolatable sub-system functions correctly.
Contributor:

A unit test, by definition, targets a unit. A (sub)system is, almost by definition, a collection of units (with their relationships).

What I'm saying is that we should keep these as separate tiers, e.g.:

(1) Unit tests (2) "(Sub)System" tests (3) Integration (multi-system) tests (4) Benchmarks.

In my opinion, the instructlab project has so far implemented most of its testing as 2+ or even 3+ only, and we have very few pure unit tests in the CLI repo, so perhaps this fact of life makes the distinction a bit fuzzy here.

(An example of a "not-really-a-unit" test in the CLI repo would be any test that calls into the CLI through click.Runner and observes the return code / output. These are not testing a unit (a class, a function, ...) but the whole machinery of the CLI, and should not have been called "unit" tests, IMO.)
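
For illustration, the difference in pytest might look like this: the first test targets a single function, while the second drives the whole CLI machinery through click's test runner (the command and helper here are hypothetical, not real CLI code):

```python
import click
from click.testing import CliRunner


def tokens_per_gpu(total_tokens: int, num_gpus: int) -> int:
    """Hypothetical helper, used only to illustrate the distinction."""
    return total_tokens // num_gpus


@click.command()
@click.option("--total-tokens", type=int, required=True)
@click.option("--num-gpus", type=int, default=1)
def shard(total_tokens: int, num_gpus: int) -> None:
    """Hypothetical CLI command wrapping the helper above."""
    click.echo(str(tokens_per_gpu(total_tokens, num_gpus)))


def test_tokens_per_gpu():
    # A true unit test: one function, no framework machinery involved.
    assert tokens_per_gpu(1000, 4) == 250


def test_shard_cli():
    # Closer to a (sub)system test: exercises argument parsing, the command, and its output.
    result = CliRunner().invoke(shard, ["--total-tokens", "1000", "--num-gpus", "4"])
    assert result.exit_code == 0
    assert result.output.strip() == "250"
```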

Contributor Author:

gotcha, I incorporated that distinction


To compensate for this challenge, we propose the following three-tier testing methodology:

1. Unit tests: verification that subsystems work correctly
@booxter (Contributor), Apr 8, 2025:

  1. Not strictly "testing", but a lot of real code issues (especially in the absence of more rigorous test coverage at other levels) are caught by linters / type checkers. I'd include them in the testing methodology too.

Contributor Author:

added that

The challenge, however, is that "basic" testing in this context probably still requires accelerated hardware. We can't check that checkpointing a LoRA model while using CPU Offloading
doesn't break without spinning up that exact testing scenario.

"Smoke tests" in this repo, therefore, will be defined as tests that demonstrate the functionality, though not the correctness, of features that we want to ship. This is
Contributor:

demonstrate the functionality, though not the correctness

I'm a bit confused by the distinction. I think smoke tests are checking for correctness, just for a more limited surface / scope. They are usually almost identical to the next step of the testing ladder, just using a smaller environment / running a lower number of test cases for the most basic scenarios.

Contributor Author:

Yeah this could be worded better.
Essentially what I'm describing is that we'd be looking for "loss went down" in the test, rather than training until convergence and benchmarking.
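
A toy version of that kind of check, deliberately independent of this library's real entry points (throwaway model and data, just to show the shape of the assertion):

```python
import torch


def test_loss_goes_down():
    """Smoke-style check: a handful of optimizer steps should reduce the loss.

    No convergence, no benchmarking; just "loss went down".
    """
    torch.manual_seed(0)
    model = torch.nn.Linear(16, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(64, 16), torch.randn(64, 1)

    losses = []
    for _ in range(20):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
        losses.append(loss.item())

    assert losses[-1] < losses[0]
```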


Benchmark testing might look like the following:

  1. Tune a few models on a well-understood dataset using one or two permutations of features, unique to the benchmark run.
  2. Benchmark the models on a battery of evaluation metrics that give us the strongest signal re: model performance.
Contributor:

One question here is whether the instructlab/training library should depend on the instructlab/eval library for its own benchmarking, or whether a more generic library (lm-eval) should be used instead. Does the eval library bring anything unique for testing purposes?

Contributor Author:

I don't have a clear answer on this upfront. I feel like we'd want to dogfood our own lib but be flexible
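
If lm-eval were chosen, a benchmark step might look roughly like the sketch below. The checkpoint path and task are placeholders, and the `simple_evaluate` call follows lm-eval's documented Python API as I understand it, so treat this as an assumption to verify rather than working code:

```python
# Rough sketch only: assumes lm-eval (lm-evaluation-harness) is installed and that a
# trained HuggingFace-format checkpoint exists at the placeholder path below.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/trained/checkpoint",
    tasks=["mmlu"],
    batch_size=8,
)
print(results["results"])  # per-task metrics to compare against a stored baseline
```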


### Responsibilities of: Unit Tests

This is one of the most common patterns in software testing and requires little introduction.
Contributor:

I'd include automated coverage targets / reports here. I know coverage targets are a bit controversial (Goodhart's law), but assuming we understand the risks, coverage reports and enforcement should help us not take our eye off the ball.

Contributor Author:

That's a good idea. We can make an issue to investigate this in the future

This library attempts to provide training support for multiple variants of accelerators, numbers of accelerators, two distributed backends, and many features. Therefore, the completeness matrix
is roughly explained by the following pseudocode:

```python
for variant in ["nvidia", "amd", "intel", "cpu", "mps"]:  # parallel at runner level
    ...  # nested loops over distributed backend and card count elided in this excerpt
```
@booxter (Contributor), Apr 8, 2025:

I'd recommend explicitly stating that tests are written in Python using the pytest framework. I'd rather not perpetuate the "functional tests" bash shell pattern found elsewhere in this repo, if possible.

@booxter (Contributor) commented Apr 8, 2025:

Perhaps CI folks would like to chime in here @courtneypacheco @ktdreyer

@RobotSail RobotSail requested a review from aldopareja April 8, 2025 16:29
@RobotSail RobotSail requested a review from Maxusmusti April 8, 2025 16:29
Signed-off-by: James Kunstle <jkunstle@redhat.com>
@JamesKunstle JamesKunstle force-pushed the upstream-testing-proposal branch from 68f34ee to 46456a1 Compare April 8, 2025 18:35
@mergify mergify bot removed the one-approval label Apr 11, 2025
@RobotSail RobotSail added the hold label Apr 11, 2025
@RobotSail (Member) left a comment:

Thank you @JamesKunstle for providing this document; having an automated way of validating incoming code will be very valuable for this library.

After reviewing this document, I'm requesting these changes:

  • Benchmarks: Measuring training library correctness through benchmarks will be unreliable and will require us to run many benchmarks to get a trustworthy result. For example, a model training incorrectly on AMD might score well on MMLU but still recommend that users drink and drive. Additionally, it would be very computationally expensive to run these, so the ROI here would be very low.
  • Plotting loss curves: An easier way to test correctness for something as complex as a training library would be to build a system in our testing where we export and store loss data from all of the training runs we do, and then have a simple way to compare it against the matrix of configurations we've included here and against past historical runs.

This way we can make testing viable in the training library without greatly increasing the friction of contributing.
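
A minimal sketch of that loss-curve comparison idea (file layout, tolerance, and function names are made up for illustration):

```python
import json
from pathlib import Path

TOLERANCE = 0.05  # hypothetical: allow 5% relative drift against the stored reference curve


def save_loss_curve(losses: list[float], path: Path) -> None:
    """Export per-step loss values from a training run so CI can archive them."""
    path.write_text(json.dumps(losses))


def matches_reference(losses: list[float], reference_path: Path) -> bool:
    """Compare a new run's loss curve against the stored historical curve for the same config."""
    reference = json.loads(reference_path.read_text())
    if len(reference) != len(losses):
        return False
    return all(
        abs(new - old) <= TOLERANCE * max(abs(old), 1e-8)
        for new, old in zip(losses, reference)
    )
```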

This puts the onus on the Unit Tests (and the test writer) to verify code correctness as much as they possibly can without using hardware. If something requires many invocations of
smoke tests to debug, it probably wasn't sufficiently debugged during development, or is insufficiently unit tested.

Smoke tests will inevitably require a high-spec development machine to run locally. It shouldn't be acceptable for smoke tests to run for >60 minutes; we should aim for them to run in <30 minutes.
Member:

Suggested change:
- Smoke tests will inevitably require a high-spec development machine to run locally. It shouldn't be acceptable for smoke tests to run for >60 minutes; we should aim for them to run in <30 minutes.
+ Smoke tests will inevitably require a high-spec development machine to run locally. It shouldn't be acceptable for smoke tests to run for >60 minutes; we should aim for them to run in <5 minutes.

Contributor Author:

I'd like to keep the upper bound at 30 minutes. That's the hard cutoff; faster is always better.


When we have to add a new accelerator variant or feature, we can plug them into their respective tier in the smoke testing hierarchy.

### Responsibilities of: Benchmark Tests
Member:

Benchmark tests are going to be far more computationally intensive than testing a training job, and are also unlikely to provide meaningful signals that a given model is learning correctly. We should not rely on these for testing training.

Contributor Author:

Yeah likely true. I can remove this / deprioritize it.

@RobotSail RobotSail removed the hold label Apr 12, 2025
@@ -0,0 +1,114 @@
# Testing in 'instructlab/instructlab'

We would like for this library to be a lightweight, efficient, and hackable toolkit for training LLMs with SFT and RLHF.
Member:

What is meant by RLHF in this case?

Contributor Author:

"Reinforcement Learning from Human Feedback"
