adds upstream testing development document #404
base: main
Conversation
@JamesKunstle can you fix the markdown linting? Can run …
Force-pushed from fc72782 to f8c0843
docs/upstream-testing-plan.md (outdated)

```python
for variant in ["nvidia", "amd", "intel", "cpu", "mps"]:  # parallel at runner level
```
This is going to be a real challenge with runner availability to cover these on every PR.
Yeah, I figure that we'll scale out a little at a time, starting w/ CPU and nvidia, and adding runners to the matrix as possible.
comment here: there are no MPS runners to my knowledge. We would need to self-host runners for most of these scenarios outside of CPU and CUDA.
I see O(n^3) configs here. This is indeed not scalable. In some projects with similar problems that I worked on, this dilemma was solved by crafting a set of "verified configurations", defined in such a way that they cover the highest number of unique paths through the multi-dimensional feature space without blowing up the number of permutations too much.
For example, one could have:
- NVIDIA + FSDP + 8 cards
- AMD + DEEPSPEED + 2 cards
- INTEL + DEEPSPEED + 1 card
- CPU + "NONE" + No cards
- MPS + "NONE" + No cards (or should it be 1 card if we can find a runner with MPS properly working?)
The list can then be expanded as needed if we feel we want better coverage for some feature paths (e.g. test NVIDIA with both frameworks and in two card configurations, 1 vs 8).
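For illustration, a minimal sketch of how such a "verified configurations" list could be expressed as pytest parameters (the config names and the placeholder test body are assumptions, not part of the proposal):

```python
# Hypothetical sketch: a curated list of "verified configurations" used to
# parametrize smoke tests, instead of exploding the full feature matrix.
import pytest

VERIFIED_CONFIGS = [
    # (accelerator, distributed backend, number of cards)
    ("nvidia", "fsdp", 8),
    ("amd", "deepspeed", 2),
    ("intel", "deepspeed", 1),
    ("cpu", None, 0),
    ("mps", None, 0),
]

@pytest.mark.parametrize("accelerator,backend,num_cards", VERIFIED_CONFIGS)
def test_training_smoke(accelerator, backend, num_cards):
    # Placeholder body: a real test would launch a short training run for
    # this configuration and assert it completes without error.
    assert num_cards >= 0
```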
Yeah agreed, testing the entire space is intractable.
I think we can probably restrict the space to:
[variants] x [fsdp or deepspeed] x [single or multi-card]
We can assume that the topology is >=1 accelerator and >=1 node. We shouldn't be concerned with CPU or MPS right now.
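As a rough illustration of that restricted space (a sketch only; the variant names come from the pseudocode above, everything else is assumed):

```python
# Sketch of the restricted test space described above:
# [variants] x [fsdp or deepspeed] x [single or multi-card].
from itertools import product

variants = ["nvidia", "amd", "intel"]       # CPU and MPS deferred for now
backends = ["fsdp", "deepspeed"]
topologies = ["single-card", "multi-card"]  # >= 1 accelerator, >= 1 node assumed

restricted_matrix = list(product(variants, backends, topologies))
print(len(restricted_matrix))  # 3 * 2 * 2 = 12 configurations
```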
Force-pushed from f8c0843 to c405770
The current CI failure seems to be because the workflow can't access …
I am looking at the difference between https://github.com/instructlab/training/blob/main/.github/workflows/unit-tests.yaml#L12 and … Were there reasons for the unit tests to be different?
No, there wasn't a reason for the two to be different apart from just writing the unit-test workflow from scratch. The two should have the same behavior, I'll amend that. I'm confused about the …
`pull_request` runs from the context of your merge commit and `pull_request_target` runs from the context of instructlab/training. So when running from the context of your branch, it can't see the var from the instructlab/training repo.
That's super interesting, I totally missed that distinction in the docs. I'll look more closely at that. I opened a PR #411 that mirrors the …
From the docs: … My new understanding is that we have to use `pull_request_target` since we need the repo's secure resources. If we weren't using these secure resources we could use the `pull_request` trigger.
Force-pushed from c405770 to 68f34ee
That matches my understanding as well. I had originally made sure you had permissions set to `{}`, but I missed the `pull_request` vs `pull_request_target` at the top.
Here's the most informative doc IMO: …
this looks good, few comments
### Responsibilities of: Smoke Tests

The origin of the term "smoke test" comes from plumbing. Plumbers would pipe smoke, instead of water, through a completed pipe system to check that there were no leaks.
today I learned
### How often should we be running benchmark tests?

There are many options for this depending on the observed risk that new features or dependency bumps seem to bring to the project. Likely, a comprehensive benchmark test should be done as a baseline and then the test should only be repeated when new model architectures are supported, major versions of ilab SDG are released, etc.
we might want to do this more often, like on medium-to-large library changes, or upon a weekly release to ensure we didn't merge anything in the last week that broke stuff
docs/upstream-testing-plan.md (outdated)

# Testing in 'instructlab/instructlab'

We would like for this library to be a lightweight, efficient, and hackable toolkit for training "small" LLMs.
Currently, this library is used heavily by `github.com/instructlab/instructlab` and the `ilab` tool, and is tested via that project's end-to-end tests.
> `github.com/instructlab/instructlab` and the `ilab` tool

Are these not the same?
You're right, fixed.
docs/upstream-testing-plan.md (outdated)

This is one of the most common patterns in software testing and requires little introduction.
For our purposes, Unit Tests will be tests that do not require accelerated hardware (e.g. GPUs).
For us, the objective of a Unit Test is to check that an isolateable sub-system functions correctly.
A unit test, by definition, targets a unit. A (sub)system is, almost by definition, a collection of units (with their relationships).
What I'm saying is that we should have a separation of these levels, e.g.:
(1) Unit tests (2) "(Sub)System" tests (3) Integration (multi-system) tests (4) Benchmarks.
In my opinion, the instructlab project has so far implemented most of its testing at level 2+ or even 3+ only, and we have very few pure unit tests in the CLI repo, so perhaps this fact of life makes the distinction a bit fuzzy here.
(An example of a "not-really-a-unit" test in the CLI repo would be any test that calls functions of the CLI through click.Runner and observes the return code / output. These are not testing a unit (a class, a function, ...) but the whole machinery of the CLI, and should not have been called "unit", IMO.)
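To make the distinction concrete, here is a small illustrative sketch (the `greet` command and `format_greeting` helper are hypothetical, used only to contrast the two styles):

```python
# Hypothetical example contrasting a pure unit test with a CLI-level test.
import click
from click.testing import CliRunner


def format_greeting(name: str) -> str:
    return f"Hello, {name}!"


@click.command()
@click.option("--name", default="world")
def greet(name: str) -> None:
    click.echo(format_greeting(name))


def test_format_greeting_unit():
    # Pure unit test: exercises a single function with no CLI machinery.
    assert format_greeting("InstructLab") == "Hello, InstructLab!"


def test_greet_cli():
    # CLI-level test: drives the whole command through click's runner, so it
    # is closer to a (sub)system or integration test than a unit test.
    result = CliRunner().invoke(greet, ["--name", "InstructLab"])
    assert result.exit_code == 0
    assert "Hello, InstructLab!" in result.output
```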
gotcha, I incorporated that distinction
docs/upstream-testing-plan.md (outdated)

To compensate for this challenge, we propose the following three-tier testing methodology:

1. Unit tests: verification that subsystems work correctly
- Not strictly "testing", but a lot of real code issues, especially in the absence of more rigorous test coverage at other levels, are caught by linters / type checkers. I'd include them in the testing methodology too.
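For instance, a toy illustration (not from the document) of the kind of bug a type checker such as mypy flags without any test being written:

```python
# Toy example: mypy/pyright would flag this statically, with no test needed.
def tokens_per_gpu(total_tokens: int, num_gpus: int) -> int:
    # "/" returns a float, but the annotation promises an int.
    return total_tokens / num_gpus


print(tokens_per_gpu(1024, 8))
```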
added that
docs/upstream-testing-plan.md (outdated)

The challenge, however, is that "basic" testing in this context probably still requires accelerated hardware. We can't check that checkpointing a LoRA model while using CPU Offloading doesn't break without spinning up that exact testing scenario.

"Smoke tests" in this repo, therefore, will be defined as tests that demonstrate the functionality, though not the correctness, of features that we want to ship. This is ...
> demonstrate the functionality, though not the correctness
I'm a bit confused by the distinction. I think smoke tests are checking for correctness, just for a more limited surface / scope. They are usually almost identical to the next step of the testing ladder, just using a smaller environment / running a lower number of test cases for the most basic scenarios.
Yeah this could be worded better.
Essentially what I'm describing is that we'd be looking for "loss went down" in the test, rather than training until convergence and benchmarking.
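A minimal sketch of what a "loss went down" smoke check could look like (the `train_for_steps` helper is hypothetical; a real test would call into the training library's own entry point):

```python
# Hypothetical smoke check: run a handful of steps and assert the loss trends
# down, without training to convergence or running any benchmark evaluation.
def train_for_steps(num_steps: int) -> list[float]:
    # Stand-in for a short real training run that returns per-step loss values.
    return [2.0 - 0.1 * step for step in range(num_steps)]


def test_loss_goes_down_smoke():
    losses = train_for_steps(num_steps=10)
    first, last = losses[0], losses[-1]
    # Loose check: we only care that training is functional, not optimal.
    assert last < first, f"loss did not decrease: {first:.3f} -> {last:.3f}"
```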
Benchmark testing might look like the following:

1. Tune a few models on well-understood dataset using one or two permutations of features, unique to benchmark run.
2. Benchmark model on battery of evaluation metrics that give us the strongest signal re: model performance.
One question here is whether the instructlab/training library should depend on the instructlab/eval library for its own benchmarking, or whether a more generic library (lm-eval) should be used instead. Does the eval library bring anything unique for testing purposes?
I don't have a clear answer on this upfront. I feel like we'd want to dogfood our own lib but be flexible
### Responsibilities of: Unit Tests

This is one of the most common patterns in software testing and requires little introduction.
I'd include automated coverage targets / reports here. I know coverage targets are a bit controversial (Goodhart's law), but assuming we understand the risks, coverage reports and enforcement should help us not take our eye off the ball.
That's a good idea. We can make an issue to investigate this in the future
This library attempts to provide training support for multiple variants of accelerators, numbers of accelerators, two distributed backends, and many features. Therefore, the completeness matrix is roughly explained by the following pseudocode:

```python
...
```
I'd recommend explicitly stating that tests are written in Python using the pytest framework. I'd rather not perpetuate the "functional tests" bash shell pattern found elsewhere in this repo, if possible.
Perhaps CI folks would like to chime in here @courtneypacheco @ktdreyer
Signed-off-by: James Kunstle <jkunstle@redhat.com>
Force-pushed from 68f34ee to 46456a1
Thank you @JamesKunstle for providing this document, having an automated way of validating incoming code will be very valuable to have in this library.
After reviewing this document, I'm requesting these changes:
- Benchmarks: Measuring correctness through benchmarks will be unreliable for a training library and will require us to run many benchmarks to get a trustworthy result. For example, a model training incorrectly on AMD might score well on MMLU but still recommend that users drink and drive. Additionally, it will be very computationally expensive to run these, so the ROI here would be very low.
- Plotting loss curves: An easy way to test correctness for something complex like a training library would be to create a system in our testing where we can export and store loss data from all of the training runs we do, and then have a simple way to compare it across the matrix of configurations we've included here and against past runs we've done historically (see the sketch after this list).
This way we can make testing viable in the training library without greatly increasing the friction of adding contributions.
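A rough sketch of what such a loss-curve regression check might look like (the stored-reference file layout, helper names, and tolerance are all assumptions for illustration):

```python
# Hypothetical loss-curve regression check: compare a new run's loss history
# against a stored reference curve for the same configuration.
import json
from pathlib import Path


def load_reference_losses(config_name: str, ref_dir: Path) -> list[float]:
    # Reference curves are assumed to be stored as JSON, one file per config.
    return json.loads((ref_dir / f"{config_name}.json").read_text())


def losses_match(new: list[float], reference: list[float], tolerance: float = 0.05) -> bool:
    # Require the same number of logged steps and per-step losses within tolerance.
    if len(new) != len(reference):
        return False
    return all(abs(a - b) <= tolerance for a, b in zip(new, reference))


if __name__ == "__main__":
    reference = [2.0, 1.6, 1.3, 1.1]    # stand-in for a stored historical curve
    new_run = [2.01, 1.58, 1.33, 1.12]  # stand-in for a freshly exported curve
    print("within tolerance:", losses_match(new_run, reference))
```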
This puts the onus on the Unit Tests (and the test writer) to verify code correctness as much as they possibly can without using hardware. If something requires many invocations of smoke tests to debug, it probably wasn't sufficiently debugged during development, or is insufficiently unit tested.

Smoke tests will inevitably require a high-spec development machine to run locally. It shouldn't be acceptable for smoke tests to run for >60 minutes- we should aim for them to run in <30 minutes.
Suggested change:
- Smoke tests will inevitably require a high-spec development machine to run locally. It shouldn't be acceptable for smoke tests to run for >60 minutes- we should aim for them to run in <30 minutes.
+ Smoke tests will inevitably require a high-spec development machine to run locally. It shouldn't be acceptable for smoke tests to run for >60 minutes- we should aim for them to run in <5 minutes.
I'd like to keep the upper bound at 30 min. That's the hard cutoff; faster is always better.
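If we wanted to enforce that bound mechanically, one option (assuming the pytest-timeout plugin were adopted; it is not part of the proposal) is a per-test timeout marker:

```python
# Assumes the pytest-timeout plugin; 1800 s enforces the 30-minute hard cutoff.
import pytest


@pytest.mark.timeout(1800)
def test_smoke_training_run():
    # Placeholder: a real smoke test would launch a short training job here.
    assert True
```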
When we have to add a new accelerator variant or feature, we can plug them into their respective tier in the smoke testing hierarchy.

### Responsibilities of: Benchmark Tests
Benchmark tests are going to be far more computationally intensive than testing a training job, and are also unlikely to provide meaningful signals that a given model is learning correctly. We should not rely on these for testing training.
Yeah likely true. I can remove this / deprioritize it.
@@ -0,0 +1,114 @@

# Testing in 'instructlab/instructlab'

We would like for this library to be a lightweight, efficient, and hackable toolkit for training LLMs with SFT and RLHF.
What is meant by RLHF in this case?
"Reinforcement Learning from Human Feedback"
Adds a dev doc describing our testing strategy for this repo.