TST: Add regression tests 2 #1115
Conversation
This PR supersedes huggingface#995. The description below is copied and modified from that PR. For some technical reasons, it was easier for me to create a new PR than to update the previous one, sorry for that.

Description

In general, for regression tests, we need two steps:

1. Creating the regression artifacts, in this case the adapter checkpoint and the expected output of the model.
2. Running the regression tests, i.e. loading the adapter and checking that the output of the model is the same as the expected output.

My approach is to re-use as much code as possible between those two steps. Therefore, the same test script can be used for both, with only an environment variable to distinguish between the two. Step 1 is invoked by calling:

`REGRESSION_CREATION_MODE=True pytest tests/regression/test_regression.py`

and to run the second step, we call:

`pytest tests/regression/test_regression.py`

Creating regression artifacts

The first step will create an adapter checkpoint and an output for the given PEFT version and test setting in a new directory. E.g. it will create a directory `tests/regression/lora_opt-125m_bnb_4bit/0.5.0/` that contains adapter_model.bin and output.pt.

Before this step runs, there is a check that the git repo is clean (no dirty worktree) and that the commit is tagged (i.e. corresponds to a release version of PEFT). Otherwise, we may accidentally create regression artifacts that do not correspond to any PEFT release. The easiest way to get such a clean state (say, for PEFT v0.5.0) is by checking out a tagged commit, e.g. `git checkout v0.5.0`, before running the first step.

The first step will also skip the creation of regression artifacts if they already exist. It is possible to circumvent all the aforementioned checks by setting the environment variable `REGRESSION_FORCE_MODE` to True like so:

`REGRESSION_FORCE_MODE=True REGRESSION_CREATION_MODE=True pytest tests/regression/test_regression.py`

You should only do this if you know exactly what you're doing.

Running regression tests

The second step is much simpler. It will load the adapters and the output created in the first step, and compare the output to the output from a new PEFT model using the loaded adapter. The outputs should be the same. If more than one version is discovered for a given test setting, all of them are tested.

Notes

As is, the adapters are stored in the git repo itself. Since they're relatively small, the total size of the repo is still reasonable. However, it could be better to store those adapters on HF Hub instead. This would, however, make things a bit more complicated (not sure how to parse directories etc. on the Hub).

The regression tests included in this PR were used to check that huggingface#994 still allows loading checkpoints created with PEFT v0.6.1.
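To make the environment-variable switch concrete, here is a minimal sketch of how both modes can share one code path. It is simplified to the output artifact only (the real tests also save the adapter checkpoint), and the helper name and exact layout are illustrative rather than taken from this PR.

```python
import os

import torch

CREATION_MODE = os.environ.get("REGRESSION_CREATION_MODE", "false").lower() == "true"
ARTIFACT_ROOT = "tests/regression"  # layout as described above: <setting>/<peft version>/


def assert_output_regression(setting: str, version: str, output: torch.Tensor) -> None:
    """Store the expected output in creation mode, otherwise compare against it."""
    path = os.path.join(ARTIFACT_ROOT, setting, version, "output.pt")
    if CREATION_MODE:
        # step 1: persist the expected output for this PEFT version, unless it already exists
        if not os.path.exists(path):
            os.makedirs(os.path.dirname(path), exist_ok=True)
            torch.save(output, path)
    else:
        # step 2: load the stored output and require the freshly computed output to match it
        expected = torch.load(path)
        torch.testing.assert_close(output, expected)
```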
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thank you @BenjaminBossan for adding the extensive regression tests along with cmd configurations to automatically enable/skip regression tests 🚀!
One major issue with this approach is that it will bloat the package heavily, and we should actively avoid that. I would prefer that we create the models and outputs, save them as revisions based on version on HF Hub, and load them from there for the regression tests.
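If the artifacts go to the Hub, something along these lines could work; this is only a sketch, using one folder per PEFT version rather than a branch revision to keep the example simple. The repo name and layout are assumptions, not the final implementation.

```python
from huggingface_hub import HfApi

import peft

api = HfApi()
api.upload_folder(
    repo_id="peft-internal-testing/regression-tests",  # assumed target repo
    folder_path="tests/regression/lora_opt-125m_bnb_4bit",  # locally created artifacts
    path_in_repo=f"lora_opt-125m_bnb_4bit/{peft.__version__}",  # one folder per PEFT version
)
```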
Previously, the corresponding tests were testing only whether specific parameters had requires_grad True or False. Now, all parameters are being checked. This is more rigorous. Also, tests for Embedding, Conv1D, Conv2d were added, thus superseding PR #1115. Finally, tests for LoHa and LoKr were added. Note I considered moving the tests to a separate module, as they were getting quite big and this would help with readability. For now, I left them in the same module because it leads to a better diff view and is thus easier to review. LMK if I should move the tests to a separate file.
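An illustrative sketch (not the PR's actual helper) of what checking all parameters means, as opposed to spot-checking a few:

```python
import torch.nn as nn


def check_requires_grad(model: nn.Module, expected_trainable: set) -> None:
    # every parameter is inspected, not just a chosen few
    for name, param in model.named_parameters():
        should_train = name in expected_trainable  # e.g. only the lora_A / lora_B weights
        assert param.requires_grad == should_train, f"unexpected requires_grad for {name}"
```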
Yes, this increases the size when cloning the repo, though the pip package is unaffected. I'll investigate how the artifacts can be uploaded to the hub instead.
Description

Refactor all tuners (where it applies, i.e. not prompt tuning) to use the "base layer pattern". This means that the adapter layer will always hold a reference to the original layer that it modifies. This pattern is already partly used (e.g. LoRA bnb, gptq layers); now it is consistently used everywhere when applicable.

This PR is a companion PR to #1069, where I first added these changes. They are now extracted to a separate PR to make code review easier and to advance more quickly.

Implementation

The main change is that the adapter layer wraps the original layer and calls forward on that layer, instead of doing stuff like this:

`F.linear(input, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)`

which completely circumvents the call to the target layer's forward method. With the base layer pattern, we now call the target layer's forward method. Therefore, if the target layer is another adapter layer (which will be crucial for mixed adapters), we call its forward method correctly. Also, this should allow passing extra arguments, like lora_scale, to forward.

This change has the nice side benefit that we no longer need to use _init_empty_weights -- in fact, we don't initialize any of the target layer's weights anymore, since we have a reference to it. There is thus no risk of slow but superfluous initialization of layers. Moreover, I could greatly simplify merge_and_unload by just using the base_layer instead of having to create a completely new layer. For OPT-350m, this results in a 15x speedup.

Note that, same as for the bnb layers, this should be backwards compatible, since the adapter weights and their state_dicts are not affected by this change. I used #1115 for regression testing.

Somewhat unrelated changes

During debugging, I got very annoyed with the fact that the reprs of adapter layers and normal PyTorch layers are hard to distinguish, e.g. the type is just "Linear". Now, for adapter layers, it is prefixed by the adapter type, e.g. "lora.Linear". This should have no further implications except for the repr (e.g. state_dict remains unaffected).

For LoHa and LoKr, I had to change the init of weights when using init_weights=False. This is because of what is discussed in Numerical instabilities with LoHa #1058.

IA³ now has the unload method too. LoHa and LoKr now support safe_merge=True when merging layers.

Migration guide

For 99% of users, the code should continue working as usual, because the API stays the same. Only low-level details have been changed. Code that relies on isinstance checks on specific PEFT classes may break. E.g. the LoRA Linear layer no longer inherits from nn.Linear. It is, however, still a BaseTunerLayer. The same logic applies to other layer types like Conv2d and to other tuners like IA³.

To retrieve the base layer of an adapter layer, you should now call `module.get_base_layer()` if you deal with a BaseTunerLayer. Don't rely on something like `module.weight` being present (though it might be).
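A rough sketch of the base layer pattern (not PEFT's actual lora.Linear): the adapter keeps a reference to the wrapped module, delegates to its forward instead of re-implementing it, and exposes it via get_base_layer.

```python
import torch
import torch.nn as nn


class LoraLinearSketch(nn.Module):
    def __init__(self, base_layer: nn.Linear, r: int = 8, scaling: float = 1.0):
        super().__init__()
        self.base_layer = base_layer  # keep a reference; no re-initialization of its weights
        self.lora_A = nn.Linear(base_layer.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base_layer.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # start as a no-op adapter
        self.scaling = scaling

    def get_base_layer(self) -> nn.Module:
        return self.base_layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # call the wrapped layer's forward; it may itself be another adapter layer
        result = self.base_layer(x)
        return result + self.lora_B(self.lora_A(x)) * self.scaling
```

This is also why merge_and_unload can simply return the (merged) base layer instead of constructing a completely new one.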
Instead of storing them in the git repo. That way, the git repo stays lean.
@pacman100 I think your concern should be addressed. I changed the code so that all regression artifacts are stored on HF Hub. You can check them here: https://huggingface.co/peft-internal-testing/regression-tests/tree/main

Note that I had to do a bit of trickery to get the regression artifacts for PEFT v0.6.2, because regression tests need to be run on a clean tagged commit, but of course for v0.6.2 the regression tests didn't exist yet. The output should be correct though. I also changed the code to use safetensors, since that's our new standard. In the future, if this PR is merged, no trickery should be necessary to create the regression artifacts.

I ran a regression test on current main with the artifacts from v0.6.2 and they all passed. For now, the regression tests are not run automatically on CI. Once we agree if and how we want to do that, I can add that in a separate PR.
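For reference, fetching the artifacts in a test could look roughly like this; the exact file layout inside the Hub repo is assumed from the local layout described earlier, so treat the paths as placeholders.

```python
import torch
from huggingface_hub import snapshot_download
from safetensors.torch import load_file

local_dir = snapshot_download(repo_id="peft-internal-testing/regression-tests")
# assumed layout: <setting>/<peft version>/{adapter_model.safetensors, output.pt}
adapter_state_dict = load_file(f"{local_dir}/lora_opt-125m_bnb_4bit/0.6.2/adapter_model.safetensors")
expected_output = torch.load(f"{local_dir}/lora_opt-125m_bnb_4bit/0.6.2/output.pt")
```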
This PR fixes a handful of issues with AdaLora and should resolve huggingface#1113.

Description

1. lora_A.weight.device was called, but for AdaLora, lora_A is an nn.Parameter, not an nn.Module, so the weight attribute does not exist. lora_A.device is sufficient.
2. For 8bit, an inplace operation failed because it was on a view. Now the operation is no longer inplace.
3. The loss term of the model output is not necessarily a torch tensor. In the test, it was a dict and did not contain an actual loss. Therefore, I added a check to make sure the loss is a torch tensor. Is there a better way?

Notes

Running `pytest tests/test_gpu_examples.py -k adalora` locally (with GPU) passes. Ideally, someone else can confirm, as normal unit tests won't catch this. If this is merged before huggingface#1115, skipping AdaLora tests in that PR can be removed.
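As a sketch of point 3 (illustrative only, not the PR's actual code), the test should only call backward when the output's loss really is a tensor:

```python
import torch
import torch.nn as nn


def backward_with_loss_check(model: nn.Module, inputs: dict) -> torch.Tensor:
    outputs = model(**inputs)
    loss = getattr(outputs, "loss", None)
    if not isinstance(loss, torch.Tensor):  # the loss entry may be missing or not a tensor
        raise TypeError(f"expected a loss tensor in the model output, got {type(loss)}")
    loss.backward()
    return loss
```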
Thanks a lot @BenjaminBossan for these great efforts!
Can you also add a job to run regression tests on our daily CI for bnb and GPU tests through your new makefile command: https://github.com/huggingface/peft/blob/main/.github/workflows/nightly.yml#L48
tests/regression/test_regression.py
Outdated
```python
def require_torch_gpu(test_case):
    """
    Decorator marking a test that requires a GPU. Will be skipped when no GPU is available.

    Copied from testing_utils.py.
    """
    if not torch.cuda.is_available():
        return unittest.skip("test requires GPU")(test_case)
    else:
        return test_case


def require_bitsandbytes(test_case):
    """
    Decorator marking a test that requires the bitsandbytes library. Will be skipped when the library is not installed.

    Copied from testing_utils.py.
    """
    try:
        import bitsandbytes  # noqa: F401
    except ImportError:
        return unittest.skip("test requires bitsandbytes")(test_case)
    else:
        return test_case
```
would it be possible to use the ones from https://github.com/huggingface/peft/blob/main/tests/testing_utils.py ?
tests/regression/test_regression.py
Outdated
```python
        save_model(model, name, force=self.force_mode)

    def _assert_results_equal(self, name):
        # TODO: abstract
```
Shall we remove this TODO?
```python
        # TODO
        self.skipTest("Skipping AdaLora for now because of a bug, see #1113")
```
#1113 is merged, I guess we can remove it.
@younesbelkada Thanks for the review. I tried importing the two copied functions but always get an error. If you know a solution to this, please LMK. Regarding the AdaLora tests: I did a check and they're still failing. Not sure why. I'd propose to investigate those later. The unnecessary comment was removed.
Thanks! Can we also add a job on the nightly runners to run the regression tests on GPU?