TST: Add regression tests 2 #1115
Conversation
This PR supersedes huggingface#995. The description below is copied and modified from that PR. For some technical reasons, it was easier for me to create a new PR than to update the previous one, sorry for that.

Description

In general, for regression tests, we need two steps:

1. Creating the regression artifacts, in this case the adapter checkpoint and the expected output of the model.
2. Running the regression tests, i.e. loading the adapter and checking that the output of the model is the same as the expected output.

My approach is to re-use as much code as possible between those two steps. Therefore, the same test script can be used for both, with only an environment variable to distinguish between the two. Step 1 is invoked by calling:

`REGRESSION_CREATION_MODE=True pytest tests/regression/test_regression.py`

and to run the second step, we call:

`pytest tests/regression/test_regression.py`

Creating regression artifacts

The first step will create an adapter checkpoint and an output for the given PEFT version and test setting in a new directory. E.g. it will create a directory `tests/regression/lora_opt-125m_bnb_4bit/0.5.0/` that contains adapter_model.bin and output.pt.

Before this step runs, there is a check that the git repo is clean (no dirty worktree) and that the commit is tagged (i.e. corresponds to a release version of PEFT). Otherwise, we may accidentally create regression artifacts that do not correspond to any PEFT release. The easiest way to get such a clean state (say, for PEFT v0.5.0) is by checking out a tagged commit, e.g. `git checkout v0.5.0`, before running the first step.

The first step will also skip the creation of regression artifacts if they already exist. It is possible to circumvent all the aforementioned checks by setting the environment variable `REGRESSION_FORCE_MODE` to True like so:

`REGRESSION_FORCE_MODE=True REGRESSION_CREATION_MODE=True pytest tests/regression/test_regression.py`

You should only do this if you know exactly what you're doing.

Running regression tests

The second step is much simpler. It will load the adapters and the output created in the first step, and compare the output to the output from a new PEFT model using the loaded adapter. The outputs should be the same. If more than one version is discovered for a given test setting, all of them are tested.

Notes

As is, the adapters are stored in the git repo itself. Since they're relatively small, the total size of the repo is still reasonable. However, it could be better to store those adapters on HF Hub instead. This would, however, make things a bit more complicated (not sure how to parse directories etc. on the Hub).

The regression tests included in this PR were used to check that huggingface#994 still allows loading checkpoints created with PEFT v0.6.1.
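To make the environment-variable switch concrete, here is a minimal sketch of how both modes can share one code path. It is simplified to the output artifact only (the real tests also save the adapter checkpoint), and the helper name and exact layout are illustrative rather than taken from this PR.

```python
import os

import torch

CREATION_MODE = os.environ.get("REGRESSION_CREATION_MODE", "false").lower() == "true"
ARTIFACT_ROOT = "tests/regression"  # layout as described above: <setting>/<peft version>/


def assert_output_regression(setting: str, version: str, output: torch.Tensor) -> None:
    """Store the expected output in creation mode, otherwise compare against it."""
    path = os.path.join(ARTIFACT_ROOT, setting, version, "output.pt")
    if CREATION_MODE:
        # step 1: persist the expected output for this PEFT version, unless it already exists
        if not os.path.exists(path):
            os.makedirs(os.path.dirname(path), exist_ok=True)
            torch.save(output, path)
    else:
        # step 2: load the stored output and require the freshly computed output to match it
        expected = torch.load(path)
        torch.testing.assert_close(output, expected)
```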
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thank you @BenjaminBossan for adding the extensive regression tests along with cmd configurations to automatically enable/skip regression tests 🚀!
One major issue with this approach is that it will bloat the package heavily, and we should actively avoid that. I would prefer that we create the models and outputs, save them as revisions based on version on HF Hub, and load them from there for the regression tests.
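If the artifacts go to the Hub, something along these lines could work; this is only a sketch, using one folder per PEFT version rather than a branch revision to keep the example simple. The repo name and layout are assumptions, not the final implementation.

```python
from huggingface_hub import HfApi

import peft

api = HfApi()
api.upload_folder(
    repo_id="peft-internal-testing/regression-tests",  # assumed target repo
    folder_path="tests/regression/lora_opt-125m_bnb_4bit",  # locally created artifacts
    path_in_repo=f"lora_opt-125m_bnb_4bit/{peft.__version__}",  # one folder per PEFT version
)
```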
Previously, the corresponding tests were testing only whether specific parameters had requires_grad True or False. Now, all parameters are being checked. This is more rigorous. Also, tests for Embedding, Conv1D, Conv2d were added, thus superseding PR #1115. Finally, tests for LoHa and LoKr were added. Note I considered moving the tests to a separate module, as they were getting quite big and this would help with readability. For now, I left them in the same module because it leads to a better diff view and is thus easier to review. LMK if I should move the tests to a separate file.
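An illustrative sketch (not the PR's actual helper) of what checking all parameters means, as opposed to spot-checking a few:

```python
import torch.nn as nn


def check_requires_grad(model: nn.Module, expected_trainable: set) -> None:
    # every parameter is inspected, not just a chosen few
    for name, param in model.named_parameters():
        should_train = name in expected_trainable  # e.g. only the lora_A / lora_B weights
        assert param.requires_grad == should_train, f"unexpected requires_grad for {name}"
```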
Yes, this increases the size when cloning the repo, though the pip package is unaffected. I'll investigate how the artifacts can be uploaded to the hub instead.
Description

Refactor all tuners (where it applies, i.e. not prompt tuning) to use the "base layer pattern". This means that the adapter layer will always hold a reference to the original layer that it modifies. This pattern is already partly used (e.g. LoRA bnb, gptq layers); now it is consistently used everywhere when applicable.

This PR is a companion PR to #1069, where I first added these changes. They are now extracted to a separate PR to make code review easier and to advance more quickly.

Implementation

The main change is that the adapter layer wraps the original layer and calls forward on that layer, instead of doing stuff like this:

`F.linear(input, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)`

which completely circumvents the call to the target layer's forward method. With the base layer pattern, we now call the target layer's forward method. Therefore, if the target layer is another adapter layer (which will be crucial for mixed adapters), we call its forward method correctly. Also, this should allow passing extra arguments, like lora_scale, to forward.

This change has the nice side benefit that we no longer need to use _init_empty_weights -- in fact, we don't initialize any of the target layer's weights anymore, since we have a reference to it. There is thus no risk of slow but superfluous initialization of layers. Moreover, I could greatly simplify merge_and_unload by just using the base_layer instead of having to create a completely new layer. For OPT-350m, this results in a 15x speedup.

Note that, same as for the bnb layers, this should be backwards compatible, since the adapter weights and their state_dicts are not affected by this change. I used #1115 for regression testing.

Somewhat unrelated changes

During debugging, I got very annoyed with the fact that the reprs of adapter layers and normal PyTorch layers are hard to distinguish, e.g. the type is just "Linear". Now, for adapter layers, it is prefixed by the adapter type, e.g. "lora.Linear". This should have no further implications except for the repr (e.g. state_dict remains unaffected).

For LoHa and LoKr, I had to change the init of weights when using init_weights=False. This is because of what is discussed in Numerical instabilities with LoHa #1058.

IA³ now has the unload method too. LoHa and LoKr now support safe_merge=True when merging layers.

Migration guide

For 99% of users, the code should continue working as usual, because the API stays the same. Only low-level details have been changed. Code that relies on isinstance checks on specific PEFT classes may break. E.g. the LoRA Linear layer no longer inherits from nn.Linear. It is, however, still a BaseTunerLayer. The same logic applies to other layer types like Conv2d and to other tuners like IA³.

To retrieve the base layer of an adapter layer, you should now call `module.get_base_layer()` if you deal with a BaseTunerLayer. Don't rely on something like `module.weight` being present (though it might be).
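A rough sketch of the base layer pattern (not PEFT's actual lora.Linear): the adapter keeps a reference to the wrapped module, delegates to its forward instead of re-implementing it, and exposes it via get_base_layer.

```python
import torch
import torch.nn as nn


class LoraLinearSketch(nn.Module):
    def __init__(self, base_layer: nn.Linear, r: int = 8, scaling: float = 1.0):
        super().__init__()
        self.base_layer = base_layer  # keep a reference; no re-initialization of its weights
        self.lora_A = nn.Linear(base_layer.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base_layer.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # start as a no-op adapter
        self.scaling = scaling

    def get_base_layer(self) -> nn.Module:
        return self.base_layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # call the wrapped layer's forward; it may itself be another adapter layer
        result = self.base_layer(x)
        return result + self.lora_B(self.lora_A(x)) * self.scaling
```

This is also why merge_and_unload can simply return the (merged) base layer instead of constructing a completely new one.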
Instead of storing them in the git repo. That way, the git repo stays lean.
@pacman100 I think your concern should be addressed. I changed the code so that all regression artifacts are stored on HF Hub. You can check them here: https://huggingface.co/peft-internal-testing/regression-tests/tree/main

Note that I had to do a bit of trickery to get the regression artifacts for PEFT v0.6.2, because regression tests need to be run on a clean tagged commit, but of course for v0.6.2 the regression tests didn't exist yet. The output should be correct though. I also changed the code to use safetensors, since that's our new standard. In the future, if this PR is merged, no trickery should be necessary to create the regression artifacts.

I ran a regression test on current main with the artifacts from v0.6.2 and they all passed. For now, the regression tests are not run automatically on CI. Once we agree if and how we want to do that, I can add that in a separate PR.
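For reference, fetching the artifacts in a test could look roughly like this; the exact file layout inside the Hub repo is assumed from the local layout described earlier, so treat the paths as placeholders.

```python
import torch
from huggingface_hub import snapshot_download
from safetensors.torch import load_file

local_dir = snapshot_download(repo_id="peft-internal-testing/regression-tests")
# assumed layout: <setting>/<peft version>/{adapter_model.safetensors, output.pt}
adapter_state_dict = load_file(f"{local_dir}/lora_opt-125m_bnb_4bit/0.6.2/adapter_model.safetensors")
expected_output = torch.load(f"{local_dir}/lora_opt-125m_bnb_4bit/0.6.2/output.pt")
```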
This PR fixes a handful of issues with AdaLora and should resolve huggingface#1113.

Description

1. lora_A.weight.device was called, but for AdaLora, lora_A is an nn.Parameter, not an nn.Module, so the weight attribute does not exist. lora_A.device is sufficient.
2. For 8bit, an inplace operation failed because it was on a view. Now the operation is no longer inplace.
3. The loss term of the model output is not necessarily a torch tensor. In the test, it was a dict and did not contain an actual loss. Therefore, I added a check to make sure the loss is a torch tensor. Is there a better way?

Notes

Running `pytest tests/test_gpu_examples.py -k adalora` locally (with GPU) passes. Ideally, someone else can confirm, as normal unit tests won't catch this. If this is merged before huggingface#1115, skipping AdaLora tests in that PR can be removed.
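As a sketch of point 3 (illustrative only, not the PR's actual code), the test should only call backward when the output's loss really is a tensor:

```python
import torch
import torch.nn as nn


def backward_with_loss_check(model: nn.Module, inputs: dict) -> torch.Tensor:
    outputs = model(**inputs)
    loss = getattr(outputs, "loss", None)
    if not isinstance(loss, torch.Tensor):  # the loss entry may be missing or not a tensor
        raise TypeError(f"expected a loss tensor in the model output, got {type(loss)}")
    loss.backward()
    return loss
```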
Thanks a lot @BenjaminBossan for these great efforts!
Can you also add a job to run regression tests on our daily CI for bnb and GPU tests through your new makefile command: https://github.com/huggingface/peft/blob/main/.github/workflows/nightly.yml#L48
tests/regression/test_regression.py
Outdated
```python
def require_torch_gpu(test_case):
    """
    Decorator marking a test that requires a GPU. Will be skipped when no GPU is available.

    Copied from testing_utils.py.
    """
    if not torch.cuda.is_available():
        return unittest.skip("test requires GPU")(test_case)
    else:
        return test_case


def require_bitsandbytes(test_case):
    """
    Decorator marking a test that requires the bitsandbytes library. Will be skipped when the library is not installed.

    Copied from testing_utils.py.
    """
    try:
        import bitsandbytes  # noqa: F401
    except ImportError:
        return unittest.skip("test requires bitsandbytes")(test_case)
    else:
        return test_case
```
would it be possible to use the ones from https://github.com/huggingface/peft/blob/main/tests/testing_utils.py ?
tests/regression/test_regression.py
Outdated
```python
        save_model(model, name, force=self.force_mode)

    def _assert_results_equal(self, name):
        # TODO: abstract
```
Shall we remove this TODO?
```python
        # TODO
        self.skipTest("Skipping AdaLora for now because of a bug, see #1113")
```
#1113 is merged, I guess we can remove it.
@younesbelkada Thanks for the review. I tried importing the two copied functions but always get an error. If you know a solution to this, please LMK. Regarding the AdaLora tests: I did a check and they're still failing. Not sure why. I'd propose to investigate those later. The unnecessary comment was removed.
Thanks! Can we also add a job on the nightly runners to run the regression tests on GPU?