TST: Add regression tests 2 #1115

Merged

Conversation

BenjaminBossan
Member

This PR supersedes #995. The description below is copied and modified from that PR. For some technical reasons, it was easier for me to create a new PR than to update the previous one, sorry for that.

Description

In general, for regression tests, we need two steps:

  1. Creating the regression artifacts, in this case the adapter checkpoint and the expected output of the model.
  2. Running the regression tests, i.e. loading the adapter and checking that the output of the model is the same as the expected output.

My approach is to re-use as much code as possible between those two steps. Therefore, the same test script can be used for both, with only an environment variable to distinguish between the two. Step 1 is invoked by calling:

`REGRESSION_CREATION_MODE=True pytest tests/regression/test_regression.py`

and to run the second step, we call:

`pytest tests/regression/test_regression.py`
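To illustrate the shared-code idea, here is a minimal sketch of how a single code path can serve both modes (this is not the actual test code of this PR; the function name and details are made up for illustration):

```python
import os

import peft
import torch

# Step 1 vs. step 2 is decided solely by this environment variable.
CREATION_MODE = os.environ.get("REGRESSION_CREATION_MODE", "").lower() in ("1", "true", "yes")


def check_regression(peft_model, inputs, setting_name):
    # Artifacts live under tests/regression/<setting>/<peft version>/
    artifact_dir = os.path.join("tests", "regression", setting_name, peft.__version__)
    with torch.inference_mode():
        output = peft_model(**inputs)[0]

    if CREATION_MODE:
        # Step 1: store the adapter checkpoint and the expected output.
        if os.path.exists(artifact_dir):
            return  # artifacts for this version already exist, don't overwrite them
        peft_model.save_pretrained(artifact_dir)
        torch.save(output, os.path.join(artifact_dir, "output.pt"))
    else:
        # Step 2: compare the current output against the stored expected output.
        expected = torch.load(os.path.join(artifact_dir, "output.pt"))
        torch.testing.assert_close(output, expected)
```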

Creating regression artifacts

The first step will create an adapter checkpoint and an output for the given PEFT version and test setting in a new directory. E.g. it will create a directory `tests/regression/lora_opt-125m_bnb_4bit/0.5.0/` that contains `adapter_model.bin` and `output.pt`.

Before this step runs, there is a check that the git repo is clean (no dirty worktree) and that the commit is tagged (i.e. corresponds to a release version of PEFT). Otherwise, we may accidentally create regression artifacts that do not correspond to any PEFT release.

The easiest way to get such a clean state (say, for PEFT v0.5.0) is by checking out a tagged commit, e.g.:

`git checkout v0.5.0`

before running the first step.
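For reference, the clean/tagged check can be expressed with plain git commands; a rough sketch (not necessarily how the test script implements it):

```python
import subprocess


def assert_clean_tagged_checkout():
    # The worktree must be clean: `git status --porcelain` prints nothing in that case.
    status = subprocess.run(
        ["git", "status", "--porcelain"], capture_output=True, text=True, check=True
    )
    if status.stdout.strip():
        raise RuntimeError("Worktree is dirty; commit or stash your changes before creating artifacts.")

    # HEAD must point at a tagged commit, i.e. correspond to a PEFT release.
    describe = subprocess.run(
        ["git", "describe", "--exact-match", "--tags"], capture_output=True, text=True
    )
    if describe.returncode != 0:
        raise RuntimeError("HEAD is not tagged; check out a release tag first, e.g. `git checkout v0.5.0`.")
```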

The first step will also skip the creation of regression artifacts if they already exist.

It is possible to circumvent all the aforementioned checks by setting the environment variable `REGRESSION_FORCE_MODE` to True like so:

`REGRESSION_FORCE_MODE=True REGRESSION_CREATION_MODE=True pytest tests/regression/test_regression.py`

You should only do this if you know exactly what you're doing.

Running regression tests

The second step is much simpler. It will load the adapters and the output created in the first step, and compare the output to the output from a new PEFT model using the loaded adapter. The outputs should be the same.

If more than one version is discovered for a given test setting, all of them are tested.
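Sketched out (helper and argument names are made up for illustration), this amounts to:

```python
import os

import torch
from peft import PeftModel


def run_regression_test(base_model_factory, inputs, setting_name):
    setting_dir = os.path.join("tests", "regression", setting_name)
    # Every version directory discovered for this setting is tested.
    for version in sorted(os.listdir(setting_dir)):
        artifact_dir = os.path.join(setting_dir, version)
        # Fresh base model for each version, with the stored adapter loaded on top.
        model = PeftModel.from_pretrained(base_model_factory(), artifact_dir)
        with torch.inference_mode():
            output = model(**inputs)[0]
        expected = torch.load(os.path.join(artifact_dir, "output.pt"))
        torch.testing.assert_close(output, expected)
```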

Notes

As is, the adapters are stored in the git repo itself. Since they're relatively small, the total size of the repo is still reasonable. However, it could be better to store those adapters on the HF Hub instead. That would make things a bit more complicated, though (not sure how to parse directories etc. on the Hub).

The regression tests included in this PR were used to check that #1106 still allows loading checkpoints created with PEFT v0.6.1.


@pacman100 (Contributor) left a comment

Thank you @BenjaminBossan for adding the extensive regression tests along with cmd configurations to automatically enable/skip regression tests 🚀!

One major issue for this approach is that it will bloat the package heavily and we should actively avoid that. I would prefer that we create the models and outputs and save them as revisions based on version on HF Hub and load it from there for the regression tests.

BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Nov 14, 2023
Previously, the corresponding tests were testing only whether specific
parameters had requires_grad True or False. Now, all parameters are
being checked. This is more rigorous.

Also, tests for Embedding, Conv1D, Conv2d were added, thus superseding
PR huggingface#1115.

Finally, tests for LoHa and LoKr were added.

Note

I considered moving the tests to a separate module, as they were getting
quite big and this would help with readability. For now, I left them in
the same module because it leads to a better diff view and is thus
easier to review. LMK if I should move the tests to a separate file.
pacman100 pushed a commit that referenced this pull request Nov 14, 2023
@BenjaminBossan
Member Author

> One major issue for this approach is that it will bloat the package heavily and we should actively avoid that. I would prefer that we create the models and outputs and save them as revisions based on version on HF Hub and load it from there for the regression tests.

Yes, this increases the size when cloning the repo, though the pip package is unaffected. I'll investigate how the artifacts can be uploaded to the hub instead.
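A rough sketch of what Hub-based storage could look like (the repo id, layout, and file names here are assumptions for illustration; `upload_folder` and `hf_hub_download` are standard `huggingface_hub` functions):

```python
from huggingface_hub import hf_hub_download, upload_folder

REPO_ID = "peft-internal-testing/regression-tests"  # assumed repo for illustration


def upload_artifacts(local_dir, setting_name, peft_version):
    # Store the artifacts under <setting>/<version>/ inside the Hub repo
    # (assumes you are logged in, e.g. via `huggingface-cli login`).
    upload_folder(
        repo_id=REPO_ID,
        folder_path=local_dir,
        path_in_repo=f"{setting_name}/{peft_version}",
        commit_message=f"Add regression artifacts for {setting_name} @ {peft_version}",
    )


def download_expected_output(setting_name, peft_version):
    # Returns the local path of the cached file.
    return hf_hub_download(
        repo_id=REPO_ID,
        filename=f"{setting_name}/{peft_version}/output.safetensors",
    )
```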

BenjaminBossan added a commit that referenced this pull request Nov 16, 2023
Description

Refactor all tuners (where it applies, i.e. not prompt tuning) to use
the "base layer pattern". This means that the adapter layer will always
hold a reference to the original layer that it modifies. This pattern is
already partly used (e.g. LoRA bnb, gptq layers), now it is consistently
used everywhere when applicable.

This PR is a companion PR to #1069, where I first added these changes.
They are now extracted to a separate PR to make code review easier and
to advance more quickly.

Implementation

The main change is that the adapter layer wraps the original layer and
calls forward on that layer, instead of doing stuff like this:

F.linear(input, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)

which completely circumvents the call to the target layer's forward
method. With the base layer pattern, we now call the target layer's
forward method. Therefore, if the target layer is another adapter
layer (which will be crucial for mixed adapters), we call its forward
method correctly. Also, this should allow passing extra arguments, like
lora_scale to forward.
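Schematically, the base layer pattern for a LoRA Linear looks roughly like this (heavily simplified: no dropout, single adapter, no dtype handling; not the actual PEFT code):

```python
import torch.nn as nn


class LoraLinearSketch(nn.Module):
    """Illustration of the base layer pattern, not the real implementation."""

    def __init__(self, base_layer: nn.Linear, r: int = 8, scaling: float = 1.0):
        super().__init__()
        self.base_layer = base_layer  # keep a reference to the wrapped layer
        self.lora_A = nn.Linear(base_layer.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base_layer.out_features, bias=False)
        self.scaling = scaling

    def forward(self, x):
        # Delegate to the wrapped layer's own forward instead of re-implementing it
        # with F.linear; this also works when base_layer is itself an adapter layer.
        result = self.base_layer(x)
        return result + self.lora_B(self.lora_A(x)) * self.scaling
```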

This change has the nice side benefit that we no longer need to use
_init_empty_weights -- in fact, we don't initialize any of the target
layer's weights anymore, since we have a reference to it. There is thus
no risk of having slow but superfluous initialization of layers.

Moreover, I could greatly simplify merge_and_unload by just using the
base_layer instead of having to create a completely new layer. For
OPT-350m, this results in a 15x speedup.
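Continuing the simplified sketch above, merging then just folds the LoRA delta into the wrapped layer's weight and returns that layer (ignoring fan_in_fan_out and safe_merge):

```python
import torch.nn as nn


def merge_sketch(module) -> nn.Linear:
    # delta_W = B @ A, scaled; same shape as base_layer.weight (out_features x in_features)
    delta = module.lora_B.weight @ module.lora_A.weight * module.scaling
    module.base_layer.weight.data += delta
    # merge_and_unload can then simply return the modified base layer instead of
    # constructing a brand-new nn.Linear and copying weights over.
    return module.base_layer
```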

Note that, same as for the bnb layers, this should be backwards
compatible, since the adapter weights and their state_dicts are not
affected by this change. I used #1115 for regression testing.

Somewhat unrelated changes

During debugging, I got very annoyed with the fact that the reprs of
adapter layers and normal PyTorch layers are hard to distinguish, e.g.
the type is just "Linear". Now, for adapter layers, it is prefixed by
the adapter type, e.g. "lora.Linear". This should have no further
implications except for the repr (e.g. state_dict remains unaffected).

For LoHa and LoKr, I had to change the init of weights when using
init_weights=False. This is because of what is discussed in Numerical
instabilities with LoHa #1058.

IA³ now has the unload method too.

LoHa and LoKr now support safe_merge=True when merging layers.

Migration guide

For 99% of users, the code should continue working as usual, because
the API stays the same. Only low level details have been changed.

Code that relies on isinstance checks on specific PEFT classes may
break. E.g. the LoRA Linear layer no longer inherits from nn.Linear. It
is, however, still a BaseTunerLayer. The same logic applies for other
layer types like Conv2d and for other tuners like IA³.

To retrieve the base layer of an adapter layer, you should now call
module.get_base_layer() if you deal with a BaseTunerLayer. Don't rely on
something like module.weight being present (though it might be).
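For example, a hedged sketch of the recommended unwrapping pattern:

```python
from peft.tuners.tuners_utils import BaseTunerLayer


def get_original_module(module):
    # Don't rely on isinstance(module, nn.Linear) or module.weight for adapter layers;
    # check for the tuner layer interface and unwrap via get_base_layer() instead.
    if isinstance(module, BaseTunerLayer):
        return module.get_base_layer()
    return module
```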
Regression artifacts are now stored on the HF Hub instead of in the git repo. That way, the git repo stays lean.
@BenjaminBossan
Member Author

BenjaminBossan commented Nov 16, 2023

@pacman100 I think your concern should be addressed. I changed the code so that all regression artifacts are stored on HF Hub. You can check them here: https://huggingface.co/peft-internal-testing/regression-tests/tree/main

Note that I had to do a bit of trickery to get the regression artifacts for PEFT v0.6.2 because regression tests need to be run on a clean tagged commit, but of course for v0.6.2 the regression tests didn't exist yet. The output should be correct though. I also changed the code to use safetensors, since that's our new standard.

In the future, if this PR is merged, no trickery should be necessary to create the regression artifacts.

I ran a regression test on current main with the artifacts from v0.6.2 and they all passed.

For now, the regression tests are not run automatically on CI. Once we agree if and how we want to do that, I can add that in a separate PR.

(Note: GitHub showed 57 changed files at first, but those files have been cleaned up, so after a squash and merge they should not be part of the git history. GitHub has since updated the number, it was just slow.)

BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Nov 17, 2023
This PR fixes a handful of issues with AdaLora, should resolve huggingface#1113.

Description

1. `lora_A.weight.device` was called, but for AdaLora, `lora_A` is an
   `nn.Parameter`, not an `nn.Module`, so the weight attribute does not
   exist. `lora_A.device` is sufficient.
2. For 8bit, an inplace operation failed because it was on a view. Now
   the operation is no longer inplace.
3. The loss term of the model output is not necessarily a torch tensor.
   In the test, it was a dict and did not contain an actual loss.
   Therefore, I added a check to make sure the loss is a torch tensor.
   Is there a better way?

Notes

Running `pytest tests/test_gpu_examples.py -k adalora` locally (with GPU)
passes. Ideally, someone else can confirm, as normal unit tests won't
catch this.

If this is merged before huggingface#1115, skipping AdaLora tests in that PR can be
removed.
@younesbelkada (Contributor) left a comment

Thanks a lot @BenjaminBossan for these great efforts!
Can you also add a job to run regression tests on our daily CI for bnb and GPU tests through your new makefile command: https://github.com/huggingface/peft/blob/main/.github/workflows/nightly.yml#L48

Comment on lines 118 to 143
def require_torch_gpu(test_case):
    """
    Decorator marking a test that requires a GPU. Will be skipped when no GPU is available.

    Copied from testing_utils.py.
    """
    if not torch.cuda.is_available():
        return unittest.skip("test requires GPU")(test_case)
    else:
        return test_case


def require_bitsandbytes(test_case):
    """
    Decorator marking a test that requires the bitsandbytes library. Will be skipped when the library is not installed.

    Copied from testing_utils.py.
    """
    try:
        import bitsandbytes  # noqa: F401
    except ImportError:
        return unittest.skip("test requires bitsandbytes")(test_case)
    else:
        return test_case

Suggested change: remove the copied `require_torch_gpu` and `require_bitsandbytes` decorators.

Would it be possible to use the ones from https://github.com/huggingface/peft/blob/main/tests/testing_utils.py?

        save_model(model, name, force=self.force_mode)

    def _assert_results_equal(self, name):
        # TODO: abstract

Shall we remove this TODO?

Comment on lines +600 to +601
# TODO
self.skipTest("Skipping AdaLora for now because of a bug, see #1113")

Suggested change: remove these two lines (the `# TODO` comment and the `skipTest` call).

#1113 is merged, I guess we can remove it.

@BenjaminBossan
Member Author

@younesbelkada Thanks for the review. I tried importing the two copied functions but always get an

ImportError: attempted relative import with no known parent package

If you know a solution to this, please LMK.

Regarding the AdaLora tests: I did a check and they're still failing. Not sure why. I'd propose to investigate those later.

The unnecessary comment was removed.

@younesbelkada (Contributor) left a comment

Thanks! Can we also add a job on the nightly runners to run the regression tests on GPU?

@BenjaminBossan BenjaminBossan merged commit 9fd788b into huggingface:main Dec 6, 2023
14 checks passed
@BenjaminBossan BenjaminBossan deleted the regression-testing-2 branch December 6, 2023 14:07
BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Dec 7, 2023
BenjaminBossan added a commit that referenced this pull request Dec 7, 2023
TaoSunVoyage pushed a commit to TaoSunVoyage/peft that referenced this pull request Dec 14, 2023
TaoSunVoyage pushed a commit to TaoSunVoyage/peft that referenced this pull request Dec 14, 2023