Update CPT documentation #2229

Merged

40 commits merged into huggingface:main on Nov 29, 2024
Conversation

tsachiblau
Contributor

Currently, the CPT model lacks a code example.
In this pull request, I provide an explanation and a code example to address this.

Thanks,
Tsachi

tsachiblau and others added 30 commits October 22, 2024 10:57
… created _cpt_forward for readability, updated copyright to 2024, renamed class to CPTPromptInit, changed config variables to lowercase and list[int], removed exception catch from tests, added assertion docs, removed batch_size=1 test, and renamed test file to test_cpt.py.
…lization in config. Renamed cpt_prompt_tuning_init to cpt_prompt_init. Changed the class from PeftConfig to PromptLearningConfig.

model: Removed check_config function.

peft_model: Fixed bugs.

tests: Added PeftTestConfigManagerForDecoderModels in test_decoder_models.py and testing_common.py.
Member

@BenjaminBossan BenjaminBossan left a comment

Thanks for quickly following up with the example. Overall, this is a very nice notebook. I have some smaller comments, please check them out.

A tip: You should be able to run ruff on the notebook if you want to auto-format it: https://docs.astral.sh/ruff/faq/#does-ruff-support-jupyter-notebooks.

A question I had: is the code for CPTDataset, CPTDataCollatorForLanguageModeling, etc. specific to this example, or is there some reference code? If the latter, it would make sense to link to the reference code.

In addition, I had a bit of trouble getting this notebook to run due to OOM errors after I increased the dataset size. I tried a few common steps to mitigate this but nothing worked:

  • Smaller model: bigscience/bloom-560m
  • Reduced MAX_INPUT_LENGTH to 64
  • Set per_device_eval_batch_size to 1
  • Quantized the model with bitsandbytes (both 8 and 4 bit)

Even with all those steps combined, I couldn't train with 24GB of memory. Somewhat surprisingly, I had to reduce the number of train samples to 50 for training to run. I'd think that with a batch size of 1, it shouldn't matter that much for memory if I have 50 samples or 500. Do you have any idea what could be going on here?
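
For reference, a minimal sketch of what the 4-bit quantized loading mentioned above typically looks like (the exact settings here are assumptions for illustration, not code from this PR or the notebook):

# Sketch: load the smaller model with 4-bit bitsandbytes quantization
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    quantization_config=bnb_config,
    device_map="auto",
)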

@@ -90,4 +90,4 @@ In CPT, only specific context token embeddings are optimized, while the rest of
To prevent overfitting and maintain stability, CPT uses controlled perturbations to limit the allowed changes to context embeddings within a defined range.
Additionally, to address the phenomenon of recency bias—where examples near the end of the context tend to be prioritized over earlier ones—CPT applies a decay loss factor.

-Take a look at [Context-Aware Prompt Tuning for few-shot classification](../task_guides/cpt-few-shot-classification) for a step-by-step guide on how to train a model with CPT.
+Take a look at [Example](../../../examples/cpt_finetuning/README.md) for a step-by-step guide on how to train a model with CPT.
Member

Hmm, I'm not sure if this link is going to work from the built docs. It's better if you link directly to the README, i.e. https://github.com/huggingface/peft/blob/main/examples/cpt_finetuning/README.md (of course, the link won't point anywhere right now, but after merging it will be valid).

@@ -9,6 +9,8 @@ Unless required by applicable law or agreed to in writing, software distributed
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.


Member

Remove?

@@ -21,6 +23,9 @@ The abstract from the paper is:

*Traditional fine-tuning is effective but computationally intensive, as it requires updating billions of parameters. CPT, inspired by ICL, PT, and adversarial attacks, refines context embeddings in a parameter-efficient manner. By optimizing context tokens and applying a controlled gradient descent, CPT achieves superior accuracy across various few-shot classification tasks, showing significant improvement over existing methods such as LoRA, PT, and ICL.*

Take a look at [Example](../../../examples/cpt_finetuning/README.md) for a step-by-step guide on how to train a model with CPT.
Member

Same argument about the link.

Comment on lines 6 to 11
To overcome these challenges, we introduce Context-aware Prompt Tuning (CPT), a method inspired by ICL, Prompt Tuning (PT), and adversarial attacks.
CPT builds on the ICL strategy of concatenating examples before the input, extending it by incorporating PT-like learning to refine the context embedding through iterative optimization, extracting deeper insights from the training examples. Our approach carefully modifies specific context tokens, considering the unique structure of the examples within the context.

In addition to updating the context with PT-like optimization, CPT draws inspiration from adversarial attacks, adjusting the input based on the labels present in the context while preserving the inherent value of the user-provided data.
To ensure robustness and stability during optimization, we employ a projected gradient descent algorithm, constraining token embeddings to remain close to their original values and safeguarding the quality of the context.
Our method has demonstrated superior accuracy across multiple classification tasks using various LLM models, outperforming existing baselines and effectively addressing the overfitting challenge in few-shot learning.
Member

In this section, you use a lot of "we" and "our". Let's try to word it in a more neutral way, as for the reader it could appear like "we" refers to the PEFT maintainers :) So use "The approach" instead of "Our approach" etc.

- Refer to **Section 3.1** of the paper, where template-based tokenization is described as a critical step in structuring inputs for CPT.

#### How it Helps
Templates provide context-aware structure, ensuring the model does not overfit by utilizing structured input-output formats. Using cpt_tokens_type_mask, we gain fine-grained information about the roles of different tokens in the input-output structure. This enables the model to:
Member

Suggested change
Templates provide context-aware structure, ensuring the model does not overfit by utilizing structured input-output formats. Using cpt_tokens_type_mask, we gain fine-grained information about the roles of different tokens in the input-output structure. This enables the model to:
Templates provide context-aware structure, ensuring the model does not overfit by utilizing structured input-output formats. Using `cpt_tokens_type_mask`, we gain fine-grained information about the roles of different tokens in the input-output structure. This enables the model to:

"cell_type": "markdown",
"source": [
"# CPT Training and Inference\n",
"This notebook demonstrates the training and evaluation process of Context-Aware Prompt Tuning (CPT) using the Hugging Face Trainer.\n",
Member

It could be helpful to link the paper here.

"}\n",
"\n",
"# Initialize the dataset\n",
"CPT_train_dataset = CPTDataset(train_dataset, tokenizer, templates)\n",
Member

Let's not capitalize here: cpt_train_dataset

"source": [
"# Load a pre-trained causal language model\n",
"base_model = AutoModelForCausalLM.from_pretrained(\n",
" 'bigscience/bloom-1b7',\n",
Member

Instead of hard-coding the model id here, can we re-use the tokenizer_name_or_path variable? Of course, it should be renamed in this case, e.g. to model_id.
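
For example, something along these lines (a sketch; model_id is just the suggested name):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'bigscience/bloom-1b7'

# Reuse the same identifier for both the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_id)
base_model = AutoModelForCausalLM.from_pretrained(model_id)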

Comment on lines 234 to 237
"train_dataset = dataset['train'].select(range(4)).map(add_string_labels)\n",
"\n",
"# Subset and process the validation dataset\n",
"test_dataset = dataset['validation'].select(range(20)).map(add_string_labels)\n"
Member

I assume you chose small subsets to make the notebook run fast. But maybe a little bit more would also be okay? Also, let's add a sentence here that for proper testing, users should use the whole dataset. Maybe there can even be a toggle that users can enable to use the full datasets.
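
One possible shape for such a toggle (a sketch; USE_FULL_DATASET is a hypothetical name, the subset sizes are taken from the snippet above):

USE_FULL_DATASET = False  # set to True for a proper evaluation on the full splits

if USE_FULL_DATASET:
    train_dataset = dataset['train'].map(add_string_labels)
    test_dataset = dataset['validation'].map(add_string_labels)
else:
    # Small subsets keep the notebook fast to run
    train_dataset = dataset['train'].select(range(4)).map(add_string_labels)
    test_dataset = dataset['validation'].select(range(20)).map(add_string_labels)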

Comment on lines 529 to 530
" trust_remote_code=True,\n",
" local_files_only=False,\n",
Member

Can we delete these arguments?

@tsachiblau
Contributor Author

A tip: You should be able to run ruff on the notebook if you want to auto-format it: https://docs.astral.sh/ruff/faq/#does-ruff-support-jupyter-notebooks.

Done

A question I had: is the code for CPTDataset, CPTDataCollatorForLanguageModeling, etc. specific to this example, or is there some reference code? If the latter, it would make sense to link to the reference code.

Not specific to this example, but rather to our method.

Even with all those steps combined, I couldn't train with 24GB of memory. Somewhat surprisingly, I had to reduce the number of train samples to 50 for training to run. I'd think that with a batch size of 1, it shouldn't matter that much for memory if I have 50 samples or 500. Do you have any idea what could be going on here?

This is true. Our method can become computationally expensive as the number of examples grows, as we concatenate all examples to the input and utilize them all during optimization.

Hmm, I'm not sure if this link is going to work from the built docs. It's better if you link directly to the README, i.e. https://github.com/huggingface/peft/blob/main/examples/cpt_finetuning/README.md (of course, the link won't point anywhere right now, but after merging it will be valid).

Done

Remove?

Done

Same argument about the link.

Done

In this section, you use a lot of "we" and "our". Let's try to word it in a more neutral way, as for the reader it could appear like "we" refers to the PEFT maintainers :) So use "The approach" instead of "Our approach" etc.

Done

cpt_tokens_type_mask

Done

It could be helpful to link the paper here.

Done

Let's not capitalize here: cpt_train_dataset

Done

Instead of hard-coding the model id here, can we re-use the tokenizer_name_or_path variable? Of course, it should be renamed in this case, e.g. to model_id.

Done

I assume you chose small subsets to make the notebook run fast. But maybe a little bit more would also be okay? Also, let's add a sentence here that for proper testing, users should use the whole dataset. Maybe there can even be a toggle that users can enable to use the full datasets.

Done

Can we delete these arguments?

Done

@BenjaminBossan
Member

Thanks a lot for the updates.

This is true. Our method can become computationally expensive as the number of examples grows, as we concatenate all examples to the input and utilize them all during optimization.

Could you expand on this a little bit? I wonder if we could optimize this (e.g. via batching), since PEFT is intended for use with limited memory and ideally, an example as this one should be able to run with 24GB VRAM.

Moreover, for some reason, the updated notebook does not render, I get:

'execution_count' is a required property
Using nbformat v5.10.4 and nbconvert v7.16.1

Could you please check that everything is correct in the notebook? Perhaps, rerunning it from the start with an up-to-date jupyter version is enough.

@tsachiblau
Contributor Author

Could you expand on this a little bit? I wonder if we could optimize this (e.g. via batching), since PEFT is intended for use with limited memory and ideally, an example as this one should be able to run with 24GB VRAM.

CPT is designed as a solution for few-shot scenarios, typically involving up to tens of examples and not exceeding this range. However, the high memory usage you experienced can be attributed to the following key factors:

Self-Attention Mechanism:
The self-attention mechanism in transformers has a quadratic memory requirement with respect to the input sequence length. Concatenating multiple examples into the input significantly increases the sequence length, leading to a dramatic rise in memory usage during both forward and backward passes. This is a fundamental limitation of the self-attention mechanism in large transformer models and directly impacts the feasibility of handling very long inputs.

Gradient Storage:
During optimization, memory consumption scales with the number of tokens involved in the loss calculation. In CPT, we utilize more tokens for the loss by including not only the target examples but also tokens from the context. This ensures that the optimization process benefits from the additional information available in the context, but it also increases memory requirements. For each token used in the loss, gradients and activations need to be stored, leading to a proportional increase in memory consumption as more tokens are included.

Could you please check that everything is correct in the notebook? Perhaps, rerunning it from the start with an up-to-date jupyter version is enough.

Does it work now? I checked it on Colab.

@BenjaminBossan
Member

typically involving up to tens of examples and not exceeding this range

So this means that as a user, I should not use more than a few tens of examples? If so, let's make this very clear in the docs.

If I use a dataset of 1000 samples, all of them would be used as few shot examples? Would it make sense to add an argument to limit this number?

Does it work now? I checked it on Colab.

Yes, this works now, thanks for the update.

@tsachiblau
Contributor Author

tsachiblau commented Nov 28, 2024

Thank you for the suggestion, and I will ensure this limitation is clearly stated in the documentation.

Similar to In-Context Learning (ICL), there is indeed a practical limit to the number of examples that can be used in the context due to memory constraints. However, users can choose to select a subset of examples for the context while using the remaining examples purely for optimization. While this is not the original design of our method, it could serve as an adaptation to scale the approach for larger datasets with more examples.

Here’s a suggestion to integrate the functionality of selecting a subset of examples directly into your code. This adjustment allows users to limit the number of examples chosen for the context dynamically:

# context_ids, context_attention_mask, context_input_type_mask and
# first_type_mask are assumed to be initialized earlier in the notebook.

# Iterate through the CPT training dataset
for i in range(len(cpt_train_dataset)):
    # Add input IDs to the context
    context_ids += cpt_train_dataset[i]['input_ids']

    # Add attention mask to the context
    context_attention_mask += cpt_train_dataset[i]['attention_mask']

    # Adjust and add the input type mask to the context
    context_input_type_mask += [
        t + first_type_mask if t > 0 else 0  # Increment type indices dynamically
        for t in cpt_train_dataset[i]['input_type_mask']
    ]

    # Increment the type mask offset after processing the sample
    first_type_mask += 4

@BenjaminBossan
Member

Thanks for the update to the docs and explaining further.

Here’s a suggestion to integrate the functionality of selecting a subset of examples directly into your code. This adjustment allows users to limit the number of examples chosen for the context dynamically:

To clarify, is the selection of the subset already part of this snippet (which AFAICT is identical to the code in the notebook)? It looks like it iterates through the whole dataset. My guess is that if I wanted to set a limit, I would break out of the loop after a couple of iterations.

Alternatively, my suggestion would be to define a constant like MAX_ICL_SAMPLES = 10 in the notebook and then define a separate icl_dataset with MAX_ICL_SAMPLES samples, which is distinct from the train_dataset (as I don't think it really helps if the few shot samples and the train dataset overlap). Then the first 10 samples would be used for ICL and the train dataset would start after sample 10 and could be much bigger without blowing up memory. WDYT?
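
In code, that split could look roughly like this (a sketch; the constant name and the size of the training slice are placeholders):

MAX_ICL_SAMPLES = 10  # few-shot examples that go into the context

# The first samples provide the in-context examples; the training set starts
# after them so the two do not overlap.
icl_dataset = dataset['train'].select(range(MAX_ICL_SAMPLES)).map(add_string_labels)
train_dataset = dataset['train'].select(
    range(MAX_ICL_SAMPLES, MAX_ICL_SAMPLES + 1000)
).map(add_string_labels)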

@tsachiblau
Contributor Author

Alternatively, my suggestion would be to define a constant like MAX_ICL_SAMPLES = 10 in the notebook and then define a separate icl_dataset with MAX_ICL_SAMPLES samples, which is distinct from the train_dataset (as I don't think it really helps if the few shot samples and the train dataset overlap). Then the first 10 samples would be used for ICL and the train dataset would start after sample 10 and could be much bigger without blowing up memory. WDYT?

Done.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@BenjaminBossan BenjaminBossan left a comment


Thanks for adding the documentation and example for CPT. Also nice addition of the ICL limit, I could now successfully run the notebook even with 1000 samples and had a lot of memory to spare.

@BenjaminBossan BenjaminBossan merged commit 3f9ce55 into huggingface:main Nov 29, 2024
14 checks passed
@tsachiblau
Contributor Author

Thanks so much again! :)

Currently, the link to the example page is https://github.com/huggingface/peft/blob/main/examples/cpt_finetuning/README.md, which points to the GitHub repository. Could we instead have a link on the official website, similar to this one: https://huggingface.co/docs/peft/main/en/package_reference/cpt?

@BenjaminBossan
Member

It's a pleasure.

We can always make changes like that, but I'm not sure which link exactly you want to change. The links to https://github.com/huggingface/peft/blob/main/examples/cpt_finetuning/README.md are specifically for examples, whereas https://huggingface.co/docs/peft/main/en/package_reference/cpt is the more technical documentation and serves a different purpose. So I don't think changing the link makes sense here. Maybe I'm missing something?

@tsachiblau
Contributor Author

No, it is fine 👍
