Update CPT documentation #2229
Conversation
… created _cpt_forward for readability, updated copyright to 2024, renamed class to CPTPromptInit, changed config variables to lowercase and list[int], removed exception catch from tests, added assertion docs, removed batch_size=1 test, and renamed test file to test_cpt.py.
…lization in config. Renamed cpt_prompt_tuning_init to cpt_prompt_init. Changed the class from PeftConfig to PromptLearningConfig. model: Removed check_config function. peft_model: Fixed bugs. tests: Added PeftTestConfigManagerForDecoderModels in test_decoder_models.py and testing_common.py.
…dded into _toctree.yml.
Check out this pull request on ReviewNB: see visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB
Thanks for quickly following up with the example. Overall, this is a very nice notebook. I have some smaller comments, please check them out.
A tip: You should be able to run ruff on the notebook if you want to auto-format it: https://docs.astral.sh/ruff/faq/#does-ruff-support-jupyter-notebooks.
A question I had: is the code for `CPTDataset`, `CPTDataCollatorForLanguageModeling`, etc. specific to this example, or is there some reference code? If the latter, it would make sense to put a link to the reference code.
In addition, I had a bit of trouble getting this notebook to run due to OOM errors after I increased the dataset size. I tried a few common steps to mitigate this but nothing worked:
- Smaller model: `bigscience/bloom-560m`
- Reduced `MAX_INPUT_LENGTH` to 64
- Set `per_device_eval_batch_size` to 1
- Quantized the model with bitsandbytes (both 8 and 4 bit)
Even with all those steps combined, I couldn't train with 24GB of memory. Somewhat surprisingly, I had to reduce the number of train samples to 50 for training to run. I'd think that with a batch size of 1, it shouldn't matter that much for memory if I have 50 samples or 500. Do you have any idea what could be going on here?
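For context, here is a rough sketch of the quantized-loading step described above (the model id is the one mentioned in the list; the remaining bitsandbytes settings are assumptions, not taken from the notebook):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "bigscience/bloom-560m"  # smaller model tried as a mitigation

# 4-bit quantization via bitsandbytes; the compute dtype is an assumption
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
```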
@@ -90,4 +90,4 @@ In CPT, only specific context token embeddings are optimized, while the rest of
To prevent overfitting and maintain stability, CPT uses controlled perturbations to limit the allowed changes to context embeddings within a defined range.
Additionally, to address the phenomenon of recency bias—where examples near the end of the context tend to be prioritized over earlier ones—CPT applies a decay loss factor.

-Take a look at [Context-Aware Prompt Tuning for few-shot classification](../task_guides/cpt-few-shot-classification) for a step-by-step guide on how to train a model with CPT.
+Take a look at [Example](../../../examples/cpt_finetuning/README.md) for a step-by-step guide on how to train a model with CPT.
Hmm, I'm not sure if this link is going to work from the built docs. It's better if you link directly to the README, i.e. https://github.com/huggingface/peft/blob/main/examples/cpt_finetuning/README.md
(of course, the link won't point anywhere right now, but after merging it will be valid).
docs/source/package_reference/cpt.md
Outdated
@@ -9,6 +9,8 @@ Unless required by applicable law or agreed to in writing, software distributed
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Remove?
docs/source/package_reference/cpt.md
Outdated
@@ -21,6 +23,9 @@ The abstract from the paper is:

*Traditional fine-tuning is effective but computationally intensive, as it requires updating billions of parameters. CPT, inspired by ICL, PT, and adversarial attacks, refines context embeddings in a parameter-efficient manner. By optimizing context tokens and applying a controlled gradient descent, CPT achieves superior accuracy across various few-shot classification tasks, showing significant improvement over existing methods such as LoRA, PT, and ICL.*

+Take a look at [Example](../../../examples/cpt_finetuning/README.md) for a step-by-step guide on how to train a model with CPT.
Same argument about the link.
examples/cpt_finetuning/README.md
Outdated
To overcome these challenges, we introduce Context-aware Prompt Tuning (CPT), a method inspired by ICL, Prompt Tuning (PT), and adversarial attacks.
CPT builds on the ICL strategy of concatenating examples before the input, extending it by incorporating PT-like learning to refine the context embedding through iterative optimization, extracting deeper insights from the training examples. Our approach carefully modifies specific context tokens, considering the unique structure of the examples within the context.

In addition to updating the context with PT-like optimization, CPT draws inspiration from adversarial attacks, adjusting the input based on the labels present in the context while preserving the inherent value of the user-provided data.
To ensure robustness and stability during optimization, we employ a projected gradient descent algorithm, constraining token embeddings to remain close to their original values and safeguarding the quality of the context.
Our method has demonstrated superior accuracy across multiple classification tasks using various LLM models, outperforming existing baselines and effectively addressing the overfitting challenge in few-shot learning.
In this section, you use a lot of "we" and "our". Let's try to word it in a more neutral way, as for the reader it could appear like "we" refers to the PEFT maintainers :) So use "The approach" instead of "Our approach" etc.
examples/cpt_finetuning/README.md
Outdated
- Refer to **Section 3.1** of the paper, where template-based tokenization is described as a critical step in structuring inputs for CPT.

#### How it Helps
Templates provide context-aware structure, ensuring the model does not overfit by utilizing structured input-output formats. Using cpt_tokens_type_mask, we gain fine-grained information about the roles of different tokens in the input-output structure. This enables the model to:
Suggested change (wrap the identifier in backticks):
Templates provide context-aware structure, ensuring the model does not overfit by utilizing structured input-output formats. Using `cpt_tokens_type_mask`, we gain fine-grained information about the roles of different tokens in the input-output structure. This enables the model to:
"cell_type": "markdown", | ||
"source": [ | ||
"# CPT Training and Inference\n", | ||
"This notebook demonstrates the training and evaluation process of Context-Aware Prompt Tuning (CPT) using the Hugging Face Trainer.\n", |
It could be helpful to link the paper here.
"}\n", | ||
"\n", | ||
"# Initialize the dataset\n", | ||
"CPT_train_dataset = CPTDataset(train_dataset, tokenizer, templates)\n", |
Let's not capitalize here: `cpt_train_dataset`
"source": [ | ||
"# Load a pre-trained causal language model\n", | ||
"base_model = AutoModelForCausalLM.from_pretrained(\n", | ||
" 'bigscience/bloom-1b7',\n", |
Instead of hard-coding the model id here, can we re-use the `tokenizer_name_or_path` variable? Of course, it should be renamed in this case, e.g. to `model_id`.
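A minimal sketch of that suggestion (the model id is the one from the cell above; the remaining keyword arguments of the original call are omitted here):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# One shared identifier instead of separate hard-coded strings
model_id = "bigscience/bloom-1b7"

tokenizer = AutoTokenizer.from_pretrained(model_id)
base_model = AutoModelForCausalLM.from_pretrained(model_id)
```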
"train_dataset = dataset['train'].select(range(4)).map(add_string_labels)\n", | ||
"\n", | ||
"# Subset and process the validation dataset\n", | ||
"test_dataset = dataset['validation'].select(range(20)).map(add_string_labels)\n" |
I assume you chose small subsets to make the notebook run fast. But maybe a little bit more would also be okay? Also, let's add a sentence here that for proper testing, users should use the whole dataset. Maybe there can even be a toggle that users can enable to use the full datasets.
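One possible shape for such a toggle (a sketch: `USE_FULL_DATASET` and the subset sizes are illustrative; only `dataset` and `add_string_labels` come from the notebook):

```python
USE_FULL_DATASET = False  # set to True for a proper evaluation on the full splits

train_split = dataset["train"]
val_split = dataset["validation"]
if not USE_FULL_DATASET:
    # Small subsets keep the notebook fast; results will be noisy
    train_split = train_split.select(range(16))
    val_split = val_split.select(range(100))

train_dataset = train_split.map(add_string_labels)
test_dataset = val_split.map(add_string_labels)
```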
" trust_remote_code=True,\n", | ||
" local_files_only=False,\n", |
Can we delete these arguments?
Done
Not specific to this example, but rather to our method.
This is true. Our method can become computationally expensive as the number of examples grows, as we concatenate all examples to the input and utilize them all during optimization.
Done
Done
Done
Done
Done
Done
Done
Done
Done
Done |
Thanks a lot for the updates.
Could you expand on this a little bit? I wonder if we could optimize this (e.g. via batching), since PEFT is intended for use with limited memory and ideally, an example as this one should be able to run with 24GB VRAM. Moreover, for some reason, the updated notebook does not render, I get:
Could you please check that everything is correct in the notebook? Perhaps rerunning it from the start with an up-to-date jupyter version is enough.
CPT is designed as a solution for few-shot scenarios, typically involving up to tens of examples and not exceeding this range. However, the high memory usage you experienced can be attributed to the following key factors:
- Self-Attention Mechanism: all examples are concatenated into a single context, so the attention computation grows quickly with the total context length.
- Gradient Storage: the full concatenated context is used throughout optimization, so its gradients must be kept in memory as well.
Does it work now? I checked it on Colab.
So this means that as a user, I should not use more than a few tens of examples? If so, let's make this very clear in the docs. If I use a dataset of 1000 samples, all of them would be used as few shot examples? Would it make sense to add an argument to limit this number?
Yes, this works now, thanks for the update.
Thank you for the suggestion, and I will ensure this limitation is clearly stated in the documentation. Similar to In-Context Learning (ICL), there is indeed a practical limit to the number of examples that can be used in the context due to memory constraints. However, users can choose to select a subset of examples for the context while using the remaining examples purely for optimization. While this is not the original design of our method, it could serve as an adaptation to scale the approach for larger datasets with more examples. Here’s a suggestion to integrate the functionality of selecting a subset of examples directly into your code. This adjustment allows users to limit the number of examples chosen for the context dynamically:

```python
# Iterate through the CPT training dataset
for i in range(len(cpt_train_dataset)):
    # Add input IDs to the context
    context_ids += cpt_train_dataset[i]['input_ids']
    # Add attention mask to the context
    context_attention_mask += cpt_train_dataset[i]['attention_mask']
    # Adjust and add the input type mask to the context
    context_input_type_mask += [
        i + first_type_mask if i > 0 else 0  # Increment type indices dynamically
        for i in cpt_train_dataset[i]['input_type_mask']
    ]
    # Increment the type mask offset after processing the sample
    first_type_mask += 4
```
Thanks for the update to the docs and explaining further.
To clarify, is the selection of the subset already part of this snippet (which AFAICT is identical to the code in the notebook)? It looks like it iterates through the whole dataset. My guess is that if I wanted to set a limit, I would break out of the loop after a couple of iterations. Alternatively, my suggestion would be to define a constant like …
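For illustration, a sketch of that kind of limit (the constant name `MAX_CONTEXT_EXAMPLES` and its value are hypothetical, not from the thread; the loop assumes the same setup as the snippet above):

```python
MAX_CONTEXT_EXAMPLES = 10  # hypothetical cap on how many samples enter the context

# Only the first MAX_CONTEXT_EXAMPLES samples are concatenated into the context
for i in range(min(len(cpt_train_dataset), MAX_CONTEXT_EXAMPLES)):
    context_ids += cpt_train_dataset[i]['input_ids']
    context_attention_mask += cpt_train_dataset[i]['attention_mask']
    context_input_type_mask += [
        t + first_type_mask if t > 0 else 0  # shift type indices per sample
        for t in cpt_train_dataset[i]['input_type_mask']
    ]
    first_type_mask += 4
```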
Done.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for adding the documentation and example for CPT. Also nice addition of the ICL limit, I could now successfully run the notebook even with 1000 samples and had a lot of memory to spare.
Thanks so much again! :) Currently, the link to the example page is: https://github.com/huggingface/peft/blob/main/examples/cpt_finetuning/README.md, which points to the GitHub repository. Could we instead have a link on the official website, similar to this one: https://huggingface.co/docs/peft/main/en/package_reference/cpt?
It's a pleasure. We can always make changes like that, but I'm not sure which link exactly you want to change. The links to https://github.com/huggingface/peft/blob/main/examples/cpt_finetuning/README.md are specifically for examples, whereas https://huggingface.co/docs/peft/main/en/package_reference/cpt is the more technical documentation and thus serves a different purpose. Thus, I don't think changing the link makes sense here. Maybe I'm missing something?
No, it is fine 👍
Currently, the CPT model lacks a code example.
In this pull request, I provide an explanation and a code example to address this.
Thanks,
Tsachi