Update CPT documentation #2229

Merged

40 commits merged into huggingface:main on Nov 29, 2024
Conversation

tsachiblau
Contributor

Currently, the CPT model lacks a code example.
In this pull request, I provide an explanation and a code example to address this.

Thanks,
Tsachi

tsachiblau and others added 30 commits October 22, 2024 10:57
… created _cpt_forward for readability, updated copyright to 2024, renamed class to CPTPromptInit, changed config variables to lowercase and list[int], removed exception catch from tests, added assertion docs, removed batch_size=1 test, and renamed test file to test_cpt.py.
…lization in config. Renamed cpt_prompt_tuning_init to cpt_prompt_init. Changed the class from PeftConfig to PromptLearningConfig.

model: Removed check_config function.

peft_model: Fixed bugs.

tests: Added PeftTestConfigManagerForDecoderModels in test_decoder_models.py and testing_common.py.
Member

@BenjaminBossan BenjaminBossan left a comment

Thanks for quickly following up with the example. Overall, this is a very nice notebook. I have some smaller comments, please check them out.

A tip: You should be able to run ruff on the notebook if you want to auto-format it: https://docs.astral.sh/ruff/faq/#does-ruff-support-jupyter-notebooks.

A question I had: is the code for CPTDataset, CPTDataCollatorForLanguageModeling, etc. specific to this example, or is there some reference code? If the latter, it would make sense to link to the reference code.

In addition, I had a bit of trouble getting this notebook to run due to OOM errors after I increased the dataset size. I tried a few common steps to mitigate this but nothing worked:

  • Smaller model: bigscience/bloom-560m
  • Reduced MAX_INPUT_LENGTH to 64
  • Set per_device_eval_batch_size to 1
  • Quantized the model with bitsandbytes (both 8 and 4 bit)

Even with all those steps combined, I couldn't train with 24GB of memory. Somewhat surprisingly, I had to reduce the number of train samples to 50 for training to run. I'd think that with a batch size of 1, it shouldn't matter that much for memory if I have 50 samples or 500. Do you have any idea what could be going on here?
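
For reference, a minimal sketch of what the 4-bit quantized loading mentioned above typically looks like (the exact settings here are assumptions for illustration, not code from this PR or the notebook):

# Sketch: load the smaller model with 4-bit bitsandbytes quantization
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    quantization_config=bnb_config,
    device_map="auto",
)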

@@ -90,4 +90,4 @@ In CPT, only specific context token embeddings are optimized, while the rest of
To prevent overfitting and maintain stability, CPT uses controlled perturbations to limit the allowed changes to context embeddings within a defined range.
Additionally, to address the phenomenon of recency bias—where examples near the end of the context tend to be prioritized over earlier ones—CPT applies a decay loss factor.

-Take a look at [Context-Aware Prompt Tuning for few-shot classification](../task_guides/cpt-few-shot-classification) for a step-by-step guide on how to train a model with CPT.
+Take a look at [Example](../../../examples/cpt_finetuning/README.md) for a step-by-step guide on how to train a model with CPT.
Member

Hmm, I'm not sure if this link is going to work from the built docs. It's better if you link directly to the README, i.e. https://github.com/huggingface/peft/blob/main/examples/cpt_finetuning/README.md (of course, the link won't point anywhere right now, but after merging it will be valid).

@@ -9,6 +9,8 @@ Unless required by applicable law or agreed to in writing, software distributed
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.


Member

Remove?

@@ -21,6 +23,9 @@ The abstract from the paper is:

*Traditional fine-tuning is effective but computationally intensive, as it requires updating billions of parameters. CPT, inspired by ICL, PT, and adversarial attacks, refines context embeddings in a parameter-efficient manner. By optimizing context tokens and applying a controlled gradient descent, CPT achieves superior accuracy across various few-shot classification tasks, showing significant improvement over existing methods such as LoRA, PT, and ICL.*

Take a look at [Example](../../../examples/cpt_finetuning/README.md) for a step-by-step guide on how to train a model with CPT.
Member

Same argument about the link.

Comment on lines 6 to 11
To overcome these challenges, we introduce Context-aware Prompt Tuning (CPT), a method inspired by ICL, Prompt Tuning (PT), and adversarial attacks.
CPT builds on the ICL strategy of concatenating examples before the input, extending it by incorporating PT-like learning to refine the context embedding through iterative optimization, extracting deeper insights from the training examples. Our approach carefully modifies specific context tokens, considering the unique structure of the examples within the context.

In addition to updating the context with PT-like optimization, CPT draws inspiration from adversarial attacks, adjusting the input based on the labels present in the context while preserving the inherent value of the user-provided data.
To ensure robustness and stability during optimization, we employ a projected gradient descent algorithm, constraining token embeddings to remain close to their original values and safeguarding the quality of the context.
Our method has demonstrated superior accuracy across multiple classification tasks using various LLM models, outperforming existing baselines and effectively addressing the overfitting challenge in few-shot learning.
Member

In this section, you use a lot of "we" and "our". Let's try to word it in a more neutral way, as for the reader it could appear like "we" refers to the PEFT maintainers :) So use "The approach" instead of "Our approach" etc.

- Refer to **Section 3.1** of the paper, where template-based tokenization is described as a critical step in structuring inputs for CPT.

#### How it Helps
Templates provide context-aware structure, ensuring the model does not overfit by utilizing structured input-output formats. Using cpt_tokens_type_mask, we gain fine-grained information about the roles of different tokens in the input-output structure. This enables the model to:
Member

Suggested change
Templates provide context-aware structure, ensuring the model does not overfit by utilizing structured input-output formats. Using cpt_tokens_type_mask, we gain fine-grained information about the roles of different tokens in the input-output structure. This enables the model to:
Templates provide context-aware structure, ensuring the model does not overfit by utilizing structured input-output formats. Using `cpt_tokens_type_mask`, we gain fine-grained information about the roles of different tokens in the input-output structure. This enables the model to:

"cell_type": "markdown",
"source": [
"# CPT Training and Inference\n",
"This notebook demonstrates the training and evaluation process of Context-Aware Prompt Tuning (CPT) using the Hugging Face Trainer.\n",
Member

It could be helpful to link the paper here.

"}\n",
"\n",
"# Initialize the dataset\n",
"CPT_train_dataset = CPTDataset(train_dataset, tokenizer, templates)\n",
Member

Let's not capitalize here: cpt_train_dataset

"source": [
"# Load a pre-trained causal language model\n",
"base_model = AutoModelForCausalLM.from_pretrained(\n",
" 'bigscience/bloom-1b7',\n",
Member

Instead of hard-coding the model id here, can we re-use the tokenizer_name_or_path variable? Of course, it should be renamed in this case, e.g. to model_id.
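
For example, something along these lines (a sketch; model_id is just the suggested name):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'bigscience/bloom-1b7'

# Reuse the same identifier for both the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_id)
base_model = AutoModelForCausalLM.from_pretrained(model_id)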

Comment on lines 234 to 237
"train_dataset = dataset['train'].select(range(4)).map(add_string_labels)\n",
"\n",
"# Subset and process the validation dataset\n",
"test_dataset = dataset['validation'].select(range(20)).map(add_string_labels)\n"
Member

I assume you chose small subsets to make the notebook run fast. But maybe a little bit more would also be okay? Also, let's add a sentence here that for proper testing, users should use the whole dataset. Maybe there can even be a toggle that users can enable to use the full datasets.
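
One possible shape for such a toggle (a sketch; USE_FULL_DATASET is a hypothetical name, the subset sizes are taken from the snippet above):

USE_FULL_DATASET = False  # set to True for a proper evaluation on the full splits

if USE_FULL_DATASET:
    train_dataset = dataset['train'].map(add_string_labels)
    test_dataset = dataset['validation'].map(add_string_labels)
else:
    # Small subsets keep the notebook fast to run
    train_dataset = dataset['train'].select(range(4)).map(add_string_labels)
    test_dataset = dataset['validation'].select(range(20)).map(add_string_labels)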

Comment on lines 529 to 530
" trust_remote_code=True,\n",
" local_files_only=False,\n",
Member

Can we delete these arguments?

@tsachiblau
Contributor Author

A tip: You should be able to run ruff on the notebook if you want to auto-format it: https://docs.astral.sh/ruff/faq/#does-ruff-support-jupyter-notebooks.

Done

A question I had: is the code for CPTDataset, CPTDataCollatorForLanguageModeling, etc. specific to this example, or is there some reference code? If the latter, it would make sense to link to the reference code.

Not specific to this example, but rather to our method.

Even with all those steps combined, I couldn't train with 24GB of memory. Somewhat surprisingly, I had to reduce the number of train samples to 50 for training to run. I'd think that with a batch size of 1, it shouldn't matter that much for memory if I have 50 samples or 500. Do you have any idea what could be going on here?

This is true. Our method can become computationally expensive as the number of examples grows, as we concatenate all examples to the input and utilize them all during optimization.

Hmm, I'm not sure if this link is going to work from the built docs. It's better if you link directly to the README, i.e. https://github.com/huggingface/peft/blob/main/examples/cpt_finetuning/README.md (of course, the link won't point anywhere right now, but after merging it will be valid).

Done

Remove?

Done

Same argument about the link.

Done

In this section, you use a lot of "we" and "our". Let's try to word it in a more neutral way, as for the reader it could appear like "we" refers to the PEFT maintainers :) So use "The approach" instead of "Our approach" etc.

Done

cpt_tokens_type_mask

Done

It could be helpful to link the paper here.

Done

Let's not capitalize here: cpt_train_dataset

Done

Instead of hard-coding the model id here, can we re-use the tokenizer_name_or_path variable? Of course, it should be renamed in this case, e.g. to model_id.

Done

I assume you chose small subsets to make the notebook run fast. But maybe a little bit more would also be okay? Also, let's add a sentence here that for proper testing, users should use the whole dataset. Maybe there can even be a toggle that users can enable to use the full datasets.

Done

Can we delete these arguments?

Done

@BenjaminBossan
Member

Thanks a lot for the updates.

This is true. Our method can become computationally expensive as the number of examples grows, as we concatenate all examples to the input and utilize them all during optimization.

Could you expand on this a little bit? I wonder if we could optimize this (e.g. via batching), since PEFT is intended for use with limited memory and ideally, an example as this one should be able to run with 24GB VRAM.

Moreover, for some reason, the updated notebook does not render, I get:

'execution_count' is a required property
Using nbformat v5.10.4 and nbconvert v7.16.1

Could you please check that everything is correct in the notebook? Perhaps, rerunning it from the start with an up-to-date jupyter version is enough.

@tsachiblau
Contributor Author

Could you expand on this a little bit? I wonder if we could optimize this (e.g. via batching), since PEFT is intended for use with limited memory and ideally, an example as this one should be able to run with 24GB VRAM.

CPT is designed as a solution for few-shot scenarios, typically involving up to tens of examples and not exceeding this range. However, the high memory usage you experienced can be attributed to the following key factors:

Self-Attention Mechanism:
The self-attention mechanism in transformers has a quadratic memory requirement with respect to the input sequence length. Concatenating multiple examples into the input significantly increases the sequence length, leading to a dramatic rise in memory usage during both forward and backward passes. This is a fundamental limitation of the self-attention mechanism in large transformer models and directly impacts the feasibility of handling very long inputs.

Gradient Storage:
During optimization, memory consumption scales with the number of tokens involved in the loss calculation. In CPT, we utilize more tokens for the loss by including not only the target examples but also tokens from the context. This ensures that the optimization process benefits from the additional information available in the context, but it also increases memory requirements. For each token used in the loss, gradients and activations need to be stored, leading to a proportional increase in memory consumption as more tokens are included.

Could you please check that everything is correct in the notebook? Perhaps, rerunning it from the start with an up-to-date jupyter version is enough.

Does it work now? I checked it on Colab.

@BenjaminBossan
Member

typically involving up to tens of examples and not exceeding this range

So this means that as a user, I should not use more than a few tens of examples? If so, let's make this very clear in the docs.

If I use a dataset of 1000 samples, all of them would be used as few shot examples? Would it make sense to add an argument to limit this number?

Does it work now? I checked it on Colab.

Yes, this works now, thanks for the update.

@tsachiblau
Contributor Author

tsachiblau commented Nov 28, 2024

Thank you for the suggestion, and I will ensure this limitation is clearly stated in the documentation.

Similar to In-Context Learning (ICL), there is indeed a practical limit to the number of examples that can be used in the context due to memory constraints. However, users can choose to select a subset of examples for the context while using the remaining examples purely for optimization. While this is not the original design of our method, it could serve as an adaptation to scale the approach for larger datasets with more examples.

Here’s a suggestion to integrate the functionality of selecting a subset of examples directly into your code. This adjustment allows users to limit the number of examples chosen for the context dynamically:

# context_ids, context_attention_mask, context_input_type_mask and
# first_type_mask are assumed to be initialized earlier in the notebook.

# Iterate through the CPT training dataset
for i in range(len(cpt_train_dataset)):
    # Add input IDs to the context
    context_ids += cpt_train_dataset[i]['input_ids']

    # Add attention mask to the context
    context_attention_mask += cpt_train_dataset[i]['attention_mask']

    # Adjust and add the input type mask to the context
    context_input_type_mask += [
        t + first_type_mask if t > 0 else 0  # Increment type indices dynamically
        for t in cpt_train_dataset[i]['input_type_mask']
    ]

    # Increment the type mask offset after processing the sample
    first_type_mask += 4

@BenjaminBossan
Member

Thanks for the update to the docs and explaining further.

Here’s a suggestion to integrate the functionality of selecting a subset of examples directly into your code. This adjustment allows users to limit the number of examples chosen for the context dynamically:

To clarify, is the selection of the subset already part of this snippet (which AFAICT is identical to the code in the notebook)? It looks like it iterates through the whole dataset. My guess is that if I wanted to set a limit, I would break out of the loop after a couple of iterations.

Alternatively, my suggestion would be to define a constant like MAX_ICL_SAMPLES = 10 in the notebook and then define a separate icl_dataset with MAX_ICL_SAMPLES samples, which is distinct from the train_dataset (as I don't think it really helps if the few shot samples and the train dataset overlap). Then the first 10 samples would be used for ICL and the train dataset would start after sample 10 and could be much bigger without blowing up memory. WDYT?
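
In code, that split could look roughly like this (a sketch; the constant name and the size of the training slice are placeholders):

MAX_ICL_SAMPLES = 10  # few-shot examples that go into the context

# The first samples provide the in-context examples; the training set starts
# after them so the two do not overlap.
icl_dataset = dataset['train'].select(range(MAX_ICL_SAMPLES)).map(add_string_labels)
train_dataset = dataset['train'].select(
    range(MAX_ICL_SAMPLES, MAX_ICL_SAMPLES + 1000)
).map(add_string_labels)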

@tsachiblau
Contributor Author

Alternatively, my suggestion would be to define a constant like MAX_ICL_SAMPLES = 10 in the notebook and then define a separate icl_dataset with MAX_ICL_SAMPLES samples, which is distinct from the train_dataset (as I don't think it really helps if the few shot samples and the train dataset overlap). Then the first 10 samples would be used for ICL and the train dataset would start after sample 10 and could be much bigger without blowing up memory. WDYT?

Done.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@BenjaminBossan BenjaminBossan left a comment


Thanks for adding the documentation and example for CPT. Also nice addition of the ICL limit, I could now successfully run the notebook even with 1000 samples and had a lot of memory to spare.

@BenjaminBossan BenjaminBossan merged commit 3f9ce55 into huggingface:main Nov 29, 2024
14 checks passed
@tsachiblau
Contributor Author

Thanks so much again! :)

Currently, the link to the example page is https://github.com/huggingface/peft/blob/main/examples/cpt_finetuning/README.md, which points to the GitHub repository. Could we instead have a link on the official website, similar to this one: https://huggingface.co/docs/peft/main/en/package_reference/cpt?

@BenjaminBossan
Member

It's a pleasure.

We can always make changes like that, but I'm not sure which link exactly you want to change. The links to https://github.com/huggingface/peft/blob/main/examples/cpt_finetuning/README.md are specifically for examples, whereas https://huggingface.co/docs/peft/main/en/package_reference/cpt is the more technical documentation and serves a different purpose. So I don't think changing the link makes sense here. Maybe I'm missing something?

@tsachiblau
Contributor Author

No, it is fine 👍
