
BOFT bug fix when saving #1994

Merged (9 commits) on Aug 7, 2024
Conversation

Zeju1997 (Contributor) commented Aug 7, 2024

Fixes a non-contiguous tensor error when saving the model after merge_and_unload().

BenjaminBossan (Member)

Thanks for adding this fix. Do you have an example where lack of contiguity leads to an error?

Zeju1997 (Contributor, Author) commented Aug 7, 2024

> Thanks for adding this fix. Do you have an example where lack of contiguity leads to an error?

Sure, here is where I noticed the error. A small code snippet for the training step:

    import os

    from peft import BOFTConfig
    from transformers import AutoModelForCausalLM
    from trl import SFTConfig, SFTTrainer

    oft_config = BOFTConfig(
        boft_block_size=args.block_size, # 32
        boft_n_butterfly_factor=args.n_butterfly_factor, # 1
        boft_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )

    train_data.start_iteration = 0

    print("Starting main loop")

    training_args = SFTConfig(
        output_dir=args.output_dir,
        dataloader_drop_last=True,
        eval_strategy="steps",
        num_train_epochs=args.num_train_epochs,
        eval_steps=args.eval_freq,
        save_strategy=args.save_strategy,
        logging_steps=args.log_freq,
        per_device_train_batch_size=args.batch_size,
        per_device_eval_batch_size=args.batch_size,
        learning_rate=args.learning_rate,
        lr_scheduler_type=args.lr_scheduler_type,
        warmup_steps=args.num_warmup_steps,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        gradient_checkpointing=args.gradient_checkpointing,
        fp16=args.fp16,
        bf16=args.bf16,
        weight_decay=args.weight_decay,
        run_name="llama-7b-finetuned",
        report_to="wandb",
        ddp_find_unused_parameters=False,
        disable_tqdm=False,
        max_seq_length=args.seq_length,
        dataset_text_field = "text",
    )

    model = AutoModelForCausalLM.from_pretrained(
        args.model_path, 
    )

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=train_data,
        peft_config=oft_config,
        packing=False,
    )

    print_trainable_parameters(trainer.model)

    trainer.train()
    
    print("Saving last checkpoint of the model")

    trainer.model = trainer.model.merge_and_unload()
    trainer.save_model(os.path.join(args.output_dir, str(args.num_train_epochs)))
The error occurs at trainer.save_model: it states that some tensors are not contiguous. I noticed that this is caused by the torch.transpose() and torch.mm() operations performed in the merge() function.
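The cause described above can be reproduced in isolation. This is a minimal sketch, not the PEFT code itself: an identity matrix stands in for the real butterfly factor, but the sequence of operations mirrors merge(). The final transpose returns a strided view, which is what the save step later rejects.

```python
import torch

orig_weight = torch.randn(4, 8)     # stand-in for a (out_features, in_features) weight
butterfly_oft_mat = torch.eye(8)    # stand-in for the real butterfly factor

out = torch.transpose(orig_weight, 0, 1)  # view with swapped strides
out = torch.mm(butterfly_oft_mat, out)    # mm allocates a fresh, contiguous tensor
out = torch.transpose(out, 0, 1)          # but the final transpose is a view again
print(out.is_contiguous())                # False: this is what breaks saving

out = out.contiguous()                    # the fix applied in this PR
print(out.is_contiguous())                # True
```

torch.save tolerates such views, but safetensors refuses to serialize them, which is why the error only surfaced at save time.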

BenjaminBossan (Member) left a comment

> sure, here is where I noticed the error, a small code snippet for training step:

Thanks, I can confirm that this raises the error. Let's add a unit test to ensure that this works correctly. I think the best location would be after this line:

assert torch.allclose(logits_peft, logits_unloaded, atol=atol, rtol=rtol)

Here is the code that I used:

        # serializing works without errors
        with tempfile.TemporaryDirectory() as tmp_dirname:
            # serializing with torch.save works
            torch.save(model_unloaded.state_dict(), os.path.join(tmp_dirname, "model.bin"))

            # serializing with safetensors works
            save_file(model_unloaded.state_dict(), os.path.join(tmp_dirname, "model.safetensors"))

The save_file function must be imported: from safetensors.torch import save_file.

When I ran these extended tests, I noticed that OFT is also failing. However, this can be resolved in the same way in this line:

base_layer.weight.data = new_weights

-                base_layer.weight.data = new_weights
+                base_layer.weight.data = new_weights.contiguous()

Would you be so kind to fix that too?

    orig_weight = torch.transpose(orig_weight, 0, 1)
    orig_weight = torch.transpose(orig_weight, 0, 1).contiguous()
    orig_weight = torch.mm(butterfly_oft_mat, orig_weight).contiguous()
    orig_weight = torch.transpose(orig_weight, 0, 1).contiguous()

BenjaminBossan (Member) commented on these lines:
I wonder if we can simplify this by only calling .contiguous() at the very last step, i.e. line 532:

- self.base_layer.weight.data = orig_weight
+ self.base_layer.weight.data = orig_weight.contiguous()

Same for line 532. And since there is also Conv2d, should the call be added there as well? So that would be lines 820 and 834.
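The suggested simplification can be sanity-checked in isolation. In this sketch (an identity matrix stands in for the real butterfly factor), the intermediates are left as views and only the final result is made contiguous; the values match the call-it-every-step variant.

```python
import torch

torch.manual_seed(0)
orig_weight = torch.randn(6, 10)     # stand-in (out_features, in_features) weight
butterfly_oft_mat = torch.eye(10)    # stand-in square factor

# Variant A: .contiguous() after every intermediate step
a = torch.transpose(orig_weight, 0, 1).contiguous()
a = torch.mm(butterfly_oft_mat, a).contiguous()
a = torch.transpose(a, 0, 1).contiguous()

# Variant B: leave intermediates as views; only the final result is made contiguous
b = torch.transpose(orig_weight, 0, 1)
b = torch.mm(butterfly_oft_mat, b)
b = torch.transpose(b, 0, 1).contiguous()

print(torch.allclose(a, b), b.is_contiguous())  # the two variants agree
```

Since .contiguous() only changes memory layout, never values, deferring it to the final assignment is safe and avoids two intermediate copies.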

Zeju1997 (Contributor, Author) commented Aug 7, 2024

Hi, thanks for the comment. I updated the unit test and also changed self.base_layer.weight.data = orig_weight.contiguous() for both BOFT and OFT. Best,

BenjaminBossan (Member) left a comment

Thanks for the updates. Just two very small change requests, otherwise this looks good.

@@ -763,6 +763,10 @@ def _test_safe_merge(self, model_id, config_cls, config_kwargs):
# check that the logits are the same after unloading
assert torch.allclose(logits_peft, logits_unloaded, atol=atol, rtol=rtol)

# serializing with safetensors works
from safetensors.torch import save_file
BenjaminBossan (Member):

Let's move this import to the top of the file.

@@ -763,6 +763,10 @@ def _test_safe_merge(self, model_id, config_cls, config_kwargs):
# check that the logits are the same after unloading
assert torch.allclose(logits_peft, logits_unloaded, atol=atol, rtol=rtol)

# serializing with safetensors works
BenjaminBossan (Member):

Suggested change
# serializing with safetensors works
# Ensure that serializing with safetensors works, there was an error when weights were not contiguous

Zeju1997 (Contributor, Author) commented Aug 7, 2024

Can you check again? I noticed that in the previous commit I did not copy the whole unit test. Best.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Zeju1997 (Contributor, Author) commented Aug 7, 2024

@BenjaminBossan What do I need to change in testing_common.py?

    ruff check src tests examples docs scripts docker
    All checks passed!
    ruff format --check src tests examples docs scripts docker
    Would reformat: tests/testing_common.py
    1 file would be reformatted, 188 files already formatted
    make: *** [Makefile:10: quality] Error 1
    Error: Process completed with exit code 2.

BenjaminBossan (Member)

I think your last commit was the missing piece 🤞

Zeju1997 (Contributor, Author) commented Aug 7, 2024

@BenjaminBossan Hi, the tests are still failing. Are they because of my added code?

BenjaminBossan (Member)

> Hi, the tests are still failing, are they because of my added code?

No, I don't think so, this looks more like a HF Hub issue. This can sometimes happen, but not with the whole test matrix. I'll restart later, don't worry about it.

BenjaminBossan (Member) left a comment

Thanks for the fixes to merging BOFT and OFT.

@BenjaminBossan BenjaminBossan merged commit 9988cb9 into huggingface:main Aug 7, 2024
14 checks passed