[SFTTrainer] Fix non packed dataset #444

Merged: 3 commits merged into main from fix-sft-dataset on Jun 16, 2023

Conversation

@younesbelkada (Contributor) commented Jun 16, 2023

What does this PR do?

This PR properly educates users on how to correctly use the formatting_func argument when training on a non-packed dataset.
Since the dataset processing calls dataset.map(xxx, batched=True) under the hood, the formatting function must return an array of processed texts so that every text in the example batch is handled; otherwise it leads to silent, hard-to-understand bugs such as the one described in #439:

from datasets import load_dataset
from trl import SFTTrainer
import transformers

dataset = load_dataset("tatsu-lab/alpaca", split="train")

model = transformers.AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/opt-350m")

def formatting_prompts_func(examples):
    # Must return a list of formatted texts, one per example in the batch,
    # because the dataset is mapped with batched=True.
    output_text = []
    for i in range(len(examples["instruction"])):
        instruction = examples["instruction"][i]
        input_text = examples["input"][i]
        response = examples["output"][i]

        if len(input_text) >= 2:
            text = f'''Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input_text}

### Response:
{response}
'''
        else:
            text = f'''Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{response}
'''
        output_text.append(text)

    return output_text

trainer = SFTTrainer(
    model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    formatting_func=formatting_prompts_func,
    max_seq_length=256,
    packing=False,
)

trainer.train()

The PR adds a sanity check when processing the dataset and adds the argument padding=True so that sequences of length max_seq_len are always returned; it also correctly appends the attention mask to the output dataset.
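
For context, a rough sketch of what the described mapping step looks like conceptually; the names use_formatting_func and dataset_text_field follow the snippet quoted later in this thread, and the rest is illustrative rather than the exact PR diff:

def tokenize(element):
    # Sanity check: with dataset.map(..., batched=True), the formatting
    # function must return a list of strings, one per example in the batch.
    if use_formatting_func and not isinstance(formatting_func(element), list):
        raise ValueError(
            "The formatting_func should return a list of processed strings "
            "when packing=False, since the dataset is mapped in batches."
        )

    outputs = tokenizer(
        element[dataset_text_field] if not use_formatting_func else formatting_func(element),
        truncation=True,
        padding=True,
        max_length=max_seq_len,
    )
    return {"input_ids": outputs["input_ids"], "attention_mask": outputs["attention_mask"]}

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)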

@HuggingFaceDocBuilderDev commented Jun 16, 2023

The documentation is not available anymore as the PR was closed or merged.

@younesbelkada merged commit d1ad540 into main on Jun 16, 2023
@younesbelkada deleted the fix-sft-dataset branch on Jun 16, 2023 at 16:51
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Comment on lines +68 to +72
output_texts = []
for i in range(len(example['question'])):
    text = f"### Question: {example['question'][i]}\n ### Answer: {example['answer'][i]}"
    output_texts.append(text)
return output_texts
Contributor

Oh interesting. So previously, we were dumping an entire dataset to the prompt?

Contributor Author (@younesbelkada)

Exactly, yes :D The previous examples in the documentation were wrong, and we were dumping entire mini-batches into the prompt when processing the dataset. :/

outputs = tokenizer(
    element[dataset_text_field] if not use_formatting_func else formatting_func(element),
    truncation=True,
    padding=True,


@younesbelkada this code is still incorrect - consider the case where all samples in the dataset are shorter than max_seq_len. Each batch will be padded to the longest element in the batch, but no data will pass the length == max_seq_len check below.

Perhaps:

    padding='max_length',
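
A quick toy illustration of the difference between the two padding modes (not part of the original comment; the checkpoint and the max length of 256 are only for demonstration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
batch = ["short example", "a slightly longer example"]

# padding=True pads only to the longest sequence in the batch, so every
# sample stays well below max_seq_len and fails a length == max_seq_len check.
dynamic = tokenizer(batch, truncation=True, padding=True, max_length=256)

# padding="max_length" always pads to max_length, so the check passes.
fixed = tokenizer(batch, truncation=True, padding="max_length", max_length=256)

print(len(dynamic["input_ids"][0]))  # a handful of tokens
print(len(fixed["input_ids"][0]))    # 256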

Contributor Author (@younesbelkada)

Yes, you are correct, thanks a lot for flagging. Do you want to open a PR for that? Happy to do it otherwise.

@ahmadmustafaanis

Will this fine-tune the complete model end-to-end, or will this example fine-tune just a portion of it, as in LoRA?

@lvwerra (Member) commented Aug 2, 2023

The above example will train the full model, but there are also options to use LoRA.
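
For completeness, a minimal sketch of the LoRA option (hyperparameters are illustrative; it reuses the model, tokenizer, dataset, and formatting function from the example above):

from peft import LoraConfig

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    formatting_func=formatting_prompts_func,
    max_seq_length=256,
    packing=False,
    peft_config=peft_config,  # only the LoRA adapter weights are trained
)

trainer.train()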

@hy-chen commented Jan 18, 2024

Was this merged into main and then reverted? Padding is still False in _prepare_non_packed_dataloader on main (0.7.9).

@hy-chen commented Jan 18, 2024

Actually, padding was turned off by this PR: #512

Now running SFT on alpaca gives:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (labels in this case) have excessive nesting (inputs type list where type int is expected).

@haochuan-li

Any update on this issue?
