[SFTTrainer] Fix non packed dataset #444
Conversation
```python
output_texts = []
for i in range(len(example['question'])):
    text = f"### Question: {example['question'][i]}\n ### Answer: {example['answer'][i]}"
    output_texts.append(text)
return output_texts
```
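For context, here is a minimal end-to-end sketch of how such a batched formatting function is passed to `SFTTrainer`. The dataset, model name, and function name are illustrative and not taken from the PR:

```python
from datasets import load_dataset
from trl import SFTTrainer

# Assumed example dataset with "question" and "answer" columns (e.g. GSM8K).
dataset = load_dataset("gsm8k", "main", split="train")

def formatting_prompts_func(example):
    # `example` is a *batch* of rows because dataset.map(..., batched=True)
    # is used under the hood, so we must return one string per row.
    output_texts = []
    for i in range(len(example["question"])):
        text = f"### Question: {example['question'][i]}\n ### Answer: {example['answer'][i]}"
        output_texts.append(text)
    return output_texts

trainer = SFTTrainer(
    "facebook/opt-350m",                  # assumed small model for illustration
    train_dataset=dataset,
    formatting_func=formatting_prompts_func,
    max_seq_length=512,
    packing=False,                        # the non-packed path this PR fixes
)
trainer.train()
```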
Oh interesting. So previously, we were dumping an entire dataset to the prompt?
Exactly, yes :D The previous examples in the documentation were wrong and we were dumping the entire mini-batch when processing the dataset .. :/
```python
outputs = tokenizer(
    element[dataset_text_field] if not use_formatting_func else formatting_func(element),
    truncation=True,
    padding=True,
```
@younesbelkada this code is still incorrect - consider the case where all samples in the dataset are shorter than `max_seq_len`. Each batch will be padded to the largest element in the batch, but no data will pass the `if length == max_seq_len` check below.
Perhaps:
padding='max_length',
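To illustrate the concern, here is a small self-contained sketch (tokenizer choice and sentence lengths are arbitrary) comparing the two padding strategies:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # assumed tokenizer for illustration
tokenizer.pad_token = tokenizer.eos_token           # GPT-2 has no pad token by default
max_seq_len = 16

batch = ["a short sample", "another slightly longer sample in this batch"]

# padding=True only pads to the longest sequence in the batch, so if every
# sample is shorter than max_seq_len, no length ever equals max_seq_len and
# the `if length == max_seq_len` filter discards the whole dataset.
dynamic = tokenizer(batch, truncation=True, padding=True, max_length=max_seq_len)
print([len(ids) for ids in dynamic["input_ids"]])   # e.g. [7, 7] — both below max_seq_len

# padding="max_length" always pads to max_seq_len, so the check passes.
fixed = tokenizer(batch, truncation=True, padding="max_length", max_length=max_seq_len)
print([len(ids) for ids in fixed["input_ids"]])     # [16, 16]
```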
Yes, you are correct, thanks a lot for flagging. Do you want to open a PR for that? Happy to do it otherwise.
Will this fine-tune the complete model end-to-end, or will this example fine-tune just a portion of it, as with LoRA?
The above example will train the full model, but there are also options to use LoRA.
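For reference, a minimal sketch of the LoRA path via a `peft_config` (assuming `peft` is installed; hyperparameters and model name are illustrative):

```python
from peft import LoraConfig
from trl import SFTTrainer

peft_config = LoraConfig(
    r=16,                       # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    "facebook/opt-350m",        # assumed model for illustration
    train_dataset=dataset,      # prepared as in the earlier example
    formatting_func=formatting_prompts_func,
    peft_config=peft_config,    # only the LoRA adapter weights are trained
    max_seq_length=512,
    packing=False,
)
trainer.train()
```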
Is this merged into main and then reverted? Because padding is still False in `_prepare_non_packed_dataloader` in main (0.7.9).
Actually, padding was turned off by this PR: #512. Now running SFT on Alpaca gives:
Any update on this issue?
What does this PR do?
This PR properly educates users on how to correctly use the `formatting_func` method when using a non-packed dataset. Since the dataset processing calls `dataset.map(xxx, batched=True)` under the hood, it is important to return an array of processed texts so that all texts from the dataset example batch are processed; otherwise this leads to silent bugs that are hard to understand, such as the one described in #439.

The PR adds a sanity check when processing the dataset, adds the argument `padding=True` so the tokenizer always returns sequences of length `max_seq_len`, and correctly appends the attention mask to the output dataset as well.
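Roughly, the non-packed preparation step described above looks like the following simplified sketch (based on the snippet quoted in the review, not the exact library code; `tokenizer`, `dataset_text_field`, `use_formatting_func`, and `max_seq_len` are assumed to be in scope):

```python
def tokenize(element):
    # With batched=True, `element` holds lists of values for each column,
    # so formatting_func must return a list of strings (one per example).
    outputs = tokenizer(
        element[dataset_text_field] if not use_formatting_func else formatting_func(element),
        truncation=True,
        padding=True,            # see the review above: padding="max_length" is the safer choice
        max_length=max_seq_len,
        return_length=True,
    )

    input_ids, attention_masks = [], []
    for length, ids, mask in zip(
        outputs["length"], outputs["input_ids"], outputs["attention_mask"]
    ):
        # Sanity check: only keep sequences padded/truncated to max_seq_len.
        if length == max_seq_len:
            input_ids.append(ids)
            attention_masks.append(mask)

    return {"input_ids": input_ids, "attention_mask": attention_masks}

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
```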