resume_from_checkpoint function fails because "There seems to be not a single sample in your epoch_iterator" (#26413)
Update - I added
Checkpoint 3000 is my best checkpoint (according to my
Which is wrong. Am I missing something?
Similar question: for the sake of reproducibility, I would like to be able to resume training from the same batch where I left off in my `IterableDataset`. Is there any way to achieve this behavior? Thanks!
Is there any update from the team on the issues raised above? These issues make it prohibitively expensive or practically impossible to make use of an `IterableDataset`. Alternatively, any advice on working with large datasets without using an `IterableDataset`?
CCing again: @muellerzr @pacman100
Hello, your iterable dataset should reiterate when reaching the end if the number of steps > the number of samples in the iterable dataset. The best example of this is the ConstantLengthDataset from the trl library; the main code snippet is given below for when the dataset is configured to be infinite. Notice the logic in the exception handling that reassigns the iterator once the dataset is exhausted. Hope this helps.
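For readers without trl at hand, the pattern being described looks roughly like the sketch below. It is a condensed adaptation, not verbatim trl code; the class name `ReiteratingDataset` is made up here, and trl's actual `ConstantLengthDataset` does additional packing and tokenization on top of this.

```python
import warnings

from torch.utils.data import IterableDataset


class ReiteratingDataset(IterableDataset):
    """Wraps a finite dataset and restarts its iterator when it runs out."""

    def __init__(self, dataset, infinite=True):
        self.dataset = dataset
        self.infinite = infinite

    def __iter__(self):
        iterator = iter(self.dataset)
        while True:
            try:
                yield next(iterator)
            except StopIteration:
                if self.infinite:
                    # The key step: reassign the iterator so iteration restarts
                    # from the beginning instead of ending the stream.
                    iterator = iter(self.dataset)
                    warnings.warn("Dataset exhausted; restarting the iterator.")
                else:
                    break
```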
This seems like a separate issue. Please open another one with a minimal reproducible example. Currently, the given details aren't enough for us to reproduce this.
Done here: I'm closing this issue because
I'm sorry, but this response doesn't make sense. To clarify, you are technically correct that "your iterable dataset should reiterate when reaching the end." However, the `Trainer` already handles this when not resuming from a checkpoint; it is unclear why resuming from checkpoint causes it to fail to handle this. When not resuming from checkpoint, the training logic is as you expect: if you run out of samples in the current epoch but haven't reached max steps yet, you just start a new epoch until you do reach max steps.
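To make the described non-resumed behaviour concrete, here is a deliberately simplified schematic; it is not the actual `Trainer` implementation, just the shape of the loop being described:

```python
# Schematic only: a fresh (non-resumed) run keeps starting new passes over
# the iterable dataset until max_steps is reached.
def train_schematic(dataset, max_steps: int) -> int:
    completed_steps = 0
    while completed_steps < max_steps:
        for _batch in dataset:  # one "epoch" over the iterable dataset
            # ... forward / backward / optimizer step would happen here ...
            completed_steps += 1
            if completed_steps >= max_steps:
                break
    return completed_steps
```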
This issue should not be closed. I feel there is some fundamental miscommunication happening here, because it seems transparently obvious to me that this is not how this should work. I have had otherwise identical runs where the only difference was whether training was resumed from a checkpoint, and only the resumed runs hit this error.
This is clear, incontrovertible evidence of a bug, since it indicates different training logic is happening depending on whether `resume_from_checkpoint` is used. Let me put it another way: if you agree that `max_steps` may legitimately be set higher than the number of samples in an iterable dataset,
which implies
that num_steps > number of samples is a perfectly normal configuration, then do you agree that this inequality is going to be true in many instances? And yet this is precisely the condition upon which the error triggers, at least according to the error message. When not resuming from checkpoint, this simple mathematical fact poses no problem. It is only when resuming from checkpoint that, for some reason, this inequality poses a conundrum, and that is what makes no sense.
I am just now realizing that the example dataset @pacman100 provides as a working solution is a fully written custom `IterableDataset` class.
Hello @Ubadub, please provide a minimal reproducible example for this, along with the related config, the launch command, and the versions of the libraries.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@pacman100 Hello, I am also facing the same issue as @Ubadub is reporting. Here is my code to reproduce the issue:

```python
import os
import shutil

import transformers
import datasets

if os.path.exists("./output"):
    shutil.rmtree("./output")

def my_generator():
    for i in range(10):
        yield {"input_ids": [1000], "labels": [1000]}

# This dataset yields 10 examples only, but let's set max_steps=20.
dataset = datasets.IterableDataset.from_generator(my_generator)

model = transformers.AutoModelForCausalLM.from_pretrained("gpt2")

args = transformers.TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    max_steps=20,
    save_steps=10,
    report_to="none",
)

trainer = transformers.Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
)
trainer.train()

# Trainer runs 20 steps, producing both checkpoint-10 and checkpoint-20.
assert os.path.exists("./output/checkpoint-10")
assert os.path.exists("./output/checkpoint-20")

# Now remove checkpoint-20 and resume training from checkpoint-10.
shutil.rmtree("./output/checkpoint-20")
trainer = transformers.Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
)
trainer.train(resume_from_checkpoint=True)

# This time, the trainer does nothing; checkpoint-20 is not produced.
assert os.path.exists("./output/checkpoint-10")
assert not os.path.exists("./output/checkpoint-20")
```

output:
When not resuming, the Trainer runs until 20 steps. When resuming from a checkpoint, it stops at 10 steps. This seems inconsistent. As discussed in #26635, I think the correct behavior is the one suggested by the current documentation of `max_steps` (transformers/src/transformers/training_args.py, lines 236 to 239 in 95091e1): the resumed run should continue until `max_steps` is reached.
I'm using Python v3.10.12, transformers==4.36.2, datasets==2.16.1, accelerate==0.26.0, torch==2.1.2.
@muupan Thank you for the minimal example. I had a lot on my plate and was unable to put one together myself, so I ended up scrapping the use of this functionality altogether, but that introduced its own complications, so I would really appreciate a fix for this.
Confirming this issue should not be marked stale and still requires addressing.
Gentle ping @muellerzr @pacman100
Gentle ping @muellerzr @SunMarc
Does anyone have a workaround or a solution for this yet?
My current workaround is making the dataset yield samples infinitely. In my example code, if I replace the definition of `my_generator` with

```python
def my_generator():
    while True:
        for i in range(10):
            yield {"input_ids": [1000], "labels": [1000]}
```

the resumed training continues until the 20th step correctly. However, this workaround has a drawback: one epoch no longer corresponds to a single pass over the 10 examples.
I implemented a solution and opened a PR: #33544
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers version - 4.33.2
I'm using the Trainer API as such, so it pushes the latest checkpoint to the Hugging Face Hub each epoch:
And after about 25 epochs there's some exception (never mind what). So I take the last checkpoint that was saved to the Hub (from here, if it matters), put it on my drive, and change the training code to this:
And rerun the whole notebook. Then, it prints (after some time, not immediately):
And then fails.
I do have an `IterableDataset` with 2000 training videos, and I'm using batch size 8 and want to run for 50 epochs, so I'm pretty sure 12500 is (2000/8)*50, but I still don't understand the message. Why is it problematic that num_steps (12500) > number of samples (2000)? Thank you!
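Just to sanity-check those numbers (a throwaway snippet, not part of the original report), the arithmetic works out as follows:

```python
num_samples = 2000      # training videos in the IterableDataset
batch_size = 8
num_epochs = 50

steps_per_epoch = num_samples // batch_size   # 250
total_steps = steps_per_epoch * num_epochs    # 12500

# total_steps (12500) is indeed far larger than num_samples (2000), which is
# exactly the condition the error message complains about when resuming.
print(steps_per_epoch, total_steps)
```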
Who can help?
@muellerzr
@pacman100
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I can't really provide one for my own code, but it is based on your guide, and I believe the issue will reproduce for that as well.
Expected behavior
Continuing the training from the same state where it stopped.