Number of steps per epoch #5449
This should work just fine.
Note: the details differ slightly if you pass the train/val dataloaders or a datamodule directly into the `Trainer`. Here is a good approximation for the total number of steps (a sketch follows below).
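A minimal sketch of one such approximation, assuming a single device, no `limit_train_batches`, and that the LightningModule builds its own train dataloader:

```python
# Rough approximation of the total number of optimizer steps: batches per
# epoch, reduced by gradient accumulation, times the number of epochs.
steps_per_epoch = len(self.train_dataloader()) // self.trainer.accumulate_grad_batches
total_steps = steps_per_epoch * self.trainer.max_epochs
```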
Closing this issue as the answer should work out. Feel free to re-open it if it doesn't solve your problem.
When using the suggested approach, I get:
```
Traceback (most recent call last):
File "xCoFormer.py", line 169, in perform_tasks
fit(hparams)
File "xCoFormer.py", line 84, in fit
trainer.fit(model, datamodule=dm)
File "/home/celso/projects/venvs/xCoFormer/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 456, in fit
self.accelerator_backend.setup(model)
File "/home/celso/projects/venvs/xCoFormer/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 52, in setup
self.setup_optimizers(model)
File "/home/celso/projects/venvs/xCoFormer/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 145, in setup_optimizers
optimizers, lr_schedulers, optimizer_frequencies = self.trainer.init_optimizers(model)
File "/home/celso/projects/venvs/xCoFormer/lib/python3.7/site-packages/pytorch_lightning/trainer/optimizers.py", line 30, in init_optimizers
optim_conf = model.configure_optimizers()
File "/home/celso/projects/xCoFormer/source/model/JointEncoder.py", line 44, in configure_optimizers
print("num_batches: ", len(self.train_dataloader()) / self.trainer.accumulate_grad_baches)
AttributeError: 'Trainer' object has no attribute 'accumulate_grad_baches'
```
Did you mean to create a new issue?
@celsofranssa it's `accumulate_grad_batches`, not `accumulate_grad_baches`.
I got it, thanks!
```python
# default used by the Trainer
trainer = Trainer(limit_train_batches=1.0)
```
@tchaton I don't think the `limit_train_batches` default answers the question here.
I just stumbled upon the same problem and tried the suggestions above; I think there is a misunderstanding about what they actually compute. Then again, I'm not opposed to a convenience function/parameter for this.
I was referring to the earlier suggestion.
Oh, that makes sense.
```python
@property
def num_training_steps(self) -> int:
    """Total training steps inferred from datamodule and devices."""
    if self.trainer.max_steps:
        return self.trainer.max_steps

    # honor limit_train_batches: int = absolute batch count, float = fraction
    limit_batches = self.trainer.limit_train_batches
    batches = len(self.train_dataloader())
    batches = min(batches, limit_batches) if isinstance(limit_batches, int) else int(limit_batches * batches)

    # batches are sharded across devices (GPUs, processes or TPU cores)
    num_devices = max(1, self.trainer.num_gpus, self.trainer.num_processes)
    if self.trainer.tpu_cores:
        num_devices = max(num_devices, self.trainer.tpu_cores)

    effective_accum = self.trainer.accumulate_grad_batches * num_devices
    return (batches // effective_accum) * self.trainer.max_epochs
```

@tsteffek @KevinMusgrave this should work — please let me know if I missed any edge cases and I'll update.
@rohitgr7 Thanks, that seems to work 👍
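For reference, a sketch of how a property like `num_training_steps` above could be wired into `configure_optimizers` with OneCycleLR; the optimizer choice and the `self.lr` hyperparameter are assumptions for illustration:

```python
import torch

def configure_optimizers(self):
    optimizer = torch.optim.AdamW(self.parameters(), lr=self.lr)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=self.lr,
        total_steps=self.num_training_steps,  # property defined above
    )
    # "interval": "step" makes Lightning step the scheduler every optimizer step
    return {
        "optimizer": optimizer,
        "lr_scheduler": {"scheduler": scheduler, "interval": "step"},
    }
```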
So I've been thinking that I've seen that functionality somewhere, and indeed there is a `trainer.num_training_batches`. Caveat: it's 0 at the time of `configure_optimizers`, since the dataloaders haven't been loaded yet. Leaving that nugget of knowledge here in case it helps someone.
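One possible workaround for that caveat, sketched under the assumption that a LightningDataModule is passed to `trainer.fit`: compute the count in `setup()`, which runs before `configure_optimizers`, and stash it on the module:

```python
def setup(self, stage=None):
    # setup("fit") runs after the datamodule is attached but before
    # configure_optimizers, so the dataloader length is available here.
    if stage == "fit":
        train_loader = self.trainer.datamodule.train_dataloader()
        self.total_training_steps = (
            len(train_loader)
            // self.trainer.accumulate_grad_batches
            * self.trainer.max_epochs
        )
```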
Does this still hold for multi-node training? I think the device count above might not account for multiple nodes.
The above functions did not yield the correct number of steps per epoch for me, so I dug into the source code of progress.py.
Now the function returns the same number of steps per epoch as the progress bar and can be invoked from `configure_optimizers`.
Hey @cowwoc, I am not sure you should be using the progress bar, as it counts the total number of batches across processes and adds validation. I believe the previous implementation was more accurate. Best,
@tchaton Is it wrong to include the number of validation steps? Doesn't the global step advance during validation as well?
No, it doesn't.
@tchaton Okay, thanks for the clarification. Are there any plans to add this functionality (getting the number of steps per epoch or total number of steps) into the official API? Also, what about @igolan89's question about multi-node training?
I think it might be a bit risky to add such a functionality, since it calls the train dataloader before training actually starts.
To which I say... If the committers can't figure out how to implement this safely, what chance do end-users have? :)
@tchaton do you mind reopening this issue? Since this seems to be such a non-trivial problem, perhaps Lightning should provide this functionality out of the box.
Dear @RuRo, done. We will re-consider this feature, but it is hard to estimate correctly due to varying dataloader lengths, iterators, etc. Best,
@tchaton Thanks. There are a few potential solutions that I would personally be okay with.
I am not sure about the exact implementation, though.
@rohitgr7
@eladsegal are you not using the snippet above?
@rohitgr7 I just tried to use the code you provided and got an error halfway through training. I use `dp`, and I think I remember reading that `dp` is handled differently here.
@RuRo maybe. With `dp` I believe the total training steps don't scale with the number of devices, so the above code might not be correct for that use case. Also, it's not robust, since I wrote it a long time ago. We are thinking of addressing this issue internally, although community support/suggestions are always helpful :)
Watch out: version 1.5.0 changes the default value of `max_steps` from `None` to `-1`, so the `if self.trainer.max_steps:` check above has to be changed (see the sketch below).
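A sketch of the adjusted guard for the property above under that 1.5.0 change, assuming `-1` remains the sentinel for "no limit":

```python
# max_steps defaults to -1 (not None) from Lightning 1.5.0 onward, so a plain
# truthiness check would wrongly return -1 here.
if self.trainer.max_steps and self.trainer.max_steps != -1:
    return self.trainer.max_steps
```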
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
Marking to keep it alive.
Any solution? I am getting this error:
ValueError: Tried to step 1392 times. The specified number of total steps is 1390.
If I add it there, would it not increase the computational time by loading the data twice?
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
Not stale.
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
See related PR: #11599
Hello @rohitgr7
Fixed with #11599
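For what it's worth, assuming #11599 is the change that introduced `Trainer.estimated_stepping_batches` (available in recent Lightning releases), a sketch of how it can be used directly in `configure_optimizers`:

```python
import torch

def configure_optimizers(self):
    optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
    # estimated_stepping_batches already accounts for accumulate_grad_batches,
    # limit_train_batches and the number of devices.
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=0.1,
        total_steps=self.trainer.estimated_stepping_batches,
    )
    return [optimizer], [{"scheduler": scheduler, "interval": "step"}]
```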
@carmocca does this work with multi-node training?
It should.
For newer Lightning versions, I use something like:
```python
def num_steps(self) -> int:
    """Get number of steps"""
    # Accessing _data_source is flaky and might break
    dataset = self.trainer.fit_loop._data_source.dataloader()
    dataset_size = len(dataset)
    num_devices = max(1, self.trainer.num_devices)
    num_steps = dataset_size * self.trainer.max_epochs // (self.trainer.accumulate_grad_batches * num_devices)
    return num_steps
```
@ayansengupta17 In your example, I suggest you access the dataloader through a public attribute rather than the private `_data_source`.
I am using this approach.
It is important to return the scheduler in `configure_optimizers`, and then it works superbly!
I can confirm that it worked perfectly!
Some learning rate schedulers, such as OneCycleLR, require the number of steps per epoch. How can I get the number of steps within the `configure_optimizers(self)` scope?
Note: the training data is given during `Trainer` instantiation.