
Number of steps per epoch #5449

Closed
celsofranssa opened this issue Jan 10, 2021 · 46 comments
Labels
question (Further information is requested), waiting on author (Waiting on user action, correction, or update)

Comments

@celsofranssa

Some learning rate schedulers, such as OneCycleLR, require the number of steps per epoch.

How, then, can one get the number of steps within the configure_optimizers(self) scope?

Note: Training data is provided when calling fit:

# training
dm.setup('fit')
trainer.fit(model, datamodule=dm)
celsofranssa added the 'question' label Jan 10, 2021
@tchaton
Contributor

tchaton commented Jan 11, 2021

This should work just fine.

def configure_optimizers(self):
    num_batches = len(self.train_dataloader()) / self.trainer.accumulate_grad_batches
    ...
    return [optim, ...], [scheduler, ...]

Note: If you pass train/val dataloaders or a datamodule directly into the .fit function, Lightning will override the train_dataloader() method with the provided one, so it should work fine.

Here is a good approximation for the total number of steps.

    @property
    def num_training_steps(self) -> int:
        """Total training steps inferred from datamodule and devices."""
        dataset = self.train_dataloader()
        if self.trainer.max_steps:
            return self.trainer.max_steps

        dataset_size = (
            self.trainer.limit_train_batches
            if self.trainer.limit_train_batches != 0
            else len(dataset)
        )

        num_devices = max(1, self.trainer.num_gpus, self.trainer.num_processes)
        if self.trainer.tpu_cores:
            num_devices = max(num_devices, self.trainer.tpu_cores)

        effective_batch_size = dataset.batch_size * self.trainer.accumulate_grad_batches * num_devices
        return (dataset_size // effective_batch_size) * self.trainer.max_epochs

Closing this issue as the answer should work out. Feel free to re-open it if it doesn't solve your problem.

tchaton added the 'waiting on author' label Jan 11, 2021
tchaton closed this as completed Jan 11, 2021
@celsofranssa
Author

When using num_batches = len(self.train_dataloader()) / self.trainer.accumulate_grad_baches, the following error happens:

Traceback (most recent call last):
  File "xCoFormer.py", line 169, in perform_tasks
    fit(hparams)
  File "xCoFormer.py", line 84, in fit
    trainer.fit(model, datamodule=dm)
  File "/home/celso/projects/venvs/xCoFormer/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 456, in fit
    self.accelerator_backend.setup(model)
  File "/home/celso/projects/venvs/xCoFormer/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 52, in setup
    self.setup_optimizers(model)
  File "/home/celso/projects/venvs/xCoFormer/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 145, in setup_optimizers
    optimizers, lr_schedulers, optimizer_frequencies = self.trainer.init_optimizers(model)
  File "/home/celso/projects/venvs/xCoFormer/lib/python3.7/site-packages/pytorch_lightning/trainer/optimizers.py", line 30, in init_optimizers
    optim_conf = model.configure_optimizers()
  File "/home/celso/projects/xCoFormer/source/model/JointEncoder.py", line 44, in configure_optimizers
    print("num_batches: ", len(self.train_dataloader()) / self.trainer.accumulate_grad_baches)
AttributeError: 'Trainer' object has no attribute 'accumulate_grad_baches'

@celsofranssa
Author

Closing this issue as the answer should work out. Feel free to re-open it if it doesn't solve your problem.

Did you mean to create a new issue?

@rohitgr7
Contributor

rohitgr7 commented Jan 11, 2021

@celsofranssa it's self.trainer.accumulate_grad_batches. Just a typo there.

@celsofranssa
Author

I got it, thanks!

@celsofranssa
Author

From the num_training_steps snippet above:

    dataset_size = (
        self.trainer.limit_train_batches
        if self.trainer.limit_train_batches != 0
        else len(dataset)
    )

dataset_size is being set to 1.0 instead of the real number of batches in the dataloader because:

# default used by the Trainer
trainer = Trainer(limit_train_batches=1.0)

@KevinMusgrave

@tchaton I don't think the num_training_steps function works. As @celsofranssa pointed out, dataset_size gets set to 1, so the function returns 0 because (dataset_size // effective_batch_size) equals 0.

@tsteffek

tsteffek commented Feb 5, 2021

I just stumbled upon the same problem, and tried len(self.train_dataloader()) // self.trainer.accumulate_grad_batches. Works for me.

Your misunderstanding is that dataset_size sets the size to 1. Note that the default is 1.0, a float. Floats will be used to scale the actual size: The default is therefore actual_size * 1.0.

Then again, I'm not opposed to a convenience function/parameter steps_per_epoch.
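tsteffek's point can be made concrete with a tiny helper mirroring the documented limit_train_batches semantics (an int caps the batch count, a float scales it); the function name here is illustrative:

```python
def resolve_train_batches(total_batches, limit):
    """Mimic limit_train_batches: a float scales the dataloader
    length, an int caps it."""
    if isinstance(limit, float):
        return int(total_batches * limit)
    return min(total_batches, limit)
```

With the Trainer default of 1.0, resolve_train_batches(100, 1.0) is 100, not 1, which is the source of the confusion above.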

@KevinMusgrave

KevinMusgrave commented Feb 5, 2021

I was referring to the num_training_steps function posted above.

@tsteffek

tsteffek commented Feb 5, 2021

Oh, that dataset_size. Same name everywhere, it's too confusing. Sorry then.

@rohitgr7
Contributor

rohitgr7 commented Feb 5, 2021

@property
def num_training_steps(self) -> int:
    """Total training steps inferred from datamodule and devices."""
    if self.trainer.max_steps:
        return self.trainer.max_steps

    limit_batches = self.trainer.limit_train_batches
    batches = len(self.train_dataloader())
    batches = min(batches, limit_batches) if isinstance(limit_batches, int) else int(limit_batches * batches)     

    num_devices = max(1, self.trainer.num_gpus, self.trainer.num_processes)
    if self.trainer.tpu_cores:
        num_devices = max(num_devices, self.trainer.tpu_cores)

    effective_accum = self.trainer.accumulate_grad_batches * num_devices
    return (batches // effective_accum) * self.trainer.max_epochs

@tsteffek @KevinMusgrave this should work? Please let me know if I missed any edge cases and I will update.
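For testing the arithmetic in isolation, the same computation can be written as a pure function of plain values (a sketch; the parameter names mirror the Trainer attributes used above):

```python
def estimate_total_steps(batches, max_epochs, limit_train_batches=1.0,
                         accumulate_grad_batches=1, num_devices=1, max_steps=None):
    # Mirrors the property above: honor max_steps if set, apply
    # limit_train_batches (int caps, float scales), then divide by the
    # effective accumulation across devices.
    if max_steps:
        return max_steps
    if isinstance(limit_train_batches, int):
        batches = min(batches, limit_train_batches)
    else:
        batches = int(limit_train_batches * batches)
    effective_accum = accumulate_grad_batches * num_devices
    return (batches // effective_accum) * max_epochs
```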

@KevinMusgrave

@rohitgr7 Thanks, that seems to work 👍

@tsteffek

tsteffek commented Feb 5, 2021

So I've been thinking that I've seen that functionality somewhere, and indeed there is a trainer.num_training_batches.

Caveat: it's 0 at the time of configure_optimizers, sadly.

Leaving that nugget of knowledge in case that helps someone.

@izikgo

izikgo commented Jun 25, 2021

Regarding rohitgr7's num_training_steps snippet above:

Does this still hold for multi-node training? I think self.trainer.num_gpus only registers number of gpus per host, not the global amount. num_devices should probably be multiplied by self.trainer.num_nodes

@cowwoc
Contributor

cowwoc commented Sep 3, 2021

The above functions did not yield the correct number of steps per epoch for me, so I dug into the source code of progress.py (on_train_epoch_start(self, trainer, pl_module)) and came up with this:

    # Assumes `import math` and that the module defines
    # self.train_dataset, self.val_dataset and self.batch_size.
    @property
    def total_train_batches(self) -> int:
        """
        The total number of training batches during training, which may change from epoch to epoch.
        """
        return math.ceil(len(self.train_dataset) / self.batch_size)

    @property
    def total_val_batches(self) -> int:
        """
        The total number of validation batches during validation, which may change from epoch to epoch.
        """
        total_val_batches = 0
        if self.trainer.enable_validation:
            is_val_epoch = (self.trainer.current_epoch + 1) % self.trainer.check_val_every_n_epoch == 0
            total_val_batches = math.ceil(len(self.val_dataset) / self.batch_size) if is_val_epoch else 0
        return total_val_batches

    @property
    def training_steps_per_epoch(self) -> int:
        """
        The number of training steps per epoch.

        Taken from progress.py on_train_epoch_start(self, trainer, pl_module)
        """
        total_train_batches = self.total_train_batches
        total_val_batches = self.total_val_batches

        # val can be checked multiple times per epoch
        val_check_batch = int(total_train_batches * self.trainer.val_check_interval)
        val_check_batch = max(1, val_check_batch)
        val_checks_per_epoch = total_train_batches // val_check_batch

        total_val_batches = total_val_batches * val_checks_per_epoch
        return total_train_batches + total_val_batches

Now the function returns the same number of steps per epoch as the progress bar and can be invoked from configure_optimizers(). I hope this helps someone else.

@tchaton
Contributor

tchaton commented Sep 3, 2021

Hey @cowwoc,

I am not sure you should be using the progress bar as it counts the total number of batches across processes + adds validation.

I believed the previous implementation was more accurate.

Best,
T.C

@cowwoc
Contributor

cowwoc commented Sep 3, 2021

@tchaton Is it wrong to include the number of validation steps? Doesn't lr_scheduler.step() get invoked during validation as well?

@tchaton
Contributor

tchaton commented Sep 3, 2021

No, it doesn't.

@cowwoc
Contributor

cowwoc commented Sep 3, 2021

@tchaton Okay, thanks for the clarification. Are there any plans to add this functionality (getting the number of steps per epoch or total number of steps) into the official API?

Also, what about @izikgo's question about multiple nodes?

@rohitgr7
Contributor

rohitgr7 commented Sep 5, 2021

Are there any plans to add this functionality (getting the number of steps per epoch or total number of steps) into the official API?

I think it might be a bit risky to add such functionality, since it calls len(self.train_dataloader()) and we have the argument reload_dataloader_every_n_epoch. So there is a possibility of not having the same number of batches in each training epoch, and since configure_optimizers is called just once, they might not stay in sync. There might be other issues as well, since this depends on dynamic state that is prone to change while training.

@cowwoc
Contributor

cowwoc commented Sep 5, 2021

To which I say... If the committers can't figure out how to implement this safely, what chance do end-users have? :)

@RuRo
Contributor

RuRo commented Oct 20, 2021

@tchaton do you mind reopening this issue? Since this seems to be such a non-trivial problem, perhaps lightning should provide steps_per_epoch (or num_training_steps) property out of the box?

tchaton reopened this Oct 21, 2021
@tchaton
Contributor

tchaton commented Oct 21, 2021

Dear @RuRo,

Done. We will reconsider this feature, but it is hard to estimate correctly due to varying dataloader lengths / iterators, etc.

Best,
T.C

@RuRo
Contributor

RuRo commented Oct 21, 2021

@tchaton Thanks. A few potential solutions, that I would personally be okay with:

  1. Just assume that the train_dataloader length is deterministic w.r.t the hparams/arguments/passed train_dataloader/datamodule/etc. Blame the user if it is not. 😁

  2. Instead of adding a @property, add something like a current_steps_per_epoch value (emphasis on "current_").

    • Automatically update that value after each train_dataloader call.

    • Call configure_optimizers after the first train_dataloader is already obtained.
      (Not sure if this is viable with the current train loop implementation)

    This doesn't really solve the problem if reload_dataloader_every_n_epoch is used, but at least the current_steps_per_epoch value itself is technically always correct and doesn't "lie" to the user.

I am not sure if the (reload_dataloader_every_n_epoch + non-deterministic dataloader length) situation is common in actual use cases and is even worth worrying about. I can't really imagine a case where the length of the dataloader is intentionally different after each call to train_dataloader.

@eladsegal
Contributor


@rohitgr7
When using DDP, trainer.num_training_batches returns the number of batches after they are already divided across the GPUs, so there is no need to multiply by num_devices in this case.

@rohitgr7
Contributor

rohitgr7 commented Oct 22, 2021

@eladsegal not using trainer.num_training_batches anywhere in the code because that isn't computed till the point configure_optimizers is called.

@RuRo
Contributor

RuRo commented Oct 24, 2021

@rohitgr7 I just tried to use the code you provided and got an

ValueError: Tried to step 16552 times. The specified number of total steps is 16550

error halfway through training. I use DP mode with 2 gpus.

I think, I remember reading that DataParallel splits the batch size instead of duplicating it on each device like DistributedDataParallel. Could that be what is causing the problem here? Any ideas on how to fix this?

@rohitgr7
Contributor

@RuRo maybe. With DP, I believe the total training steps don't scale with the number of devices, so the above code might not be correct for that use case. Also, it's not robust, since I wrote it a long time back. We are thinking of addressing this issue internally, although community support/suggestions are always helpful :)

@cowwoc
Contributor

cowwoc commented Nov 9, 2021

Watch out. Version 1.5.0 changes the default value of max_steps from None to -1 so:

        if self.trainer.max_steps:
            return self.trainer.max_steps

has to be changed to:

        if self.trainer.max_steps != -1:
            return self.trainer.max_steps
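Given the changed sentinel, a small guard covering both defaults can help (a sketch; the function name is illustrative):

```python
def user_set_max_steps(max_steps):
    """True only if the user actually requested a step limit.

    Trainer(max_steps=...) defaulted to None before Lightning 1.5
    and to -1 from 1.5 on; both sentinels mean "no limit".
    """
    return max_steps is not None and max_steps != -1
```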

@stale

stale bot commented Dec 10, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the 'won't fix' label Dec 10, 2021
@tchaton
Contributor

tchaton commented Dec 10, 2021

Marking to keep it alive.

stale bot removed the 'won't fix' label Dec 10, 2021
@talhaanwarch

Any solution? I am getting this error: ValueError: Tried to step 1392 times. The specified number of total steps is 1390.
I am using OneCycleLR:

scheduler = OneCycleLR(opt, max_lr=1e-2, epochs=10, steps_per_epoch=len(df_train)//self.batch_size)

If I change it to the following, would it not increase the computational time by loading the data twice?

scheduler = OneCycleLR(opt, max_lr=1e-2, epochs=10, steps_per_epoch=len(self.train_dataloader()))
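One likely cause of mismatches like the error above: len(df_train) // self.batch_size floors the batch count, while a DataLoader with drop_last=False also yields the final partial batch. A sketch of the difference (the sample counts are illustrative):

```python
import math

def batches_per_epoch(n_samples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields per epoch."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)
```

If the scheduler is sized from the floored count but the loader yields the extra partial batch each epoch, the step counts can drift apart over training.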

@stale

stale bot commented Jan 15, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the 'won't fix' label Jan 15, 2022
@cowwoc
Contributor

cowwoc commented Jan 15, 2022

Not stale.

@stale

stale bot commented Feb 17, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the 'won't fix' label Feb 17, 2022
@cowwoc
Contributor

cowwoc commented Feb 17, 2022

See related PR: #11599

stale bot removed the 'won't fix' label Feb 17, 2022
@celsofranssa
Author


Hello @rohitgr7,
Your num_training_steps solution above stops working with pytorch-lightning~=1.5.10, and GitHub is raising security warnings about pytorch-lightning<=1.5.8.

@carmocca
Contributor

carmocca commented Apr 6, 2022

Fixed with #11599

carmocca closed this as completed Apr 6, 2022
@JulesGM
Contributor

JulesGM commented Sep 8, 2022

@carmocca does this work with multi node training?

@carmocca
Contributor

carmocca commented Sep 9, 2022

It should.

@ayansengupta17

ayansengupta17 commented Apr 10, 2023

For Lightning 2.0

def num_steps(self) -> int:
    """Get number of steps"""
    # Accessing _data_source is flaky and might break
    dataset = self.trainer.fit_loop._data_source.dataloader()
    dataset_size = len(dataset)
    num_devices = max(1, self.trainer.num_devices)
    num_steps = dataset_size * self.trainer.max_epochs // (self.trainer.accumulate_grad_batches * num_devices)
    return num_steps

@carmocca
Contributor

@ayansengupta17 In your example, I suggest accessing trainer.train_dataloader instead to get your original dataloader.

@mwulmer

mwulmer commented Jul 20, 2023

I am using lightning 2.0.5 and it seems that self.trainer.train_dataloader returns the combined dataloader and len(self.trainer.train_dataloader) = len(dataset)/(batchsize * num_devices). Can somebody confirm that? The function I am using is:

@property
def num_training_steps(self) -> int:
    """Get number of training steps"""
    if self.trainer.max_steps > -1:
        return self.trainer.max_steps

    self.trainer.fit_loop.setup_data()
    dataset_size = len(self.trainer.train_dataloader)
    num_steps = dataset_size * self.trainer.max_epochs

    return num_steps

@rachtibat

It is important that configure_optimizers returns

return [optimizer], [{"scheduler": scheduler, "interval": "step"}]

and then

@property
def num_training_steps(self) -> int:
    """Get number of training steps"""
    if self.trainer.max_steps > -1:
        return self.trainer.max_steps

    self.trainer.fit_loop.setup_data()
    dataset_size = len(self.trainer.train_dataloader)
    num_devices = max(1, self.trainer.num_devices)  # was undefined in the snippet as posted
    num_steps = dataset_size * self.trainer.max_epochs // (self.trainer.accumulate_grad_batches * num_devices)
    return num_steps

works superb!

@alancneves

Regarding @mwulmer's snippet above:

I can confirm that len(self.trainer.train_dataloader) = len(dataset)/(batchsize * num_devices).
In my case, I had to change your last line to include gradient accumulation, leading to:
num_steps = dataset_size * self.trainer.max_epochs // self.trainer.accumulate_grad_batches

Worked perfectly!
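Putting the last two comments together for Lightning 2.x, where len(trainer.train_dataloader) is reportedly already divided across devices, the total-steps arithmetic reduces to (a sketch with illustrative names):

```python
def total_steps_v2(per_device_batches, max_epochs, accumulate_grad_batches=1):
    # per_device_batches: len(trainer.train_dataloader) after
    # trainer.fit_loop.setup_data(); already divided across devices,
    # so no extra num_devices factor is needed here.
    return per_device_batches * max_epochs // accumulate_grad_batches
```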
