
Number of steps per epoch #5449

Closed
celsofranssa opened this issue Jan 10, 2021 · 46 comments
Labels
question (Further information is requested), waiting on author (Waiting on user action, correction, or update)

Comments

@celsofranssa

Some learning rate schedulers, such as OneCycleLR, require the number of steps per epoch.

How, then, can one get the number of steps within the configure_optimizers(self) scope?

Note: Training data is provided when calling fit:

# training
dm.setup('fit')
trainer.fit(model, datamodule=dm)
celsofranssa added the 'question' label Jan 10, 2021
@tchaton
Contributor

tchaton commented Jan 11, 2021

This should work just fine.

def configure_optimizers(self):
    num_batches = len(self.train_dataloader()) / self.trainer.accumulate_grad_batches
    ...
    return [optim, ...], [scheduler, ...]

Note: If you pass train/val dataloaders or a datamodule directly into the .fit function, Lightning will override the train_dataloader() method with the provided one, so it should work fine.

Here is a good approximation for the total number of steps.

    @property
    def num_training_steps(self) -> int:
        """Total training steps inferred from datamodule and devices."""
        dataset = self.train_dataloader()
        if self.trainer.max_steps:
            return self.trainer.max_steps

        dataset_size = (
            self.trainer.limit_train_batches
            if self.trainer.limit_train_batches != 0
            else len(dataset)
        )

        num_devices = max(1, self.trainer.num_gpus, self.trainer.num_processes)
        if self.trainer.tpu_cores:
            num_devices = max(num_devices, self.trainer.tpu_cores)

        effective_batch_size = dataset.batch_size * self.trainer.accumulate_grad_batches * num_devices
        return (dataset_size // effective_batch_size) * self.trainer.max_epochs

Closing this issue as the answer should work out. Feel free to re-open it if it doesn't solve your problem.

tchaton added the 'waiting on author' label Jan 11, 2021
tchaton closed this as completed Jan 11, 2021
@celsofranssa
Author

When using num_batches = len(self.train_dataloader()) / self.trainer.accumulate_grad_baches, the following error happens:

Traceback (most recent call last):
  File "xCoFormer.py", line 169, in perform_tasks
    fit(hparams)
  File "xCoFormer.py", line 84, in fit
    trainer.fit(model, datamodule=dm)
  File "/home/celso/projects/venvs/xCoFormer/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 456, in fit
    self.accelerator_backend.setup(model)
  File "/home/celso/projects/venvs/xCoFormer/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 52, in setup
    self.setup_optimizers(model)
  File "/home/celso/projects/venvs/xCoFormer/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 145, in setup_optimizers
    optimizers, lr_schedulers, optimizer_frequencies = self.trainer.init_optimizers(model)
  File "/home/celso/projects/venvs/xCoFormer/lib/python3.7/site-packages/pytorch_lightning/trainer/optimizers.py", line 30, in init_optimizers
    optim_conf = model.configure_optimizers()
  File "/home/celso/projects/xCoFormer/source/model/JointEncoder.py", line 44, in configure_optimizers
    print("num_batches: ", len(self.train_dataloader()) / self.trainer.accumulate_grad_baches)
AttributeError: 'Trainer' object has no attribute 'accumulate_grad_baches'

@celsofranssa
Author

Closing this issue as the answer should work out. Feel free to re-open it if it doesn't solve your problem.

Did you mean to create a new issue?

@rohitgr7
Contributor

rohitgr7 commented Jan 11, 2021

@celsofranssa it's self.trainer.accumulate_grad_batches. Just a typo there.

@celsofranssa
Author

I got it, thanks!

@celsofranssa
Author

From the num_training_steps snippet above:

    dataset_size = (
        self.trainer.limit_train_batches
        if self.trainer.limit_train_batches != 0
        else len(dataset)
    )

dataset_size is being set to 1.0 instead of the real number of batches in the dataloader because:

# default used by the Trainer
trainer = Trainer(limit_train_batches=1.0)

@KevinMusgrave

@tchaton I don't think the num_training_steps function works. As @celsofranssa pointed out, dataset_size gets set to 1, so the function returns 0 because (dataset_size // effective_batch_size) equals 0.

@tsteffek

tsteffek commented Feb 5, 2021

I just stumbled upon the same problem, and tried len(self.train_dataloader()) // self.trainer.accumulate_grad_batches. Works for me.

Your misunderstanding is that dataset_size sets the size to 1. Note that the default is 1.0, a float. Floats will be used to scale the actual size: The default is therefore actual_size * 1.0.

Then again, I'm not opposed to a convenience function/parameter steps_per_epoch.
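tsteffek's point can be made concrete with a tiny helper mirroring the documented limit_train_batches semantics (an int caps the batch count, a float scales it); the function name here is illustrative:

```python
def resolve_train_batches(total_batches, limit):
    """Mimic limit_train_batches: a float scales the dataloader
    length, an int caps it."""
    if isinstance(limit, float):
        return int(total_batches * limit)
    return min(total_batches, limit)
```

With the Trainer default of 1.0, resolve_train_batches(100, 1.0) is 100, not 1, which is the source of the confusion above.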

@KevinMusgrave

KevinMusgrave commented Feb 5, 2021

I was referring to the num_training_steps function posted above.

@tsteffek

tsteffek commented Feb 5, 2021

Oh, that dataset_size. Same name everywhere, it's too confusing. Sorry then.

@rohitgr7
Contributor

rohitgr7 commented Feb 5, 2021

@property
def num_training_steps(self) -> int:
    """Total training steps inferred from datamodule and devices."""
    if self.trainer.max_steps:
        return self.trainer.max_steps

    limit_batches = self.trainer.limit_train_batches
    batches = len(self.train_dataloader())
    batches = min(batches, limit_batches) if isinstance(limit_batches, int) else int(limit_batches * batches)     

    num_devices = max(1, self.trainer.num_gpus, self.trainer.num_processes)
    if self.trainer.tpu_cores:
        num_devices = max(num_devices, self.trainer.tpu_cores)

    effective_accum = self.trainer.accumulate_grad_batches * num_devices
    return (batches // effective_accum) * self.trainer.max_epochs

@tsteffek @KevinMusgrave this should work? Please let me know if I missed any edge cases and I will update.
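For testing the arithmetic in isolation, the same computation can be written as a pure function of plain values (a sketch; the parameter names mirror the Trainer attributes used above):

```python
def estimate_total_steps(batches, max_epochs, limit_train_batches=1.0,
                         accumulate_grad_batches=1, num_devices=1, max_steps=None):
    # Mirrors the property above: honor max_steps if set, apply
    # limit_train_batches (int caps, float scales), then divide by the
    # effective accumulation across devices.
    if max_steps:
        return max_steps
    if isinstance(limit_train_batches, int):
        batches = min(batches, limit_train_batches)
    else:
        batches = int(limit_train_batches * batches)
    effective_accum = accumulate_grad_batches * num_devices
    return (batches // effective_accum) * max_epochs
```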

@KevinMusgrave

@rohitgr7 Thanks, that seems to work 👍

@tsteffek

tsteffek commented Feb 5, 2021

So I've been thinking that I've seen that functionality somewhere, and indeed there is a trainer.num_training_batches.

Caveat: it's 0 at the time of configure_optimizers, sadly.

Leaving that nugget of knowledge in case that helps someone.

@izikgo

izikgo commented Jun 25, 2021

Regarding rohitgr7's num_training_steps snippet above:

Does this still hold for multi-node training? I think self.trainer.num_gpus only registers number of gpus per host, not the global amount. num_devices should probably be multiplied by self.trainer.num_nodes

@cowwoc
Contributor

cowwoc commented Sep 3, 2021

The above functions did not yield the correct number of steps per epoch for me, so I dug into the source code of progress.py (on_train_epoch_start(self, trainer, pl_module)) and came up with this:

    # Assumes `import math` and that the module defines
    # self.train_dataset, self.val_dataset and self.batch_size.
    @property
    def total_train_batches(self) -> int:
        """
        The total number of training batches during training, which may change from epoch to epoch.
        """
        return math.ceil(len(self.train_dataset) / self.batch_size)

    @property
    def total_val_batches(self) -> int:
        """
        The total number of validation batches during validation, which may change from epoch to epoch.
        """
        total_val_batches = 0
        if self.trainer.enable_validation:
            is_val_epoch = (self.trainer.current_epoch + 1) % self.trainer.check_val_every_n_epoch == 0
            total_val_batches = math.ceil(len(self.val_dataset) / self.batch_size) if is_val_epoch else 0
        return total_val_batches

    @property
    def training_steps_per_epoch(self) -> int:
        """
        The number of training steps per epoch.

        Taken from progress.py on_train_epoch_start(self, trainer, pl_module)
        """
        total_train_batches = self.total_train_batches
        total_val_batches = self.total_val_batches

        # val can be checked multiple times per epoch
        val_check_batch = int(total_train_batches * self.trainer.val_check_interval)
        val_check_batch = max(1, val_check_batch)
        val_checks_per_epoch = total_train_batches // val_check_batch

        total_val_batches = total_val_batches * val_checks_per_epoch
        return total_train_batches + total_val_batches

Now the function returns the same number of steps per epoch as the progress bar and can be invoked from configure_optimizers(). I hope this helps someone else.

@tchaton
Contributor

tchaton commented Sep 3, 2021

Hey @cowwoc,

I am not sure you should be using the progress bar as it counts the total number of batches across processes + adds validation.

I believed the previous implementation was more accurate.

Best,
T.C

@cowwoc
Contributor

cowwoc commented Sep 3, 2021

@tchaton Is it wrong to include the number of validation steps? Doesn't lr_scheduler.step() get invoked during validation as well?

@tchaton
Contributor

tchaton commented Sep 3, 2021

No, it doesn't.

@cowwoc
Contributor

cowwoc commented Sep 3, 2021

@tchaton Okay, thanks for the clarification. Are there any plans to add this functionality (getting the number of steps per epoch or total number of steps) into the official API?

Also, what about @izikgo's question about multiple nodes?

@rohitgr7
Contributor

rohitgr7 commented Sep 5, 2021

Are there any plans to add this functionality (getting the number of steps per epoch or total number of steps) into the official API?

I think it might be a bit risky to add such functionality, since it calls len(self.train_dataloader()) and we have the argument reload_dataloader_every_n_epoch. So there is a possibility of not having the same number of batches in each training epoch, and since configure_optimizers is called just once, they might not stay in sync. There might be other issues as well, since this depends on dynamic state that is prone to change while training.

@cowwoc
Contributor

cowwoc commented Sep 5, 2021

To which I say... If the committers can't figure out how to implement this safely, what chance do end-users have? :)

@RuRo
Contributor

RuRo commented Oct 20, 2021

@tchaton do you mind reopening this issue? Since this seems to be such a non-trivial problem, perhaps lightning should provide steps_per_epoch (or num_training_steps) property out of the box?

tchaton reopened this Oct 21, 2021
@tchaton
Contributor

tchaton commented Oct 21, 2021

Dear @RuRo,

Done. We will reconsider this feature, but it is hard to estimate correctly due to varying dataloader lengths / iterators, etc.

Best,
T.C

@RuRo
Contributor

RuRo commented Oct 21, 2021

@tchaton Thanks. A few potential solutions, that I would personally be okay with:

  1. Just assume that the train_dataloader length is deterministic w.r.t the hparams/arguments/passed train_dataloader/datamodule/etc. Blame the user if it is not. 😁

  2. Instead of adding a @property, add something like a current_steps_per_epoch value (emphasis on "current_").

    • Automatically update that value after each train_dataloader call.

    • Call configure_optimizers after the first train_dataloader is already obtained.
      (Not sure if this is viable with the current train loop implementation)

    This doesn't really solve the problem if reload_dataloader_every_n_epoch is used, but at least the current_steps_per_epoch value itself is technically always correct and doesn't "lie" to the user.

I am not sure if the (reload_dataloader_every_n_epoch + non-deterministic dataloader length) situation is common in actual use cases and is even worth worrying about. I can't really imagine a case where the length of the dataloader is intentionally different after each call to train_dataloader.

@eladsegal
Contributor


@rohitgr7
When using DDP, trainer.num_training_batches returns the number of batches after they are already divided across the GPUs, so there is no need to multiply by num_devices in this case.

@rohitgr7
Contributor

rohitgr7 commented Oct 22, 2021

@eladsegal not using trainer.num_training_batches anywhere in the code because that isn't computed till the point configure_optimizers is called.

@RuRo
Contributor

RuRo commented Oct 24, 2021

@rohitgr7 I just tried to use the code you provided and got an

ValueError: Tried to step 16552 times. The specified number of total steps is 16550

error halfway through training. I use DP mode with 2 gpus.

I think, I remember reading that DataParallel splits the batch size instead of duplicating it on each device like DistributedDataParallel. Could that be what is causing the problem here? Any ideas on how to fix this?

@rohitgr7
Contributor

@RuRo maybe. With DP, I believe the total training steps don't scale with the number of devices, so the above code might not be correct for that use case. Also, it's not robust, since I wrote it a long time back. We are thinking of addressing this issue internally, although community support/suggestions are always helpful :)

@cowwoc
Contributor

cowwoc commented Nov 9, 2021

Watch out. Version 1.5.0 changes the default value of max_steps from None to -1 so:

        if self.trainer.max_steps:
            return self.trainer.max_steps

has to be changed to:

        if self.trainer.max_steps != -1:
            return self.trainer.max_steps
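Given the changed sentinel, a small guard covering both defaults can help (a sketch; the function name is illustrative):

```python
def user_set_max_steps(max_steps):
    """True only if the user actually requested a step limit.

    Trainer(max_steps=...) defaulted to None before Lightning 1.5
    and to -1 from 1.5 on; both sentinels mean "no limit".
    """
    return max_steps is not None and max_steps != -1
```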

@stale

stale bot commented Dec 10, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the 'won't fix' label Dec 10, 2021
@tchaton
Contributor

tchaton commented Dec 10, 2021

Marking to keep it alive.

stale bot removed the 'won't fix' label Dec 10, 2021
@talhaanwarch

Any solution? I am getting this error: ValueError: Tried to step 1392 times. The specified number of total steps is 1390.
I am using OneCycleLR:

scheduler = OneCycleLR(opt, max_lr=1e-2, epochs=10, steps_per_epoch=len(df_train)//self.batch_size)

If I change it to the following, would it not increase the computational time by loading the data twice?

scheduler = OneCycleLR(opt, max_lr=1e-2, epochs=10, steps_per_epoch=len(self.train_dataloader()))
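One likely cause of mismatches like the error above: len(df_train) // self.batch_size floors the batch count, while a DataLoader with drop_last=False also yields the final partial batch. A sketch of the difference (the sample counts are illustrative):

```python
import math

def batches_per_epoch(n_samples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields per epoch."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)
```

If the scheduler is sized from the floored count but the loader yields the extra partial batch each epoch, the step counts can drift apart over training.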

@stale

stale bot commented Jan 15, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the 'won't fix' label Jan 15, 2022
@cowwoc
Contributor

cowwoc commented Jan 15, 2022

Not stale.

@stale

stale bot commented Feb 17, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the 'won't fix' label Feb 17, 2022
@cowwoc
Contributor

cowwoc commented Feb 17, 2022

See related PR: #11599

stale bot removed the 'won't fix' label Feb 17, 2022
@celsofranssa
Author


Hello @rohitgr7,
Your num_training_steps solution above stops working with pytorch-lightning~=1.5.10, and GitHub is raising security warnings about pytorch-lightning<=1.5.8.

@carmocca
Contributor

carmocca commented Apr 6, 2022

Fixed with #11599

carmocca closed this as completed Apr 6, 2022
@JulesGM
Contributor

JulesGM commented Sep 8, 2022

@carmocca does this work with multi node training?

@carmocca
Contributor

carmocca commented Sep 9, 2022

It should.

@ayansengupta17

ayansengupta17 commented Apr 10, 2023

For Lightning 2.0

def num_steps(self) -> int:
    """Get number of steps"""
    # Accessing _data_source is flaky and might break
    dataset = self.trainer.fit_loop._data_source.dataloader()
    dataset_size = len(dataset)
    num_devices = max(1, self.trainer.num_devices)
    num_steps = dataset_size * self.trainer.max_epochs // (self.trainer.accumulate_grad_batches * num_devices)
    return num_steps

@carmocca
Contributor

@ayansengupta17 In your example, I suggest accessing trainer.train_dataloader instead to get your original dataloader.

@mwulmer

mwulmer commented Jul 20, 2023

I am using lightning 2.0.5 and it seems that self.trainer.train_dataloader returns the combined dataloader and len(self.trainer.train_dataloader) = len(dataset)/(batchsize * num_devices). Can somebody confirm that? The function I am using is:

@property
def num_training_steps(self) -> int:
    """Get number of training steps"""
    if self.trainer.max_steps > -1:
        return self.trainer.max_steps

    self.trainer.fit_loop.setup_data()
    dataset_size = len(self.trainer.train_dataloader)
    num_steps = dataset_size * self.trainer.max_epochs

    return num_steps

@rachtibat

It is important that configure_optimizers returns

return [optimizer], [{"scheduler": scheduler, "interval": "step"}]

and then

@property
def num_training_steps(self) -> int:
    """Get number of training steps"""
    if self.trainer.max_steps > -1:
        return self.trainer.max_steps

    self.trainer.fit_loop.setup_data()
    dataset_size = len(self.trainer.train_dataloader)
    num_devices = max(1, self.trainer.num_devices)  # was undefined in the snippet as posted
    num_steps = dataset_size * self.trainer.max_epochs // (self.trainer.accumulate_grad_batches * num_devices)
    return num_steps

works superb!

@alancneves

Regarding @mwulmer's snippet above:

I can confirm that len(self.trainer.train_dataloader) = len(dataset)/(batchsize * num_devices).
In my case, I had to change your last line to include gradient accumulation, leading to:
num_steps = dataset_size * self.trainer.max_epochs // self.trainer.accumulate_grad_batches

Worked perfectly!
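Putting the last two comments together for Lightning 2.x, where len(trainer.train_dataloader) is reportedly already divided across devices, the total-steps arithmetic reduces to (a sketch with illustrative names):

```python
def total_steps_v2(per_device_batches, max_epochs, accumulate_grad_batches=1):
    # per_device_batches: len(trainer.train_dataloader) after
    # trainer.fit_loop.setup_data(); already divided across devices,
    # so no extra num_devices factor is needed here.
    return per_device_batches * max_epochs // accumulate_grad_batches
```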
