LearningRateFinder defining max validation batches for entire training loop #17412

Closed
blainehoak opened this issue Apr 18, 2023 · 1 comment · Fixed by #17636
Labels: bug (Something isn't working), help wanted (Open to be worked on), tuner, ver: 2.0.x
Milestone: 2.0.x

Comments

blainehoak commented Apr 18, 2023

Bug description

When the LearningRateFinder callback is used, the num_training_steps parameter passed at init (default: 100) ends up defining how many validation batches are run for the entire length of training. This means that if num_training_steps is smaller than the total number of batches in your validation set, every validation loop during training only sees a subset of the validation data.

What version are you seeing the problem on?

2.0+

How to reproduce the bug

The following code fails because trainer.num_val_batches[0] ends up being 5 (the LR finder's num_training_steps) instead of the expected 50.

import os

import torch
from torch.utils.data import DataLoader, Dataset

from lightning.pytorch import LightningModule, Trainer
from lightning.pytorch.callbacks import LearningRateFinder


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self, lr=0.1):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        self.lr = lr

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=self.lr)


def run():
    train_data = DataLoader(RandomDataset(32, 100), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 100), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        callbacks=[LearningRateFinder(num_training_steps=5)],
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    assert trainer.num_val_batches[0] == 50  # fails: the LR finder's limit of 5 batches is still in effect
    trainer.validate(model, dataloaders=val_data)


if __name__ == "__main__":
    run()

Using the same base code as above but removing the LearningRateFinder, this code passes:

    trainer = Trainer(
        default_root_dir=os.getcwd(),
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    assert trainer.num_val_batches[0] == 50
    trainer.validate(model, dataloaders=val_data)

However, num_val_batches does get updated once .validate() is called. With the LearningRateFinder added back in but the assert statement moved after the .validate() call, this code passes:

    trainer = Trainer(
        default_root_dir=os.getcwd(),
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        callbacks=[LearningRateFinder(num_training_steps=5)],
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    trainer.validate(model, dataloaders=val_data)
    assert trainer.num_val_batches[0] == 50

Environment

- PyTorch Lightning Version: 2.0.1.post0
- PyTorch Version: 2.0.0
- Python version: 3.10.9
- OS: Darwin (macOS)
- How you installed Lightning: pip

More info

I did some more digging into why this might be happening, and the problem likely comes from the fact that trainer.fit_loop.epoch_loop.val_loop.setup_data() is called for the first time while the learning rate finder is running, so trainer.fit_loop.epoch_loop.val_loop._max_batches gets set according to the parameters the learning rate finder passed in.

Even though the learning rate finder restores the trainer's original parameters once it finishes, the setup_data() method never runs a full setup again, so the _max_batches attribute is never updated.
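
A quick way to confirm the stale state is to inspect the attribute named above after training (a diagnostic sketch only; it pokes at private internals, so treat it as debugging aid rather than supported API):

# after trainer.fit(...) with LearningRateFinder(num_training_steps=5)
print(trainer.fit_loop.epoch_loop.val_loop._max_batches)  # [5] instead of the expected [50]
print(trainer.num_val_batches)                            # reflects the same stale limit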

One way to fix this might be to redo the data setup once the learning rate finder has completed, similar to how setup is redone when .validate() is called.
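
Until a fix lands, one possible user-side workaround (a sketch only, not the change merged in #17636) is to run the learning-rate search through the Tuner API on a separate, throwaway Trainer and then fit with a fresh Trainer that never sees the LearningRateFinder callback, so the finder cannot leak its dataloader limits into the training run. The num_training argument name and the suggestion() accessor below are assumptions about the 2.0 Tuner API; the rest reuses the BoringModel and RandomDataset from the reproduction above.

from lightning.pytorch import Trainer
from lightning.pytorch.tuner import Tuner

model = BoringModel()
train_data = DataLoader(RandomDataset(32, 100), batch_size=2)
val_data = DataLoader(RandomDataset(32, 100), batch_size=2)

# Run the LR search on a throwaway Trainer so its reduced dataloader limits
# never touch the Trainer used for the real fit.
search_trainer = Trainer(max_epochs=1, num_sanity_val_steps=0, enable_model_summary=False)
lr_finder = Tuner(search_trainer).lr_find(
    model, train_dataloaders=train_data, val_dataloaders=val_data, num_training=5
)
model.lr = lr_finder.suggestion()

# Fit with a fresh Trainer that the LR finder never ran against.
fit_trainer = Trainer(max_epochs=1, num_sanity_val_steps=0, enable_model_summary=False)
fit_trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
assert fit_trainer.num_val_batches[0] == 50  # full validation set is used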

blainehoak added the bug and needs triage labels on Apr 18, 2023
awaelchli added the tuner and help wanted labels and removed the needs triage label on Apr 23, 2023
awaelchli added this to the 2.0.x milestone on Apr 23, 2023
awaelchli (Contributor) commented

@blainehoak Thanks for reporting. Help on this would be appreciated :) You are right, the finder is probably not resetting all variables correctly.

Borda edited the issue title on May 3, 2023