Why I have to write `.epoch_end` and all the `*_step` #2619

FrancescoSaverioZuppichini · 2020-07-16T08:11:59Z

🚀 Feature

Dear all,

Thank you for this amazing library. However, there are some features that leads to avoidable boilerplate code.

Let's take an example

class TestModule(LightningModule):
    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model

    def forward(self, x):
        return self.model(x)

    def get_metrics(self, pred, y):
        return {
            'accuracy': accuracy(pred, y),
            'f1': f1_score(pred, y, num_classes=2),
            'roc_auc': auroc(pred, y)
        }

    def step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        pred = torch.argmax(y_hat, dim=1)
        metrics = self.get_metrics(pred, y)
        loss = F.cross_entropy(y_hat, y)
        
        return loss, metrics

    def training_step(self, batch, batch_idx):
        loss, metrics = self.step(batch, batch_idx)
        return {'loss': loss, 'metrics': metrics}

    def validation_step(self, batch, batch_idx):
        loss, metrics = self.step(batch, batch_idx)
        return {'val_loss': loss, 'metrics': metrics}

    def test_step(self, batch, batch_idx):
        loss, metrics = self.step(batch, batch_idx)
        return {'test_loss': loss, 'metrics': metrics}

Why I have to write all the `*_step` methods

Most of the time you want to use the same loss and return the same metrics you use in training_step.

We should have a .step the method that, if one of the *_step methods are not implemented, it will be used instead, this will remove most of the boilerplate code.

Why I have to manually compute the average metrics for each epoch.

I was expecting Pytorch Lighting to compute the average of each metric by itself as all other libraries do. Is there a specific reason for not doing it?

Thank you,

Best Regards,

Francesco

The text was updated successfully, but these errors were encountered:

github-actions · 2020-07-16T08:12:53Z

Hi! thanks for your contribution!, great first issue!

awaelchli · 2020-07-16T15:35:29Z

See #2615 PR that tries to simplify this.

FrancescoSaverioZuppichini · 2020-07-16T18:41:25Z

@awaelchli Thank you for your reply. IMHO the new syntax proposed in #2615 seems even weirded and does not seem to solve the issue.

Why do you have to reinvent the wheel? Most libraries have something like:

tr = Trainer(..., batch_metrics=[Accuracy()], epoch_metrics=[Accuracy(), ROCAUC()]
...

This has three advantages:

it makes sense
no need to explain in the doc how to properly return the metrics or to return them in order to be added to the logger etc
no need to write more code than needed

Then one can also choose to manually return a metric from the *_step functions.

To save memory and compute the epochs metrics you can have default names to the metric and store the batch ones only if the epoch one exists.

What do you think?

Cheers,

Francesco

awaelchli · 2020-07-16T19:43:22Z

Why do you have to reinvent the wheel? Most libraries have something like:

I disagree. PyTorchLighting is not trying to be like other libraries. Which parts do you see as "reinvented"?

it makes sense

Not to me by a large margin, but your example code for me perfectly highlights the design concepts behind LightningModule: Everything that is specific to the training and evaluation to a system goes into the LightningModule. The Trainer can be thought of the encapsulation of all the "boilerplate" PyTorch training loop etc and does not contain any research specific properties (like metrics or how losses are computed etc.)
A Trainer object can also be used for several different LightningModules. So having these metrics as argument to the Trainer is not a good thing in my view.

no need to explain in the doc how to properly return the metrics or to return them in order to be added to the logger etc

It looks like the docs will become simpler with the structured results feature, but remain backward compatible.

no need to write more code than needed

With the structured results, you won't have to implement the *_end and *_epoch_end unless you want to do crazy things as I understand it.

FrancescoSaverioZuppichini · 2020-07-16T20:13:54Z

Thank you @awaelchli for the fast reply!

I disagree. PyTorchLighting is not trying to be like other libraries. Which parts do you see as "reinvented"?

It remembers me of web development some years ago, everybody claiming they were doing different things ... but this is out of topic

Well, in my opinion, if I had to rewrite three times the same code train_step, val_step, etc.. it means that something was not well made and can be definitely improved! I hope you agree to at least one point :) To be honest, you have to rewrite the same code three times for the *_step methods and three-time for the *_end methods if you want to also log something at the end of each epoch.

So having these metrics as argument to the Trainer is not a good thing in my view.

You can pass them to the .fit method :)

Anyway, thank you again for the fast reply. I really appreciate it!

Cheers,

Francesco

awaelchli · 2020-07-17T15:22:24Z

You can pass them to the .fit method :)

My same argument applies there. It's not different from the Trainer args case. I want the metrics etc. in my Lightningmodule, and not outside. If I wanted to pass them in from the outside, I would make it a hyperparameter.

@PyTorchLightning/core-contributors any thoughts here?

FrancescoSaverioZuppichini · 2020-07-25T06:48:09Z

@awaelchli Thank you for your reply. At this point, I don't see any point in continuing the discussion. I will just copy and paste the code over and over again hoping somebody will implement a new library with better practices soon. Thank you.

Be safe!

Francesco

awaelchli · 2020-07-25T22:34:46Z

About the code repetitions you mention, there is a way to get rid of them. You can define your default val- and test_step methods in a separate class and then inherit it in your various LightningModules. Btw I like the idea of the general purpose step method.

williamFalcon · 2020-07-26T01:51:54Z

in the latest version on master you don’t need step_end or epoch_end.

to use the same step you can just call validation_step from training_step.

docs will be updated in a few days.

Before

(to be fair, only training_step was required)

def training_step(...):
    return loss

def training_step_end(...):
    ...

def training_epoch_end(...):
   ...

Now:

def training_step(...):
    result = TrainResult(minimize=loss)
    return result

To use the same loop for val, test and train

def step(...):
   return something

def training_step(...):
    something = self.step(...)
    result = TrainResult(minimize=something)
    return result

def validation_step(...):
    something = self.step(...)
    result = EvalResult(checkpoint_on=something)
    result.log('val_loss', something)
    return result

def test_step(...):
    result = self.validation_step(...)
    result.log('test_loss', something)
    return result

Maybe for your applications you use the same loop for all three... however, that's very unusual and usually what happens in the train step is different than the other loops. And the metrics need to be names according to what loop you are in... For instance in train loop you log or track things on every batch. On val/test loop you only care about full epoch statistics.

So, I'm curious @FrancescoSaverioZuppichini. In what application are you able to use the same loop for everything?

williamFalcon · 2020-07-26T05:45:18Z

Ok, docs have been updated!

Quick start guide

https://pytorch-lightning.readthedocs.io/en/latest/new-project.html

Or the walk-through

https://pytorch-lightning.readthedocs.io/en/latest/introduction_guide.html#validating

@FrancescoSaverioZuppichini is that more in line with what you were hoping for? :)

FrancescoSaverioZuppichini · 2020-07-26T06:47:47Z

Hi @williamFalcon,

I hope you are doing great. Thank you for the reply, I am doing image classification and I want to return the same metrics from all my dataset splits.

I have found it very hard to understand what I have to return from each *_step and I am used to code a lot. I am sorry to say but the new APIs do not look great, it seems (to me) to have even complicated more the codebase. What I want is to just pass a metric and that's it, it should work. I think this is something that people expect to be super easy because it is!

Let me know if I can help or you are fine with the current implementation.

Thanks :)

Francesco

williamFalcon · 2020-07-26T12:12:56Z

Could you give me an example of how it might work? if it meets the following requirements we could consider it:

it needs to allow the user to only log
it needs to allow the user to only show in progress bar
it can allow both or neither
it can be the metric for every step
or it can be the metric for the full epoch
or it can be both
and for more complicated use cases (lightning js a Pro tool after all and not something for beginners), i might want to do a distributed softmax and would want access to all outputs of all GPUs on the same node.
the thing i want to log is not just MSE loss or accuracy... but something more complicated like a GAN loss with a custom regularizer or even a perceptual loss using MSE on VGG feature maps.

i think your metrics passed to trainer suggestion is interesting but i think this has a few pitfalls:

you don’t have control as a user of what actually happens in that metric.
models often depend on certain metrics... removing them from the model makes it unusable.
what about non standard metrics? if your work is 99% accuracy based then ok. but unlikely to be true in a majority of cases.

also, what about when i may want accuracy but calculated in a weird way or for a different type of batch structure? ie: metric learning or something. then my model is no longer self contained

thanks!

FrancescoSaverioZuppichini · 2020-09-13T17:09:20Z

@williamFalcon sorry for the late reply, I was super busy and this topic went under the radar, my bad. I see your point but (with all due respect) I think you are wrong. Unfortunately, I don't have the time know to show my counter-arguments and I suspect the discussion will go on forever. Very quickly, you can have default behavior and allow users to implement specific methods as you are doing now, this is classic OOP.

I am down for a quick voice chat if you would like to hear my suggestion in detail or I will comment here in the not so near future. In the meantime, I will drop PyTorch lighting since it doesn't speed up my productivity, and, sorry to say, bugs are everywhere!

Thank you for the discussion ;)

Bests

williamFalcon · 2020-09-13T18:41:11Z

I hear you!
Unfortunately Lightning is a professional tool... so, it's not meant to abstract details as you're suggesting.

This is why it's used to create new state of the art methods and new research.
(example: https://github.com/mindslab-ai/cotatron).

You might be looking for something else that doesn't require as much pytorch knowledge or is meant for simpler cases.

This is unfortunately not the aim of lightning.

FrancescoSaverioZuppichini · 2020-09-14T06:08:32Z

Hi @williamFalcon, thank you for the reply! I have a lot of programming knowledge and I have been using PyTorch for years. But if I have to write tons of code to just do a very basic image classification, then probably there is something not so well design in your library (+ bugs). Don't get me wrong, there are lots of good features.

Tensorflow is also used to create sota methods, but it is a terrible tool. IMHO it should be easy to use for the easy task and hard to use for the hard task, not hard for both. I hope no offense is taken, but because you are the creator you are not so willing to hear :)

Thank you :)

Bests,

P.S. You can close the issue now

FrancescoSaverioZuppichini added feature Is an improvement or enhancement help wanted Open to be worked on labels Jul 16, 2020

williamFalcon closed this as completed Sep 13, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Why I have to write `.epoch_end` and all the `*_step` #2619

Why I have to write `.epoch_end` and all the `*_step` #2619

FrancescoSaverioZuppichini commented Jul 16, 2020

github-actions bot commented Jul 16, 2020

awaelchli commented Jul 16, 2020

FrancescoSaverioZuppichini commented Jul 16, 2020 •

edited

Loading

awaelchli commented Jul 16, 2020

FrancescoSaverioZuppichini commented Jul 16, 2020 •

edited by Borda

Loading

awaelchli commented Jul 17, 2020

FrancescoSaverioZuppichini commented Jul 25, 2020

awaelchli commented Jul 25, 2020

williamFalcon commented Jul 26, 2020 •

edited

Loading

williamFalcon commented Jul 26, 2020 •

edited

Loading

FrancescoSaverioZuppichini commented Jul 26, 2020

williamFalcon commented Jul 26, 2020 •

edited

Loading

FrancescoSaverioZuppichini commented Sep 13, 2020 •

edited

Loading

williamFalcon commented Sep 13, 2020

FrancescoSaverioZuppichini commented Sep 14, 2020 •

edited

Loading

Why I have to write .epoch_end and all the *_step #2619

Why I have to write .epoch_end and all the *_step #2619

Comments

FrancescoSaverioZuppichini commented Jul 16, 2020

🚀 Feature

Why I have to write all the *_step methods

Why I have to manually compute the average metrics for each epoch.

github-actions bot commented Jul 16, 2020

awaelchli commented Jul 16, 2020

FrancescoSaverioZuppichini commented Jul 16, 2020 • edited Loading

awaelchli commented Jul 16, 2020

FrancescoSaverioZuppichini commented Jul 16, 2020 • edited by Borda Loading

awaelchli commented Jul 17, 2020

FrancescoSaverioZuppichini commented Jul 25, 2020

awaelchli commented Jul 25, 2020

williamFalcon commented Jul 26, 2020 • edited Loading

Before

Now:

To use the same loop for val, test and train

williamFalcon commented Jul 26, 2020 • edited Loading

Quick start guide

Or the walk-through

FrancescoSaverioZuppichini commented Jul 26, 2020

williamFalcon commented Jul 26, 2020 • edited Loading

FrancescoSaverioZuppichini commented Sep 13, 2020 • edited Loading

williamFalcon commented Sep 13, 2020

FrancescoSaverioZuppichini commented Sep 14, 2020 • edited Loading

Why I have to write `.epoch_end` and all the `*_step` #2619

Why I have to write `.epoch_end` and all the `*_step` #2619

Why I have to write all the `*_step` methods

FrancescoSaverioZuppichini commented Jul 16, 2020 •

edited

Loading

FrancescoSaverioZuppichini commented Jul 16, 2020 •

edited by Borda

Loading

williamFalcon commented Jul 26, 2020 •

edited

Loading

williamFalcon commented Jul 26, 2020 •

edited

Loading

williamFalcon commented Jul 26, 2020 •

edited

Loading

FrancescoSaverioZuppichini commented Sep 13, 2020 •

edited

Loading

FrancescoSaverioZuppichini commented Sep 14, 2020 •

edited

Loading