[Bug]: Training EfficientAd failed: ZeroDivisionError: integer division or modulo by zero #1679
Comments
Hi @glucasol, it might not be the fix, but you have declared Padim instead of EfficientAD. I have also found this bug when I add auto_find_lr to the trainer parameters in a >0.7.0 anomalib config file. Still testing before I open a new issue.
Hi @isaacdominguez, thanks for the reply! My mistake, I had pasted the wrong import into the issue; it is correct now. The error is not due to the import. I will check the auto_find_lr parameter, thanks! @samet-akcay, any idea what it could be?
Ok, so I found it was not auto_find_lr; running some tests today, I hit the same error:
I still don't know why, but this has only happened to me with the EfficientAD model.
I'm a bit occupied with some other tasks; I will try to have a look ASAP.
Judging from the issue, the problem is in step_size, which equals 0. I think I encountered this once when I wanted to train EfficientAD for only 2 epochs, so something might be wrong with the epoch setup.
Something probably goes wrong here: anomalib/src/anomalib/models/image/efficient_ad/lightning_model.py, lines 236 to 248 (commit 5b79045),
so step_size is set to 0, which probably means num_steps is also 0, which in turn suggests something is wrong with the trainer's max_steps and max_epochs.
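The failure can be reproduced without anomalib at all: with Lightning's default max_steps = -1, the min of the two limits is negative, and int(0.95 * -1) truncates to 0, which StepLR then uses as a modulo divisor. A minimal sketch (the variable names mirror the snippet referenced above, but the concrete values are assumptions):

```python
# Sketch of how step_size can end up 0 (assumed reconstruction, not the exact anomalib code).
max_steps = -1          # Lightning's default when only max_epochs is given
max_epochs = 100        # user-specified limit (value assumed)
steps_per_epoch = 40    # len(train_dataloader()), value assumed

num_steps = min(max_steps, max_epochs * steps_per_epoch)  # min(-1, 4000) -> -1
step_size = int(0.95 * num_steps)                         # int(-0.95) truncates to 0
print(step_size)  # 0 -> a scheduler dividing by step_size raises ZeroDivisionError
```

This matches the reported traceback: the error surfaces in the scheduler, but the root cause is the unset step/epoch limit feeding a negative value into min.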
@blaz-r Thanks for the reply!
Due to the way EfficientAD training is specified, it's set like this in the config:
Just ran into the same issue, and the solution that @blaz-r mentioned solves it.
For me, the solution @blaz-r mentioned worked too. Thank you, guys!
Yeah, maybe we should add a check for that somewhere in the code. @alexriedel1, what do you think would be the best option here?
Without setting any steps or epochs, the trainer will default to max_epochs = 1000. The easiest way (which would also make the model a bit more epoch-agnostic) would be to use the max instead of the min here:
However, the original paper says it uses 70k steps for training. From my point of view, this was an arbitrary choice for training on typical anomaly datasets. Real-world datasets might need completely different numbers of steps and epochs, so I have no bad feeling about using the number of epochs instead of steps.
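To put the 70k-step budget into epoch terms (all concrete numbers below are illustrative assumptions, not values from the thread):

```python
from math import ceil

# Hypothetical numbers: a typical MVTec category has a few hundred training
# images, and EfficientAD-style training often uses batch size 1 (assumption).
train_images = 220
batch_size = 1
total_steps = 70_000  # step budget reported in the paper

steps_per_epoch = ceil(train_images / batch_size)    # 220 steps per epoch
epochs_needed = ceil(total_steps / steps_per_epoch)  # 319 epochs
print(epochs_needed)
```

This illustrates the point above: a fixed step budget maps to wildly different epoch counts depending on dataset size, so expressing the limit in epochs is a reasonable alternative.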
Thanks for the answer. If we had max here, and let's say max_epochs = 100, max_steps = 10, and "steps in each epoch" = 4, then the maximum would evaluate to 100 * 4, but training would stop at max_steps, which in turn means the scheduler would never make the reduction. So, with all this information, I would actually suggest adding an if statement like this:

```python
max_steps = self.trainer.max_steps
max_steps_from_epochs = self.trainer.max_epochs * len(self.trainer.datamodule.train_dataloader())

if max_steps == -1 or max_steps_from_epochs == -1:
    num_steps = max(max_steps, max_steps_from_epochs)
else:
    num_steps = min(max_steps, max_steps_from_epochs)
```

This sets num_steps to the minimum of the two if both are defined; if only one is defined, the max picks the other. If by any chance infinite training is specified, this would again fail, so it'd need another guard, but I'm not sure it's realistic to expect infinite training here?
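Wrapped as a standalone helper, the proposed selection logic can be exercised directly (a sketch under the assumption, as in Lightning, that -1 means "unset"):

```python
def resolve_num_steps(max_steps: int, max_steps_from_epochs: int) -> int:
    """Pick the effective step budget: min when both limits are set, max otherwise.

    Hypothetical helper illustrating the if-statement proposed in the thread;
    -1 is taken to mean the limit was not set.
    """
    if max_steps == -1 or max_steps_from_epochs == -1:
        return max(max_steps, max_steps_from_epochs)
    return min(max_steps, max_steps_from_epochs)

print(resolve_num_steps(-1, 4000))     # only the epoch limit set -> 4000
print(resolve_num_steps(70000, 4000))  # both set -> the stricter one, 4000
print(resolve_num_steps(70000, -1))    # only the step limit set -> 70000
```

Note that resolve_num_steps(-1, -1) would return -1, which is the "infinite training" case mentioned above that would still need its own guard.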
Ah yes, you're right! We want to recreate this behaviour of PyTorch Lightning:
Yeah, this would probably be the best solution. Do you want to make a PR, or should I?
You can go for it! Thanks!
Describe the bug
Hi, I have tried to run the following example with the latest anomalib version:
But it gives the following error:
Dataset
MVTec
Model
Other (please specify in the field below)
Steps to reproduce the behavior
Model: EfficientAd
OS information
OS information:
Expected behavior
Train the model successfully.
Screenshots
No response
Pip/GitHub
pip
What version/branch did you use?
Branch: main
anomalib version: v1
Configuration YAML
Default
Logs
Code of Conduct