NaN losses leads to overfitting #1796
Comments
Hi @sevstafiev, thank you for raising this issue. One thing I noticed from your snippet is that the learning rate is rather high. I would suggest reducing it to …
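For reference, a minimal sketch of how the learning rate of the MXNet-based estimator can be lowered through gluonts.mx.trainer.Trainer. The suggested value was cut off above, so the learning rate and epoch count below are purely illustrative:

from gluonts.mx.trainer import Trainer
from gluonts.mx.distribution import NegativeBinomialOutput
from gluonts.model.deepar import DeepAREstimator

estimator = DeepAREstimator(
    freq="D",
    prediction_length=28,
    context_length=60,
    distr_output=NegativeBinomialOutput(),
    # Illustrative values; Trainer's default learning rate is 1e-3.
    trainer=Trainer(epochs=10, learning_rate=1e-4),
)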
Hello,
The following snippet appears to reproduce the issue quite consistently:

import pandas as pd
import numpy as np

def first_sunday_of_month_sale():
    idx = pd.date_range(start="2021-01-01", periods=365, freq="D")
    data = [np.random.randint(65, 95) if ts.weekday() == 6 and ts.day <= 7 else 0 for ts in idx]
    return pd.Series(data, index=idx)

series = first_sunday_of_month_sale()

from gluonts.dataset.common import ListDataset
from gluonts.mx.distribution import NegativeBinomialOutput
from gluonts.model.deepar import DeepAREstimator

dataset = ListDataset(
    data_iter=[{"start": series.index[0], "target": series.values}],
    freq="D",
)

deepar_estimator = DeepAREstimator(
    freq="D",
    prediction_length=15,
    context_length=60,
    distr_output=NegativeBinomialOutput(),
)

predictor = deepar_estimator.train(dataset)

Example output:
Edit: The PyTorch implementation indeed gives much more meaningful results:

from gluonts.torch.model.deepar import DeepAREstimator as TorchDeepAREstimator
from gluonts.torch.modules.distribution_output import NegativeBinomialOutput as TorchNegativeBinomialOutput

torch_deepar_estimator = TorchDeepAREstimator(
    freq="D",
    prediction_length=15,
    context_length=60,
    distr_output=TorchNegativeBinomialOutput(),
    trainer_kwargs=dict(max_epochs=10),
)

torch_predictor = torch_deepar_estimator.train(dataset)

Example result:

forecasts = list(torch_predictor.predict(dataset))
forecasts[0].plot()

The prediction is not perfect (the spike is on the second Sunday of the month, while the training data displayed spikes on the first Sunday of the month) but definitely makes sense.
I think this suggests that the MXNet-based …
@sevstafiev apologies for the late intervention here, the problem may have been fixed with #1893: feel free to try again on your data using the code from the …
Description
I am using a retail store dataset with ~3000 time series. Their peculiarity is that there are "special days" on which the number of sales grows sharply (on average 1-2 sales per day, but ~80 on a sale day). The problem I ran into is that, as the number of epochs increases, the model eventually starts producing NaN loss, overfits, and predicts "special day" levels far too often.

Having studied the existing issues, I decided that adding a validation dataset would help against overfitting, since, as I understood, a stopping mechanism is built in alongside validation. For validation I take, for each time series, the last 60 days of the training range (matching context_length = 60) and the 28 days after the training range (matching prediction_length = 28). The 28 days after the validation window are used as a test set. This really helped to track improvements and degradations in model quality; however, after some time the model still starts producing NaN loss and the validation quality drops significantly, so the problem is not solved. Moreover, the model does not stop training, but runs through all the remaining batches and epochs while reporting NaN loss. As a result, after training it produces excessively high predictions.
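For concreteness, a rough sketch of the split described above. It assumes the raw data is available as a dict raw_series mapping item id to a daily pandas.Series covering the train, validation and test ranges (that name and layout are illustrative, not from the original report), and an estimator configured with prediction_length=28 and context_length=60:

from gluonts.dataset.common import ListDataset

prediction_length = 28
context_length = 60

# Training range: everything except the last 2 * prediction_length days
# (which are reserved for the validation and test horizons).
train_ds = ListDataset(
    data_iter=[
        {"start": s.index[0], "target": s.values[:-2 * prediction_length]}
        for s in raw_series.values()
    ],
    freq="D",
)

# Validation: the last context_length days of the training range plus the
# prediction_length days immediately after it, for each series.
validation_ds = ListDataset(
    data_iter=[
        {
            "start": s.index[-(context_length + 2 * prediction_length)],
            "target": s.values[-(context_length + 2 * prediction_length):-prediction_length],
        }
        for s in raw_series.values()
    ],
    freq="D",
)

# The final prediction_length days remain held out as the test range.
# Passing validation_data makes the trainer report validation_avg_epoch_loss.
predictor = estimator.train(training_data=train_ds, validation_data=validation_ds)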
In issue #833 the author was able to overcome the problem by putting a conditional breakpoint in log_prob() to stop whenever a NaN value is generated. But I did not understand how this can be done, or how to get to log_prob() at all. If you can tell me how to do this, then this would also be a good way to approach the problem.
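For what it is worth, one way to emulate that conditional breakpoint without debugger configuration is to wrap the distribution's log_prob and break manually when a NaN appears. This is only a hedged sketch of the idea from #833, not the code used there; it assumes the MXNet NegativeBinomial distribution and training with Trainer(hybridize=False), so that log_prob receives NDArrays rather than symbols:

import pdb
import numpy as np
from gluonts.mx.distribution import NegativeBinomial

_orig_log_prob = NegativeBinomial.log_prob

def _log_prob_with_nan_check(self, x):
    lp = _orig_log_prob(self, x)
    # Forcing evaluation with .asnumpy() is slow; only keep this while debugging.
    if np.isnan(lp.asnumpy()).any():
        pdb.set_trace()  # inspect self.mu, self.alpha and x at the offending batch
    return lp

NegativeBinomial.log_prob = _log_prob_with_nan_check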
Since the model is refitted each time a new chunk of data is added, it is impossible to guess the optimal number of epochs. If I set a small number (1-3), the model underfits and the quality is poor; however, the fewer the epochs, the lower the chance that NaN loss occurs. If I set a large number of epochs (20), NaN loss is guaranteed to occur and the predictions are bad.
To Reproduce
Error message or code output
(validation_avg_epoch_loss=0.363)
after some time
2 epochs later (avg_epoch_loss=0.766)
Environment