NaN losses leads to overfitting #1796

Open
sevstafiev opened this issue Nov 25, 2021 · 5 comments
Labels
bug Something isn't working

Comments

@sevstafiev

Description

I am using a retail store dataset with about 3,000 time series. Their peculiarity is that there are "special days" when sales grow sharply (on average 1-2 sales per day, but about 80 on a sale day). As the number of epochs increases, the model eventually starts to produce NaN losses, overfits, and too often produces a forecast that looks like a "special day". After reading other issues, I decided that adding a validation dataset would help with overfitting since, as I understood it, validation comes with a built-in stopping mechanism. For validation, I take the last 60 days of the training range (matching context_length = 60) plus the 28 days after it (matching prediction_length = 28) for each time series. I use the 28 days after the validation window as a test set. This did help to track improvements and deteriorations in model quality. However, over time the model still starts to produce NaN losses and the validation quality drops significantly, so the problem is not solved. The model also does not stop training: it goes through all remaining batches and epochs while producing NaN losses. As a result, after training it gives excessively high predictions.
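The split described above can be sketched for a single series as follows. This is only a minimal illustration on a plain Python list, assuming the series ends with the test window; `split_series` is a hypothetical helper, not part of GluonTS, and the actual pipeline may cut the data differently.

```python
def split_series(target, context_length=60, prediction_length=28):
    """Split one series into train / validation / test slices.

    Hypothetical helper: the last `prediction_length` points are the test
    window, the `prediction_length` points before them are the validation
    targets, and the validation slice is prefixed with `context_length`
    points taken from the end of the training range.
    """
    n = len(target)
    test_start = n - prediction_length
    val_start = test_start - prediction_length
    if val_start < context_length:
        raise ValueError("series too short for the requested split")
    train = target[:val_start]
    validation = target[val_start - context_length:test_start]
    test = target[test_start:]
    return train, validation, test


# One year of daily data, values are just the day index here.
series = list(range(365))
train, validation, test = split_series(series)
print(len(train), len(validation), len(test))
```

The validation slice keeps 60 context days in front of its 28 target days, which mirrors the context_length and prediction_length reasoning in the description.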

In issue #833 the author was able to work around the problem by putting a conditional breakpoint in log_prob() to stop whenever a NaN value is generated. However, I do not understand how to do this, or how to get to log_prob() at all. If you can tell me how, that would also be a good solution to the problem.
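One generic way to get the effect of a conditional breakpoint is to wrap the function of interest so that it stops as soon as its result contains a NaN. The sketch below uses only the standard library and a toy stand-in for log_prob(); `raise_on_nan` and `toy_log_prob` are hypothetical names. Applying the wrapper to the real distribution class (for example by reassigning its log_prob method, and converting the result with something like `out.asnumpy().ravel()`) is an assumption that has not been tested here, and it would presumably only be useful with hybridize=False, where values can be inspected eagerly.

```python
import functools
import math


def raise_on_nan(fn, to_floats=lambda out: out):
    """Wrap `fn` so that a NaN anywhere in its result raises immediately.

    `to_floats` turns the result into an iterable of numbers; the default
    assumes the result is already a flat list of floats.
    """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        out = fn(*args, **kwargs)
        if any(math.isnan(float(v)) for v in to_floats(out)):
            # Swap this raise for breakpoint() to inspect `args` in a debugger.
            raise FloatingPointError(f"{fn.__name__} returned NaN")
        return out
    return wrapper


def toy_log_prob(mus):
    """Toy stand-in: log(mu) is undefined for mu <= 0, so return NaN there."""
    return [math.log(mu) if mu > 0 else float("nan") for mu in mus]


checked_log_prob = raise_on_nan(toy_log_prob)
checked_log_prob([1.0, 2.0])       # passes through unchanged
# checked_log_prob([1.0, -1.0])    # would raise FloatingPointError
```

Raising (or breaking) at the first NaN shows which inputs triggered it, instead of only seeing the "gave NaN loss" warning after the fact.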

Since the model is refit every time new data is added, it is impossible to guess the optimal number of epochs in advance. With few epochs (1-3), the model underfits and the quality is poor, although NaN losses are less likely to occur. With many epochs (20), NaN losses are guaranteed to occur and the predictions are bad.

To Reproduce

from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.distribution import NegativeBinomialOutput
from gluonts.mx.trainer import Trainer

# `device`, `h` (the forecast horizon) and `cardinality` are defined elsewhere.
trainer = Trainer(
    ctx=device,
    epochs=20,
    learning_rate_decay_factor=0.5,
    patience=3,
    minimum_learning_rate=0.001,
    clip_gradient=1.0,
    weight_decay=1e-08,
    learning_rate=0.01,
    hybridize=False,  # True changed nothing but training speed
    batch_size=32,
)

deepar_estimator = DeepAREstimator(
    freq="D",
    prediction_length=h,
    trainer=trainer,
    context_length=60,
    num_layers=2,
    num_cells=100,
    cell_type="lstm",
    dropout_rate=0.1,
    use_feat_dynamic_real=True,
    use_feat_static_cat=True,
    cardinality=cardinality,
    distr_output=NegativeBinomialOutput(),
)

Error message or code output

(validation_avg_epoch_loss=0.363)

 0%|          | 0/50 [00:00<?, ?it/s]
 40%|████      | 20/50 [00:10<00:15,  1.99it/s, epoch=3/20, avg_epoch_loss=0.485]
100%|██████████| 50/50 [00:23<00:00,  2.09it/s, epoch=3/20, avg_epoch_loss=0.442]

0it [00:00, ?it/s]
49it [00:10,  4.88it/s, epoch=3/20, validation_avg_epoch_loss=0.327]
123it [00:24,  4.93it/s, epoch=3/20, validation_avg_epoch_loss=0.363]

after some time

  0%|          | 0/50 [00:00<?, ?it/s]
 50%|█████     | 25/50 [00:10<00:10,  2.48it/s, epoch=4/20, avg_epoch_loss=0.522]
WARNING:gluonts.trainer:Batch [46] of Epoch[3] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [47] of Epoch[3] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [49] of Epoch[3] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [50] of Epoch[3] gave NaN loss and it will be ignored

2 epochs later (avg_epoch_loss=0.766)

 92%|█████████▏| 46/50 [00:10<00:00,  4.57it/s, epoch=6/20, avg_epoch_loss=0.766]
WARNING:gluonts.trainer:Batch [47] of Epoch[5] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [48] of Epoch[5] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [49] of Epoch[5] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [50] of Epoch[5] gave NaN loss and it will be ignored
100%|██████████| 50/50 [00:10<00:00,  4.63it/s, epoch=6/20, avg_epoch_loss=0.766]
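To see how the NaN batches spread over epochs, warnings like the ones above can be counted from a captured log. This is a small sketch assuming the message format shown in this output; `nan_batches_per_epoch` is a hypothetical helper. Note that the warnings appear to number epochs from zero, while the progress bar counts from one.

```python
import re
from collections import Counter

# Matches e.g. "Batch [46] of Epoch[3] gave NaN loss and it will be ignored"
NAN_WARNING = re.compile(r"Batch \[(\d+)\] of Epoch\[(\d+)\] gave NaN loss")


def nan_batches_per_epoch(log_text):
    """Return {epoch: number of batches reported with NaN loss}."""
    counts = Counter()
    for match in NAN_WARNING.finditer(log_text):
        counts[int(match.group(2))] += 1
    return dict(counts)


sample_log = """\
WARNING:gluonts.trainer:Batch [46] of Epoch[3] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [47] of Epoch[3] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [47] of Epoch[5] gave NaN loss and it will be ignored
"""
print(nan_batches_per_epoch(sample_log))  # {3: 2, 5: 1}
```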

Environment

  • Operating system: Linux [GCC 9.3.0]
  • Python version: 3.9.7
  • GluonTS version: 0.8.1
  • MXNet version: 1.8.0.post0
@sevstafiev sevstafiev added the bug Something isn't working label Nov 25, 2021
@mbohlkeschneider
Contributor

Hi @sevstafiev,

Thank you for raising this issue. One thing I noticed from your snippet is that the learning rate is rather high. I would suggest reducing it to 0.001 or even 0.0001. This often helps with the NaN loss problem.

@Alit10

Alit10 commented Jan 7, 2022

Hello,
In issue #833 this was solved by using the PyTorch implementation, so you could try that.
I had the same issue, and using a different distribution such as StudentTOutput solved the NaN problem.

@lostella
Contributor

lostella commented Feb 16, 2022

The following snippet appears to reproduce the issue quite consistently:

import pandas as pd
import numpy as np

def first_sunday_of_month_sale():
    idx = pd.date_range(start="2021-01-01", periods=365, freq="D")
    data = [np.random.randint(65, 95) if ts.weekday() == 6 and ts.day <= 7 else 0 for ts in idx]
    return pd.Series(data, index=idx)

series = first_sunday_of_month_sale()

from gluonts.dataset.common import ListDataset
from gluonts.mx.distribution import NegativeBinomialOutput
from gluonts.model.deepar import DeepAREstimator

dataset = ListDataset(
    data_iter=[{"start": series.index[0], "target": series.values}],
    freq="D",
)

deepar_estimator = DeepAREstimator(
    freq="D", 
    prediction_length=15,
    context_length=60, 
    distr_output=NegativeBinomialOutput(),
)

predictor = deepar_estimator.train(dataset)

Example output:

100%|██████████| 50/50 [00:04<00:00, 11.26it/s, epoch=1/100, avg_epoch_loss=1.25]
100%|██████████| 50/50 [00:04<00:00, 11.67it/s, epoch=2/100, avg_epoch_loss=0.35]
100%|██████████| 50/50 [00:03<00:00, 12.53it/s, epoch=3/100, avg_epoch_loss=0.298]
100%|██████████| 50/50 [00:03<00:00, 12.68it/s, epoch=4/100, avg_epoch_loss=0.283]
100%|██████████| 50/50 [00:03<00:00, 12.59it/s, epoch=5/100, avg_epoch_loss=0.274]
100%|██████████| 50/50 [00:04<00:00, 12.46it/s, epoch=6/100, avg_epoch_loss=0.273]
100%|██████████| 50/50 [00:03<00:00, 12.98it/s, epoch=7/100, avg_epoch_loss=0.265]
100%|██████████| 50/50 [00:03<00:00, 12.80it/s, epoch=8/100, avg_epoch_loss=0.262]
100%|██████████| 50/50 [00:03<00:00, 12.83it/s, epoch=9/100, avg_epoch_loss=0.253]
100%|██████████| 50/50 [00:03<00:00, 13.00it/s, epoch=10/100, avg_epoch_loss=0.25]
100%|██████████| 50/50 [00:04<00:00, 11.89it/s, epoch=11/100, avg_epoch_loss=0.241]
100%|██████████| 50/50 [00:03<00:00, 13.03it/s, epoch=12/100, avg_epoch_loss=0.236]
100%|██████████| 50/50 [00:03<00:00, 12.72it/s, epoch=13/100, avg_epoch_loss=0.226]
  0%|          | 0/50 [00:00<?, ?it/s]Batch [2] of Epoch[13] gave NaN loss and it will be ignored
Batch [6] of Epoch[13] gave NaN loss and it will be ignored
Batch [9] of Epoch[13] gave NaN loss and it will be ignored
Batch [11] of Epoch[13] gave NaN loss and it will be ignored
[...]

Edit: The PyTorch implementation indeed gives much more meaningful results:

from gluonts.torch.model.deepar import DeepAREstimator as TorchDeepAREstimator
from gluonts.torch.modules.distribution_output import NegativeBinomialOutput as TorchNegativeBinomialOutput

torch_deepar_estimator = TorchDeepAREstimator(
    freq="D", 
    prediction_length=15,
    context_length=60, 
    distr_output=TorchNegativeBinomialOutput(),
    trainer_kwargs=dict(max_epochs=10)
)

torch_predictor = torch_deepar_estimator.train(dataset)

Example result:

forecasts = list(torch_predictor.predict(dataset))
forecasts[0].plot()

[image: plot of the forecast]

The prediction is not perfect (the spike is on the second Sunday of the month, while training data displayed spikes on the first Sunday of the month) but definitely makes sense.

@lostella
Contributor

I think this suggests that the MXNet-based NegativeBinomial implementation has some problems, possibly with the way it's parametrized. Two things that could solve this:

  1. Fix the NegativeBinomialOutput class to output parameters in the right range (I'm not sure alpha should be anything positive, and why should mu not be allowed to be zero?)
  2. Update NegativeBinomial to use a different parametrization based on the failure count and logit, like the one from PyTorch.
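To make the two parametrizations concrete, here is a standard-library sketch of the negative binomial log-pmf written both ways. It assumes that the (mu, alpha) form has mean mu and variance mu + alpha * mu^2, and that the (total_count, logit) form follows the PyTorch convention with mean total_count * exp(logit). The function names are hypothetical, and this is not the code of either library.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_pmf_mu_alpha(k, mu, alpha):
    """Log-pmf with mean mu and variance mu + alpha * mu**2 (mu, alpha > 0)."""
    r = 1.0 / alpha
    return (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        - r * math.log1p(alpha * mu)
        + k * (math.log(alpha * mu) - math.log1p(alpha * mu))
    )


def log_pmf_count_logit(k, total_count, logit):
    """Log-pmf with failure count `total_count` and success logit `logit`."""
    return (
        math.lgamma(k + total_count) - math.lgamma(k + 1)
        - math.lgamma(total_count)
        - total_count * softplus(logit)       # total_count * log(1 - p)
        + k * (logit - softplus(logit))       # k * log(p)
    )


def to_count_logit(mu, alpha):
    """Map (mu, alpha) to the equivalent (total_count, logit)."""
    return 1.0 / alpha, math.log(alpha * mu)


total_count, logit = to_count_logit(mu=1.5, alpha=0.5)
for k in (0, 1, 80):
    print(k, log_pmf_mu_alpha(k, 1.5, 0.5),
          log_pmf_count_logit(k, total_count, logit))
```

Both functions describe the same distribution. The difference is that the second form stays finite for any real logit, so a mean arbitrarily close to zero just corresponds to a very negative logit, while the first form needs log(alpha * mu), which is undefined at mu = 0. Whether this is what produces the NaN losses in the MXNet implementation is not established here.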

@lostella
Contributor

@sevstafiev apologies for the late intervention here. The problem may have been fixed with #1893; feel free to try again on your data using the code from the master branch. I'd be curious to see whether that was the issue here.


4 participants