NaN losses leads to overfitting #1796

sevstafiev · 2021-11-25T14:56:59Z

Description

I am using a retail store dataset with ~3000 time series. Their peculiarity is that there are "special days" when sales grow sharply (on average 1-2 sales per day, but ~80 on a sale day). As the number of epochs increases, the model eventually starts to produce NaN losses, overfits, and too often forecasts as if it were a "special day".

Having studied the existing issues, I decided that adding a validation dataset would help with the overfitting since, as I understood it, validation also comes with a built-in stopping mechanism. For validation, I take, for each time series, the last 60 days of the training range (following context_length = 60) plus the 28 days after it (following prediction_length = 28). I use the 28 days after the validation range as a test set.

This really helped to track how the model quality improves and deteriorates. However, over time the model still starts to produce NaN losses and the quality on validation drops significantly, so the problem has not been solved. The model also does not stop training: it goes through all the remaining batches and epochs, giving NaN losses. As a result, after training, it gives excessively high predictions.
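The train/validation/test split described above can be sketched as follows. This is only an illustration of the windowing: the function name split_series is hypothetical, and it works on a plain list of daily values rather than on a GluonTS dataset.

```python
def split_series(target, context_length=60, prediction_length=28):
    """Split one series into train, validation, and test windows.

    train:      everything except the last 2 * prediction_length points
    validation: the last context_length points of train, plus the
                prediction_length points that follow them
    test:       the final prediction_length points
    """
    n = len(target)
    if n < context_length + 2 * prediction_length:
        raise ValueError("series is too short for this split")
    train_end = n - 2 * prediction_length
    val_end = n - prediction_length
    train = target[:train_end]
    validation = target[train_end - context_length:val_end]
    test = target[val_end:]
    return train, validation, test
```

When the windows are turned back into dataset entries, the start timestamp of the validation window would also have to be shifted by train_end - context_length days.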

In issue #833 the author was able to overcome the problem by putting a conditional breakpoint in log_prob() to stop whenever a NaN value is generated. But I did not understand how to do this, or how to get to log_prob() at all. If you can tell me how, that would also be a good solution to the problem.
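One generic way to "stop whenever a NaN value is generated" is to wrap the function of interest so that it raises as soon as its output contains a NaN. The sketch below uses a toy stand-in for log_prob; the names raise_on_nan, NaNDetected, and toy_log_prob are hypothetical and are not part of GluonTS.

```python
import functools
import math


class NaNDetected(FloatingPointError):
    """Raised when a wrapped function returns a NaN value."""


def raise_on_nan(fn, to_list=list):
    """Wrap fn so that a NaN anywhere in its output raises NaNDetected.

    to_list turns the output into a flat iterable of Python floats.
    """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        out = fn(*args, **kwargs)
        if any(math.isnan(v) for v in to_list(out)):
            # Swap the raise for breakpoint() to inspect args interactively.
            raise NaNDetected(f"{fn.__name__} returned NaN")
        return out
    return wrapper


# Toy stand-in for a distribution's log_prob (NOT the GluonTS code):
# it returns NaN whenever mu is not strictly positive.
def toy_log_prob(xs, mu):
    return [x * math.log(mu) - mu if mu > 0 else float("nan") for x in xs]


checked_log_prob = raise_on_nan(toy_log_prob)
```

With hybridize=False, the same wrapper could in principle be assigned over the log_prob method of the MXNet NegativeBinomial distribution class, with a to_list that converts the returned array to Python floats. That is an untested assumption about GluonTS 0.8.1, not something confirmed in this thread.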

Since the model is refit every time a new chunk of data is added, it is impossible to guess the optimal number of epochs. If I use few epochs (1-3), the model underfits and the quality is poor, although fewer epochs mean less chance of a NaN loss. If I use many epochs (20), a NaN loss is guaranteed to occur and the predictions will be bad.
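The kind of stopping rule that would remove the need to guess the number of epochs can be sketched in plain Python. This is a generic early-stopping illustration, not the GluonTS Trainer logic; EarlyStopper and epochs_run are hypothetical names.

```python
import math


class EarlyStopper:
    """Stop when validation loss is NaN or has not improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = math.inf
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if math.isnan(val_loss):
            return True  # a NaN loss is treated as divergence
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop."""
    stopper = EarlyStopper(patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return len(val_losses)
```

With such a rule, the epoch count becomes an upper bound rather than a value that has to be tuned for every refit.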

To Reproduce

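from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.distribution import NegativeBinomialOutput
from gluonts.mx.trainer import Trainer

# device, h, and cardinality are defined elsewhere (not shown in this report).
trainer = Trainer(
    ctx=device,
    epochs=20,
    learning_rate_decay_factor=0.5,
    patience=3,
    minimum_learning_rate=0.001,
    clip_gradient=1.0,
    weight_decay=1e-08,
    learning_rate=0.01,
    hybridize=False,  # True changed nothing but training speed
    batch_size=32,
)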

deepar_estimator = DeepAREstimator(
    freq="D", 
    prediction_length=h,
    trainer=trainer,
    context_length=60, 
    num_layers=2,
    num_cells=100,
    cell_type="lstm",
    dropout_rate=0.1,
    use_feat_dynamic_real=True,
    use_feat_static_cat=True,
    cardinality=cardinality,
    distr_output=NegativeBinomialOutput(),
)

Error message or code output

(validation_avg_epoch_loss=0.363)

 0%|          | 0/50 [00:00<?, ?it/s]
 40%|████      | 20/50 [00:10<00:15,  1.99it/s, epoch=3/20, avg_epoch_loss=0.485]
100%|██████████| 50/50 [00:23<00:00,  2.09it/s, epoch=3/20, avg_epoch_loss=0.442]

0it [00:00, ?it/s]
49it [00:10,  4.88it/s, epoch=3/20, validation_avg_epoch_loss=0.327]
123it [00:24,  4.93it/s, epoch=3/20, validation_avg_epoch_loss=0.363]

after some time

  0%|          | 0/50 [00:00<?, ?it/s]
 50%|█████     | 25/50 [00:10<00:10,  2.48it/s, epoch=4/20, avg_epoch_loss=0.522]
WARNING:gluonts.trainer:Batch [46] of Epoch[3] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [47] of Epoch[3] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [49] of Epoch[3] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [50] of Epoch[3] gave NaN loss and it will be ignored

2 epochs later (avg_epoch_loss=0.766)

 92%|█████████▏| 46/50 [00:10<00:00,  4.57it/s, epoch=6/20, avg_epoch_loss=0.766]
WARNING:gluonts.trainer:Batch [47] of Epoch[5] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [48] of Epoch[5] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [49] of Epoch[5] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [50] of Epoch[5] gave NaN loss and it will be ignored
100%|██████████| 50/50 [00:10<00:00,  4.63it/s, epoch=6/20, avg_epoch_loss=0.766]

Environment

Operating system: Linux [GCC 9.3.0]
Python version: 3.9.7
GluonTS version: 0.8.1
MXNet version: 1.8.0.post0

mbohlkeschneider · 2021-12-07T16:03:00Z

Hi @sevstafiev,

Thank you for raising this issue. One thing I noticed from your snippet is that the learning rate is rather high. I would suggest reducing it to 0.001 or even 0.0001. This often helps with the NaN loss problem.
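To see how this suggestion interacts with the other trainer settings from the report, here is a small sketch of a reduce-on-plateau schedule. It assumes that each reduction multiplies the rate by learning_rate_decay_factor and clamps it at minimum_learning_rate; that behaviour is an assumption, not something checked against the GluonTS source. The function name plateau_schedule and the floor of 5e-5 are hypothetical.

```python
def plateau_schedule(learning_rate, decay_factor, minimum_learning_rate, num_decays):
    """Learning rates after 0..num_decays plateau-triggered reductions."""
    rates = [learning_rate]
    for _ in range(num_decays):
        rates.append(max(rates[-1] * decay_factor, minimum_learning_rate))
    return rates


# Settings from the report: the rate starts at 0.01 and is clamped at 0.001.
reported = plateau_schedule(0.01, 0.5, 0.001, 5)

# Suggested starting rate of 0.001 with a hypothetical lower floor of 5e-5.
suggested = plateau_schedule(0.001, 0.5, 5e-5, 5)
```

Under this assumption, lowering learning_rate to 0.001 while keeping minimum_learning_rate=0.001 would leave no room for decay, so the floor would need to be lowered as well.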

Alit10 · 2022-01-07T17:57:11Z

Hello,
In issue #833 it was solved by using the PyTorch implementation; you could try that.
I had the same issue, and using a different distribution such as StudentTOutput solved the NaN problem.

lostella · 2022-02-16T11:29:47Z

The following snippet appears to reproduce the issue quite consistently:

import pandas as pd
import numpy as np

def first_sunday_of_month_sale():
    idx = pd.date_range(start="2021-01-01", periods=365, freq="D")
    data = [np.random.randint(65, 95) if ts.weekday() == 6 and ts.day <= 7 else 0 for ts in idx]
    return pd.Series(data, index=idx)

series = first_sunday_of_month_sale()

from gluonts.dataset.common import ListDataset
from gluonts.mx.distribution import NegativeBinomialOutput
from gluonts.model.deepar import DeepAREstimator

dataset = ListDataset(
    data_iter=[{"start": series.index[0], "target": series.values}],
    freq="D",
)

deepar_estimator = DeepAREstimator(
    freq="D", 
    prediction_length=15,
    context_length=60, 
    distr_output=NegativeBinomialOutput(),
)

predictor = deepar_estimator.train(dataset)

Example output:

100%|██████████| 50/50 [00:04<00:00, 11.26it/s, epoch=1/100, avg_epoch_loss=1.25]
100%|██████████| 50/50 [00:04<00:00, 11.67it/s, epoch=2/100, avg_epoch_loss=0.35]
100%|██████████| 50/50 [00:03<00:00, 12.53it/s, epoch=3/100, avg_epoch_loss=0.298]
100%|██████████| 50/50 [00:03<00:00, 12.68it/s, epoch=4/100, avg_epoch_loss=0.283]
100%|██████████| 50/50 [00:03<00:00, 12.59it/s, epoch=5/100, avg_epoch_loss=0.274]
100%|██████████| 50/50 [00:04<00:00, 12.46it/s, epoch=6/100, avg_epoch_loss=0.273]
100%|██████████| 50/50 [00:03<00:00, 12.98it/s, epoch=7/100, avg_epoch_loss=0.265]
100%|██████████| 50/50 [00:03<00:00, 12.80it/s, epoch=8/100, avg_epoch_loss=0.262]
100%|██████████| 50/50 [00:03<00:00, 12.83it/s, epoch=9/100, avg_epoch_loss=0.253]
100%|██████████| 50/50 [00:03<00:00, 13.00it/s, epoch=10/100, avg_epoch_loss=0.25]
100%|██████████| 50/50 [00:04<00:00, 11.89it/s, epoch=11/100, avg_epoch_loss=0.241]
100%|██████████| 50/50 [00:03<00:00, 13.03it/s, epoch=12/100, avg_epoch_loss=0.236]
100%|██████████| 50/50 [00:03<00:00, 12.72it/s, epoch=13/100, avg_epoch_loss=0.226]
  0%|          | 0/50 [00:00<?, ?it/s]Batch [2] of Epoch[13] gave NaN loss and it will be ignored
Batch [6] of Epoch[13] gave NaN loss and it will be ignored
Batch [9] of Epoch[13] gave NaN loss and it will be ignored
Batch [11] of Epoch[13] gave NaN loss and it will be ignored
[...]

Edit: The PyTorch implementation indeed gives much more meaningful results:

from gluonts.torch.model.deepar import DeepAREstimator as TorchDeepAREstimator
from gluonts.torch.modules.distribution_output import NegativeBinomialOutput as TorchNegativeBinomialOutput

torch_deepar_estimator = TorchDeepAREstimator(
    freq="D", 
    prediction_length=15,
    context_length=60, 
    distr_output=TorchNegativeBinomialOutput(),
    trainer_kwargs=dict(max_epochs=10)
)

torch_predictor = torch_deepar_estimator.train(dataset)

Example result:

forecasts = list(torch_predictor.predict(dataset))
forecasts[0].plot()

The prediction is not perfect (the spike is on the second Sunday of the month, while training data displayed spikes on the first Sunday of the month) but definitely makes sense.

lostella · 2022-02-16T12:15:02Z

I think this suggests that the MXNet-based NegativeBinomial implementation has some problems, possibly with the way it's parametrized. Two things that could solve this:

Fix the NegativeBinomialOutput class to output parameters in the right range (I'm not sure alpha should be anything positive, and why should mu not be allowed to be zero?)
Update NegativeBinomial to use a different parametrization based on the failure count and logit, like the one from PyTorch.
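For reference, the two parametrizations mentioned above describe the same distribution whenever mu > 0 and alpha > 0. The sketch below assumes the (mu, alpha) form has variance mu + alpha * mu^2 and that the failure count and logit form has mean total_count * exp(logit). It is a plain-Python illustration with hypothetical function names, not the GluonTS or PyTorch code.

```python
import math


def to_count_logit(mu, alpha):
    """Map (mean mu > 0, dispersion alpha > 0) to (total_count, logit)."""
    return 1.0 / alpha, math.log(mu * alpha)


def log_sigmoid(z):
    # Numerically stable log(1 / (1 + exp(-z))).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def nb_log_pmf_mu_alpha(k, mu, alpha):
    """Log pmf of count k under the (mu, alpha) parametrization."""
    r = 1.0 / alpha
    return (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(1.0 / (1.0 + alpha * mu))
        + k * math.log(alpha * mu / (1.0 + alpha * mu))
    )


def nb_log_pmf_count_logit(k, total_count, logit):
    """Log pmf of count k under the (total_count, logit) parametrization."""
    return (
        math.lgamma(k + total_count) - math.lgamma(k + 1) - math.lgamma(total_count)
        + total_count * log_sigmoid(-logit)
        + k * log_sigmoid(logit)
    )
```

Note that the mapping itself breaks down at mu = 0, where log(mu * alpha) is undefined, which is one reason the allowed parameter range matters.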

lostella · 2022-02-17T13:11:52Z

@sevstafiev apologies for the late intervention here. The problem may have been fixed with #1893: feel free to try again on your data using the code from the master branch. I'd be curious to see whether that was the issue here.

sevstafiev added the bug label on Nov 25, 2021.

This issue was referenced on Feb 16, 2022 by:

Change negative binomial parametrization to failure count and log-odds #1890 (closed)
Fix negative binomial parameter map #1893 (merged)

lostella mentioned this issue on Feb 17, 2022 in:

Getting NaN loss after some time with DeepAR and NegativeBinomial #833 (open)
