DeepAR Training failing due to invalid values in degrees of freedom (df) tensor. What is producing this behaviour? #3044
Replies: 4 comments 1 reply
-
Update

I'm still trying to figure out what is going on here. I've found an iteration that seems to trigger this issue and have managed to step through it a number of times, and I've narrowed down the problem. The input data are not problematic: they contain no invalid values. Specifically, the issue traces to line 251 of deepar/module.py. This line is within the method whose return value carries the distribution parameters.
The first part of this return (i.e. the distribution parameters) is where the problematic values show up.

Following on from line 317, at line 327 these parameters are used to construct the output distribution, and that is where its argument constraints appear to be validated. In my case, the distribution is a StudentT distribution, and the constraint seems to get checked when the distribution object is created. So the root of the problem isn't really the constraint check itself; whatever produces the parameters upstream is emitting invalid values.

Questions
In this case, nothing in the inputs appears to be abnormal or odd or problematic. All 16 such tensors in this case are of the same type, all seem normal, and there is nothing obviously problematic about the values in any of them. So how do seemingly normal inputs end up producing invalid distribution parameters?

I guess this is probably more of a conceptual question, but the answer isn't obvious (to me).
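To make the mechanism concrete, the constraint failure can be reproduced in isolation with plain torch.distributions, outside any gluonts code: with argument validation enabled, a `df` tensor containing nan fails the positivity check the moment the distribution is constructed. This is only a sketch of the check itself, not of the DeepAR code path that produces `df`.

```python
import torch
from torch.distributions import StudentT

# A df tensor with one nan entry, standing in for whatever the network head emitted.
df = torch.tensor([3.0, float("nan"), 5.0])
loc = torch.zeros(3)
scale = torch.ones(3)

try:
    # With validate_args=True, torch checks df against its GreaterThan(0) constraint
    # at construction time; nan fails that check (nan > 0 is False) and raises ValueError.
    StudentT(df=df, loc=loc, scale=scale, validate_args=True)
except ValueError as err:
    print(err)
```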
-
Running into the same issue.
-
Running into the same issue with the iTransformer model. The problem happens when using a GPU and vanishes when only a CPU is available. By the way, the same code does not work on TPU either, but it fails there with a different error.
-
@Serendipity31, @santoshkumarradha, did you manage to solve the issue? Vanishing/exploding gradients can hardly be the reason, because a toy dataset with constant values for every item_id works well. The different behaviour on CPU, GPU and TPU also looks strange for a gradient problem.
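For reference, a constant-valued toy dataset of the kind mentioned above can be built along these lines; the start date, series length and frequency are arbitrary illustrative choices, not taken from the thread.

```python
import numpy as np
from gluonts.dataset.common import ListDataset

# Three items, each with a constant target value, as a sanity-check dataset.
toy_dataset = ListDataset(
    [
        {"start": "2020-01-01", "target": np.full(200, float(value))}
        for value in (1.0, 5.0, 10.0)
    ],
    freq="D",
)
```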
-
The Situation
I'm trying to tune DeepAREstimator hyperparameters, and every so often the call to `estimator.train()` fails in the middle of training due to a ValueError related to the degrees of freedom tensor required by the output distribution. I have spent a long time following the traceback to figure out where this tensor gets created and how it could end up populated with nan values, and I'm essentially none the wiser.

I can find code that takes this tensor (`df`) as an input, but not where that input is created for the first time (or why it is a tensor and not an integer). There are `@classmethod` functions within the distribution classes, named `domain_map()`, that modify this `df`, but I cannot see any place where `domain_map()` is called.

Can anyone please help??
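For what it's worth, here is a sketch of why a `domain_map()` of the softplus flavour (which is what I believe these classmethods apply to `df`; treat that as an assumption, the exact form depends on the gluonts version) would not stop invalid values: nan passes straight through softplus, so the first thing to complain is the constraint check when the distribution is constructed.

```python
import torch
import torch.nn.functional as F

# Raw "df" values as a network head might emit them, including a nan.
raw_df = torch.tensor([0.5, -2.0, float("nan")])

# A softplus-style domain map keeps finite raw values strictly positive...
mapped_df = 2.0 + F.softplus(raw_df)
print(mapped_df)  # ≈ tensor([2.9741, 2.1269, nan])

# ...but nan propagates through unchanged, so the positivity constraint on df
# still fails later when the distribution is constructed from it.
print(torch.isnan(mapped_df))
```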
The Details
The ValueError
The full traceback is at the bottom of this post. Here is the ValueError on its own.
Background
The following may be useful pieces of context:
- When training fails due to this error, it fails at some 'middle' epoch (e.g. 10 or 52 or 75 or 113), not at the very start.
- The dimensions of this tensor match `batch_size`, and I've seen this error triggered with batch sizes both smaller and larger than the example shown above (which happens to be for `batch_size=72`).
- I've experienced this error both when training a DeepAREstimator on its own and within the context of hyperparameter tuning with Optuna. I don't think it has anything to do with Optuna, but when it happens during tuning I have easy access to logs that show `train_loss` and `val_loss`. When this error happens, they are real numbers and they do not seem problematic (e.g. each might be somewhere between 1 and 2 and reasonably close to each other; not super small or super massive).
- My target series consist of real numbers, but they do have missing values, sometimes quite a few in a row. If gluonTS does not implement a check to ensure that each selected window of length `batch_length` has at least one 'real' observation, it's possible that a batch within a particular target series and epoch could exclusively consist of `nan` entries. It's possible this is relevant to this error (see the sketch after this list).
- I use the default `DummyValueImputation(dummy_value=0.0)` as the imputation method.
- Training does generate a RuntimeWarning (`RuntimeWarning: invalid value encountered in cast` from `value = np.asarray(data[self.field], dtype=self.dtype)`). I discuss this here: gluonts/issues/3025. I am able to reproduce this warning (scroll to the bottom of that thread), but have not gotten to the bottom of why it is happening. However, this RuntimeWarning happens every time I call `estimator.train()`, so these don't seem to be related.
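Regarding the missing-values point above (the sketch referenced in that bullet): one way to judge whether a training window could consist entirely of imputed values is to measure the longest run of consecutive nan entries in each target series and compare it to the window length. This is a standalone diagnostic with made-up names and data, not gluonts API.

```python
import numpy as np

def longest_nan_run(target: np.ndarray) -> int:
    """Length of the longest run of consecutive nan values in a 1-D series."""
    longest = current = 0
    for is_nan in np.isnan(target):
        current = current + 1 if is_nan else 0
        longest = max(longest, current)
    return longest

# Hypothetical usage on gluonts-style entries; window_length stands in for the
# context/prediction window the estimator actually samples from each series.
dataset = [{"target": np.array([1.0, np.nan, np.nan, np.nan, 2.5, np.nan, 3.0])}]
window_length = 3
for i, entry in enumerate(dataset):
    run = longest_nan_run(np.asarray(entry["target"], dtype=float))
    if run >= window_length:
        print(f"series {i}: longest nan run {run} >= window_length {window_length}")
```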
Example
Unfortunately, I don't have a reproducible example to provide. There are two reasons for this:
- I have not found a way to force it to generate the error with a reproducible example, partially because I cannot see how `df` is created, which would allow me to create a `df` tensor that directly triggers this error in a small reproducible example.
- My training efforts with my real data have been interrupted several times by this error, but even when using `from lightning.pytorch import seed_everything` to set seeds for numpy, torch and python.random, I cannot actually reproduce it on command. On one occasion I managed to reproduce it a single time by re-running an Optuna study. However, afterwards I took that same dataset and the same hyperparameter values and trained the estimator outside of Optuna, and it did not produce the error (even using the same seed). I have also re-run an Optuna study after this error happened and had the study run to completion without changing anything. It seems like there is some element of pseudo-randomness that I'm not managing to control, and this is making it harder to isolate the issue (see the sketch at the end of this section).

Even so, I've included an example of an estimator setup that did trigger this error recently. Although it's not reproducible, it shows the relevant settings.
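On the reproducibility point (the sketch mentioned in the second reason above): beyond `seed_everything`, a couple of extra knobs control the remaining nondeterminism, particularly on GPU. The `trainer_kwargs` line is an assumption about how the estimator forwards options to the Lightning Trainer; check your gluonts version.

```python
import torch
from lightning.pytorch import seed_everything

# Seeds python.random, numpy and torch; workers=True also seeds dataloader
# worker processes, which are an easy source of leftover randomness.
seed_everything(42, workers=True)

# Ask torch to warn (or fail) when a nondeterministic kernel would be used,
# which helps narrow down GPU-side nondeterminism.
torch.use_deterministic_algorithms(True, warn_only=True)

# If the estimator forwards trainer_kwargs to the Lightning Trainer, determinism
# can be requested there too (assumption about the constructor, not verified here):
# estimator = DeepAREstimator(..., trainer_kwargs={"deterministic": True})
```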
Full Traceback
Here is the full traceback