model.evaluate() gives a different loss on training data from the one in training process #6977
Comments
The same problem happens for me...
It's due to the dropout layers. During the training phase neurons are dropped, whereas during prediction all neurons remain in the network, so it's quite likely that the results will differ. Edit: batch normalization layers also influence the results.
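For illustration (not part of the original comment), here is a minimal sketch of that effect using a made-up toy model: in the inference phase dropout is a no-op, while forcing the training phase keeps it active.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout

# Toy model: dropout only perturbs outputs in the training phase.
model = Sequential([
    Dense(64, activation='relu', input_shape=(10,)),
    Dropout(0.5),
    Dense(1),
])
model.compile(optimizer='adam', loss='mse')

x = np.random.rand(4, 10).astype('float32')
print(model.predict(x))           # inference phase: dropout disabled, deterministic
# With tf.keras 2.x you can force the training phase explicitly:
# print(model(x, training=True))  # dropout active, output changes run to run
```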
Regarding the fact that both losses are quite different, it looks like your model structure does not fit the problem well.
Even without dropout layers and batch normalization, the same issue continues for me. I don't agree that the problem is caused by the model structure, because the training and the test data are the same.
How large is the difference in your case?
I use only one batch. In training, the final loss (mse) is 0.045. Evaluating with the training data gives 1.14.
That's strange. Did you try to use a different dataset?
#6895 I have a similar problem and even tried with a public data set. I was doing fine tuning.
I had an issue like that one, and the solution for me was very simple. I was evaluating using the train data and the accuracy was quite different from the one reported while training. When evaluating, I had swapped the dims of the input images: height was width and width was height (silly me).
Hi guys, other than dropout, batch norm also causes the same problem. I suspect that this is caused by the fact that the number of samples used in batch norm after activation is 200 (the batch size) at training time, whereas it is only 1 at test time. This causes different normalization and a different loss. What are your thoughts?
#6895 Yes, I just encountered that problem with Resnet50.
I'm running into the same problem. When I create learning curves from fit metrics, train and test look unrealistically different. As an experiment, I tried calculating my own metrics:

```python
from keras.callbacks import Callback

class SecondOpinion(Callback):
    def __init__(self, model, x_train, y_train, x_test, y_test):
        self.model = model
        self.x_train = x_train
        self.y_train = y_train
        self.x_test = x_test
        self.y_test = y_test

    def on_epoch_end(self, epoch, logs={}):
        # Recompute MSE with the weights at the end of the epoch (learning phase off)
        y_train_pred = self.model.predict(self.x_train)
        y_test_pred = self.model.predict(self.x_test)
        mse_train = ((y_train_pred - self.y_train) ** 2).mean()
        mse_test = ((y_test_pred - self.y_test) ** 2).mean()
        print("\n Second Opinion loss: %5.4f - val_loss: %5.4f" % (mse_train, mse_test))

...

model.compile(
    loss='mean_squared_error',
    optimizer=adam
)
second_opinion = SecondOpinion(model, data.x_train, data.y_train, data.x_test, data.y_test)
model.fit(
    x=data.x_train,
    y=data.y_train,
    validation_data=(data.x_test, data.y_test),
    batch_size=200,
    epochs=200,
    callbacks=[second_opinion]
)
```
With batch normalization and dropout included, train loss is very different (~3x). Validation losses are different, but not substantially.
With batch normalization and dropout removed, loss is somewhat different and val_loss matches.
I'm not schooled enough to know whether these differences are intentional in Keras or not. Anyone?
I am new to Keras so maybe this is expected behaviour but I can't find it documented in
Loading weights from the file again after Result of initial evaluate:
Now train one step:
Now run same evaluate call again:
The code to produce this is:
I am using Keras (2.1.4) installed by pip on macOS 10.13.4. This version of Keras prints a ton of deprecation warnings (from TensorFlow, I think) which I have omitted from the output for clarity, but if you see them it is not a problem with the code.
I'm still in over my head here, but here's how things appear to me. Can anyone confirm I'm on the right track? This is all tied to learning_phase (see https://keras.io/backend/) and loss/metric estimation based on batches. Dropout is only active when the learning_phase is set to training; otherwise it should be ignored. It's unclear to me whether BatchNormalization is active when the learning_phase is test. Batching presumes each batch can represent the entire data set; if the data is heavily skewed or batches aren't well randomized, I can imagine this will magnify the differences between losses from fit vs. predict. It seems to me that learning curves are more correct when losses and metrics are evaluated with learning_phase set to test and applied across all batches. I can imagine this is not done during fit because it is computationally expensive.
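As a concrete illustration of the learning_phase switch (a sketch added here, not from the original comment; it assumes a multi-backend Keras 2.x model `model` and a batch `x_batch` already exist):

```python
import numpy as np
from keras import backend as K

# Build a backend function whose second input is the learning phase flag.
get_output = K.function([model.input, K.learning_phase()], [model.output])

out_train = get_output([x_batch, 1])[0]  # learning_phase = 1: dropout active, BN uses batch stats
out_test = get_output([x_batch, 0])[0]   # learning_phase = 0: dropout off, BN uses moving stats
print(np.abs(out_train - out_test).mean())
```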
I'm seeing the same problem.
I have the same problem.

```
Epoch 29/30
5760/5760 [==============================] - 4s 644us/step - loss: 0.0163 - acc: 0.9932 - val_loss: 0.0296 - val_acc: 0.9875
Epoch 30/30
5760/5760 [==============================] - 4s 641us/step - loss: 0.0165 - acc: 0.9925 - val_loss: 0.0318 - val_acc: 0.9875
```

Evaluating on test data:

```
1712/1712 [==============================] - 0s 236us/step
$loss [1] 0.329597
$acc [1] 0.9281542
```

There is a huge difference between train/validation loss and test loss.
I am having the same issue. I train a model, save the weights, load the model. The resulting evaluation call is giving results that change each time.
I, too, have the same issue. I was training a DenseNet121 with all layers frozen except the last 1 or 2.
I ran
I'm planning to drop Keras and move to TF.
I am facing the same issue... trying to fine-tune inception_v3. I added two Dense layers and set all other Inception layers trainable=False. So, without any dropout layers, I'm getting completely different metrics for the training data during training and evaluation!! print(model.metrics_names, model.evaluate_generator(train_gen), model.evaluate_generator(val_gen)) As none of the Inception layers are being trained, the batch norm layers should use the default mean and std dev and hence shouldn't give different results in the training and evaluation phases!
Has anyone solved this? I'm having the same issue. model.evaluate gives completely different results compared to model.fit (learning rate set to zero). I don't use a dropout layer. I tried playing with the batch norm layers' "trainable" parameter but got similar performance.
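For anyone experimenting along these lines, here is a hedged sketch (not from the original post) of freezing the BatchNormalization layers; note that only in tf.keras 2.x does trainable=False also switch BN to inference mode (moving statistics), so behaviour differs across versions:

```python
from tensorflow import keras

# Assumes `model` already exists, e.g. a fine-tuning setup on a pretrained backbone.
for layer in model.layers:
    if isinstance(layer, keras.layers.BatchNormalization):
        layer.trainable = False  # tf.keras 2.x: also forces inference-mode BN

# Recompile so the new trainable flags take effect.
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```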
I am having the same problem as well. In my case, I am trying to reuse the pre-trained Keras ResNet50 model and add my own last few layers. I got very large differences between .fit and .evaluate using the same training data. When I look at the prediction result using the training data, it's clear that .evaluate gives the right loss/accuracy. Does anyone have any ideas? I don't believe the batchnorm/dropout layer is the reason here. Below are my differences: From evaluate with the same training data:
Hello everyone, here is the official Keras answer to this question. Even without dropout or batch normalization, the problem will persist. The reason for this is that when you use
To sum everything up,
Hey j0bby, Thanks for your reply.
Hello @ub216, Here is how to have the same output from
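The snippet that originally accompanied this comment is not preserved in the thread; a minimal sketch of the kind of check being described could look like this (assuming `model`, `x_batch`, and `y_batch` already exist and the model is compiled with an MSE loss):

```python
import numpy as np

# With the learning phase off, these three numbers should agree on the same batch.
loss_eval = model.evaluate(x_batch, y_batch, batch_size=len(x_batch), verbose=0)
loss_batch = model.test_on_batch(x_batch, y_batch)
y_pred = model.predict(x_batch, batch_size=len(x_batch))
loss_manual = np.mean((y_pred - y_batch) ** 2)

print(loss_eval, loss_batch, loss_manual)
```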
About @shunjiangxu's results: the two methods return different results as expected. However, in that case, the evaluate method, which is expected to have better results, is performing worse. This can have different explanations depending on the hyperparameters of the model and the training.
Hi @j0bby, Thanks for trying to get to the bottom of this. If you run your code with the order of commands reversed, do you still get matching results? Like this:
In my example above from the 19th of Feb, running
Hey @j0bby
I get:
I checked the difference in weights before and after executing the command, but the weights haven't changed! Any ideas/pointers on why this discrepancy occurs? @mikowals I had the same issue with this model as well. I'm trying to figure this out first, but maybe they are related.
Thanks for following this up. I am trying to dig out the reason for this. For my case, .evaluate gave a much, much worse [loss, accuracy] result than .fit. I am trying to use a VGG16 model instead of the ResNet50.
@shunjiangxu
@ub216 All right, sorry, I did not understand correctly. I was searching for a Keras callback function/parameter to print out the .fit output but can't seem to find one. The only way seems to be running .evaluate from on_batch_end/on_epoch_end, but that is not really what .fit has calculated. Does anyone know if a callback can get the .fit 'prediction' output?
Has anyone solved the problem yet? I'm facing the same problem here... Here are the results:
epoch=281, loss=16.09882 max_margin_loss=15.543743 ortho_loss=0.5550766
It might be the problem I have described in the Stack Overflow post:
Hey guys, I found an easy solution which works at least in my case (my model has a Dropout layer but no BatchNormalization layer), thanks to OverLordGoldDragon in the links here and here. The easy fix for me is to set the Keras learning phase to 0 before building and initializing my model:
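The original snippet is not preserved here; a minimal sketch of that fix, assuming the multi-backend Keras 2.x API, would be:

```python
import keras.backend as K

# Force the inference phase globally (dropout disabled) before the model is built.
# Note this also disables dropout during training, which is presumably why the
# numbers then line up.
K.set_learning_phase(0)

# ... build, compile and train the model after this call ...
```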
Now the four results (model.train_on_batch, model.evaluate, model.predict, model.test_on_batch) are all as expected. Below is the experiment output:
@Osdel Why does your change fix your problem?
I just removed the dropout and have no problem any more.
Looking into callbacks, I think model.evaluate returns the final loss and accuracy, while the verbose output prints the average loss and accuracy of the epoch as stored in the logs.
If we need the final performance on the training set, we could use a callback to store it, which is slow, but I don't know if there is any option to force the logs to record the final performance rather than the average.
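A sketch of that callback idea (added for illustration, not from the original comment): re-evaluate the training data with the end-of-epoch weights instead of relying on fit()'s running average. It assumes in-memory arrays `x_train`/`y_train` and a model compiled with at least one metric.

```python
from keras.callbacks import Callback

class EndOfEpochEval(Callback):
    def __init__(self, x_train, y_train):
        super(EndOfEpochEval, self).__init__()
        self.x_train = x_train
        self.y_train = y_train

    def on_epoch_end(self, epoch, logs=None):
        # self.model is attached by Keras when the callback is passed to fit().
        results = self.model.evaluate(self.x_train, self.y_train, verbose=0)
        print("\nend-of-epoch evaluate:", dict(zip(self.model.metrics_names, results)))
```

Passing EndOfEpochEval(x_train, y_train) in callbacks=[...] should then print numbers that match a separate model.evaluate(x_train, y_train) call made right after that epoch.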
Same problem. I do not have dropout layers, but there are a few batch normalizations. The difference between the validation accuracy while training and after evaluating is around 15%.
I was saving the best model while training. So, the best epoch was:
But the evaluation on the validation subset is:
Even if there is an influence of batch normalisation, is it okay to have that much improvement? P.S. The test accuracy is also around 95%, so I think this number is pretty representative. I'm just confused by the difference.
I just hit this problem with sparse_categorical_accuracy. I believe that whatever is reported during training is totally wrong. I ran model.evaluate on the same train_ds and obtained an answer that agrees with y_pred = model.predict(...) followed by explicitly computing the metric from y_pred and y. In my case, the sparse_categorical_accuracy during training is way better than it should be. Looks like this issue has been open for a long time... not sure if anyone knows the answer.
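For reference, a sketch of that cross-check (illustrative only; it assumes in-memory arrays `x_train`/`y_train` with integer labels rather than a tf.data dataset):

```python
import numpy as np

# evaluate() with the trained weights...
eval_results = model.evaluate(x_train, y_train, verbose=0)
print("evaluate:", dict(zip(model.metrics_names, eval_results)))

# ...versus the metric computed explicitly from predict().
y_pred = model.predict(x_train)
manual_acc = np.mean(np.argmax(y_pred, axis=-1) == y_train.reshape(-1))
print("manual sparse_categorical_accuracy:", manual_acc)
```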
Hi everyone - if you use a different batch_size for .fit() and .evaluate(), you will get a different loss.
I came to understand my specific case: it has to do with the presence of a batch norm layer, which can lead to a different "prediction" during training vs. evaluation. For a simple model, if I remove any batch norm and use full-batch GD, the train set metrics are exactly the same during training and evaluation.
@emerygoossens
It is due to dropout.
Hi, I'm experiencing an issue with batch normalization layers and wondering if anyone has a bit of helpful insight. My issue is described in full on Stack Overflow here. In that post there's also demo code you can run and a complete dataset you can download. There have been no direct answers yet, but one commenter was able to confirm that the issue is caused by batch normalization layers. I'm not completely sure that this issue is connected to mine, but I have been suspecting it.
I had a similar problem with BatchNormalization, and after lots of investigation, and the help of a friend, he pointed out that the issue can be caused by the BatchNormalization moving-average momentum being set very close to 1 (i.e., a very small update rate) -- typically 0.999 in Keras Applications -- so the moving mean/variance adapt extremely slowly. The evaluation would then use garbage means/variances (while training uses the in-batch mean/variance). The fix was to introspect the model and change the momentum of all BatchNormalization layers -- see the example code in the Stack Overflow post mentioned in the comment above. A nice fix would be for the BatchNormalization layer to use a plain mean up to a certain number of examples, and only afterwards start using the moving average.
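A hedged sketch of that workaround (not the original poster's code; the 0.9 value is illustrative and `model` is assumed to exist):

```python
import keras

# Lower the BatchNormalization momentum so the moving mean/variance adapt faster.
for layer in model.layers:
    if isinstance(layer, keras.layers.BatchNormalization):
        layer.momentum = 0.9

# Depending on the Keras/TF version, you may need to recompile (or rebuild the
# model from its config) for the change to take effect.
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```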
EDIT - If you've got this far on this thread, don't be stubborn/lazy like me (pre-edit below). Just do the batch norm thing. It cleaned everything up for me. The code is here. Thanks @mgroth0 What a journey this ticket has been. Regular PyTorch user here trying to retrain a model in TensorFlow. I find this totally bizarre as I've trained this model 100 times on PyTorch. I use the same datagen for train and val and this is what the progbar printout looks like
Somehow my val accuracy is way off train accuracy and takes a nose dive after a few epochs... I would look into the batch norm thing, but it seems so wrong that it can't be right. |
Issue: model.evaluate(val_data) gives an abysmal result compared to val_accuracy during
Description: I have custom image data divided into train/val/test sets. For loading the data, I'm using
Solution: Load the model from a checkpoint with
After this,
Issue: The output layer for the last epoch of
Testing: I observed that
What I obtain is that the output for what
However, the separate output for the validation is identical to a post-fit application of
Solution: This is most probably due to a last back propagation between the last training and the last validation of
It has nothing to do with BackPropagation: I tried both using it or not.
I'm not sure why this is closed when people are still having issues. Here is a good example: we all understand that model.fit() works slightly differently from model.evaluate() because of how it updates weights between batches. But there still should be no reason why model.evaluate would produce a training loss (for example MAE) that is an order of magnitude higher on the same X_train dataset.
@Raverss that solution unfortunately did not work. I'm not sure why it did for you, but loading the model from a checkpoint or from a file created by
@alexander-soare this problem occurs even when there are NO batch normalization layers. So your proposed solution makes sense, but how do you change the behavior of a layer that's not even in the model???
@gitpeblo Same observations as you. The result of the
@BrianHuf It does seem that setting
Has anyone successfully resolved this?
Hi, if you are using |
Turning shuffling on/off shouldn't change the loss or accuracy. I agree that turning shuffling off for the testing data is a sensible thing to do.
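Where it does matter is when predictions are later compared against labels pulled separately from a generator; a hedged sketch (paths and sizes are made up, and `model` is assumed to exist):

```python
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1. / 255)

val_gen = datagen.flow_from_directory(
    'data/val',               # hypothetical directory
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical',
    shuffle=False,            # keeps predict_generator output aligned with val_gen.classes
)

print(model.evaluate_generator(val_gen))
```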
I'm implementing a CNN model. When I have just a few layers, it works well, but when I tried a deeper network I could achieve a high performance (a small loss reported during the training process) on the training data, while model.evaluate() on the same training data gives a poor performance (a much greater loss). I wonder why this happens, since the evaluation is on the training data in both cases.
Here is what I got:
The log during training:
When I evaluate on training data:
On validation data:
Could someone help me? Thanks a lot.