Standardization of test data in Lab 6 should use training mean and standard deviation #11

covuworie · 2018-07-21T15:50:05Z

Observed behavior

Hi, there are bugs in classification-and-pca-lab.ipynb for Lab 6 in the do_classify and classify_from_dataframe functions. When standardizing the testing data, the testing data's own mean and standard deviation are used. This is incorrect for several reasons, such as:

- No information from the testing data should be used in the model prediction, as it is a form of data snooping. The testing dataset has been contaminated by this.
- The training and testing sets are transformed with different parameters, so the same standardized variable is not being created for both.
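To illustrate the second point, here is a minimal sketch on made-up numbers (the arrays and the `zscore` helper are hypothetical, not from the lab). NumPy arrays stand in for the notebook's DataFrame, and `ddof=1` is used to match the default of pandas' `.std()`. The raw value 5.0 appears in both splits, but it only maps to the same standardized value when the training statistics are reused.

```python
import numpy as np

# Made-up one-feature data; the raw value 5.0 appears in both splits
train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
test = np.array([5.0, 6.0, 7.0])

def zscore(x, mean, std):
    """Standardize x with the given mean and standard deviation."""
    return (x - mean) / std

train_z = zscore(train, train.mean(), train.std(ddof=1))

# Reported behavior: the test split is standardized with its own statistics
test_z_own = zscore(test, test.mean(), test.std(ddof=1))

# Proposed behavior: the test split reuses the training statistics
test_z_train = zscore(test, train.mean(), train.std(ddof=1))

print("5.0 in train       ->", train_z[-1])
print("5.0 in test (own)  ->", test_z_own[0])
print("5.0 in test (train)->", test_z_train[0])
```

With the test split's own statistics, the same raw value lands on a different standardized value than it does in the training split, so a model fitted on the training features sees a differently scaled variable at test time.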

Expected behavior

The training data mean and standard deviation should be used for standardizing the testing data like so:

dftest=(subdf.iloc[itest] - subdf.iloc[itrain].mean())/subdf.iloc[itrain].std()

Xte = (subdf.iloc[itest] - subdf.iloc[itrain].mean())/subdf.iloc[itrain].std()

I think this was mentioned in one of the earlier lectures, and here are some more references:

- https://stats.stackexchange.com/questions/202287/why-standardization-of-the-testing-set-has-to-be-performed-with-the-mean-and-sd
- https://sebastianraschka.com/faq/docs/scale-training-test.html
- https://www.researchgate.net/post/If_I_used_data_normalization_x-meanx_stdx_for_training_data_would_I_use_train_Mean_and_Standard_Deviation_to_normalize_test_data

pavlosprotopapas · 2018-07-21T21:49:07Z

Is it true though? I have never managed to convince myself that it is wrong to use the mean and std of the whole dataset. It is obvious that we should not use the test set mean and std, but I have never managed to prove that using the whole dataset is harmful (and I have never seen a proof anywhere). It seems to be an accepted precaution. On the contrary, I have many examples where normalizing/standardizing on the train set and applying that to the rest leads to many problems. Think of a large dataset, where train (and test) are just a small subset. Pavlos

covuworie · 2018-08-01T21:35:44Z

Hi Pavlos,

Thanks for the response. Am I missing something here? As you say, "It is obvious that we should not use the test set mean and std". However, this is precisely the bug I am reporting (notice the use of the itest indices), since it is what is being done in cell 18 in the do_classify function:

itrain, itest = train_test_split(range(subdf.shape[0]), train_size=train_size)
if standardize:
    dftrain=(subdf.iloc[itrain] - subdf.iloc[itrain].mean())/subdf.iloc[itrain].std()
    dftest=(subdf.iloc[itest] - subdf.iloc[itest].mean())/subdf.iloc[itest].std()

The same is also done in cell 20 in the classify_from_dataframe function.
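A possible corrected version of that step is sketched below. It is self-contained and not the notebook's actual code: the function name `split_and_standardize`, the `seed` parameter, and the toy array are assumptions made for illustration, and a NumPy permutation stands in for `train_test_split`.

```python
import numpy as np

def split_and_standardize(X, train_size=0.6, seed=0):
    """Randomly split rows, then standardize both splits with training statistics.

    Hypothetical helper for illustration; it is not part of the lab notebook.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(round(train_size * n))
    itrain, itest = idx[:n_train], idx[n_train:]

    # Statistics come from the training rows only
    mean = X[itrain].mean(axis=0)
    std = X[itrain].std(axis=0, ddof=1)  # ddof=1 matches pandas' default .std()

    Xtr = (X[itrain] - mean) / std
    Xte = (X[itest] - mean) / std  # training mean/std, not the test rows' own
    return Xtr, Xte, itrain, itest

# Toy data (made up): 10 rows, 2 features
X = np.arange(20, dtype=float).reshape(10, 2)
Xtr, Xte, itrain, itest = split_and_standardize(X, train_size=0.6, seed=0)
print(Xtr.mean(axis=0), Xte.mean(axis=0))
```

The standardized training split has mean 0 and standard deviation 1 by construction, while the test split generally does not, which is expected when the training parameters are reused.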

Now, on whether it is correct to use the mean and standard deviation of the whole dataset, the Sebastian Raschka link above says:

'Note that in practice, if the dataset is sufficiently large, we wouldn’t notice any substantial difference between the scenarios 1-3 because we assume that the samples have all been drawn from the same distribution.'

In this case, there are only 212 observations in the training set and 142 in the test set, which is not a lot (especially compared with 63 predictors).

I think the main point the various authors are making is one of data leakage / data snooping when the mean and std of the entire dataset (training plus test) are used. The example used in the article mentioned above makes a lot of sense:

'Again, why Scenario 3? The reason is that we want to pretend that the test data is “new, unseen data.” We use the test dataset to get a good estimate of how our model performs on any new data. Now, in a real application, the new, unseen data could be just 1 data point that we want to classify. (How do we estimate mean and standard deviation if we have only 1 data point?) That’s an intuitive case to show why we need to keep and use the training data parameters for scaling the test set.'
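The single-data-point case from the quote can be sketched as follows. The numbers are made up, and NumPy arrays stand in for the notebook's DataFrame; `ddof=1` matches the default of pandas' `.std()`.

```python
import warnings
import numpy as np

# Hypothetical training data (values made up for illustration)
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
x_new = np.array([[5.0, 50.0]])  # a single new observation

# Using the new point's own statistics: the sample std of one point is undefined
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # NumPy warns about zero degrees of freedom
    own_std = x_new.std(axis=0, ddof=1)  # nan for every feature
    z_own = (x_new - x_new.mean(axis=0)) / own_std

# Using the stored training statistics: well defined for any number of new rows
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0, ddof=1)
z_train = (x_new - train_mean) / train_std

print("own statistics:     ", z_own)
print("training statistics:", z_train)
```

Standardizing the lone observation with its own statistics yields only NaN values, whereas the training parameters give a usable result.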

Yes, I agree that in practice using the whole-dataset statistics may not make much of a difference compared to using the training set mean and standard deviation, if the sample size is large and the observations are drawn independently from the same distribution. Yes, we could check this before deciding. But why even take the chance?
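One rough way to run such a check is sketched below. It uses synthetic standard normal data as a stand-in for the lab data (an assumption; the real features may behave differently), with the split sizes and predictor count quoted earlier in this comment. The seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n_train, n_test, n_features = 212, 142, 63  # sizes quoted in this thread

# Synthetic stand-in for the lab data: i.i.d. standard normal features
X = rng.normal(size=(n_train + n_test, n_features))
X_train = X[:n_train]

# Per-feature gap between training-only and whole-dataset statistics
mean_gap = np.abs(X_train.mean(axis=0) - X.mean(axis=0))
std_gap = np.abs(X_train.std(axis=0, ddof=1) - X.std(axis=0, ddof=1))

print("largest mean gap:", mean_gap.max())
print("largest std gap: ", std_gap.max())
```

On real data the same comparison would be run on the actual feature matrix, and small gaps would only suggest, not prove, that the choice of statistics matters little for that dataset.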

I think the answer to this question provides a great explanation and also links to further reputable resources which discuss the issue:

https://stats.stackexchange.com/questions/174823/how-to-apply-standardization-normalization-to-train-and-testset-if-prediction-i

Chuk
