Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rm data leakage in example notebooks #2008

Closed
wants to merge 3 commits into from

Conversation

SimTheGreat
Copy link
Contributor

fixes #644

Other Information

fix scaling in example notebook

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@codecov-commenter
Copy link

Codecov Report

All modified lines are covered by tests ✅

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

see 6 files with indirect coverage changes

📢 Thoughts on this report? Let us know!.

Copy link
Collaborator

@dennisbader dennisbader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this @SimTheGreat this is a nice cleaning up of the notebooks 🚀

I added some suggestions in the ReviewNB. The main suggestions were:

  • try to revert changes to files where no code changed -> we rude the git diffs
  • try to upgrade ipywidgets and rerun all the notebooks with changes (I explain in one of the comments how to do it): this will make the progress bar look better and improve the notebook appearance
  • we can actually avoid scaling/transforming for many of the covariates as they were one hot encoded.

@@ -2970,7 +2970,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
"version": "3.10.11"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed, let's revert this change

Suggested change
"version": "3.10.11"
"version": "3.9.16"

@review-notebook-app
Copy link

review-notebook-app bot commented Sep 28, 2023

View / edit / reply to this conversation on ReviewNB

dennisbader commented on 2023-09-28T08:13:36Z
----------------------------------------------------------------

There is an issue with ipywidgets not showing the progess bar properly (the "Predicting: 0it [00:00, ?it/s]")

I had this as well. We can take this opportunity to improve the progess bar appearance :)

Could you try pip install -U ipywidgets in your environment, close and reopen the jupyter session and run all the example notebooks where you applied changes again? The expected bar should look something like Lightning-AI/pytorch-lightning#1399 (comment)


@review-notebook-app
Copy link

review-notebook-app bot commented Sep 28, 2023

View / edit / reply to this conversation on ReviewNB

dennisbader commented on 2023-09-28T08:13:37Z
----------------------------------------------------------------

Can you rename air_train/val and milk_train/val in the beginning to air_train/val_covariates and mil_train/val_covariates and then use the transformer/scaler to overwrite them?

Just have a clear distinction between the covariates and the target series (train/val_air,train/val_milk )


@@ -4450,7 +4450,7 @@
},
Copy link
Collaborator

@dennisbader dennisbader Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there were no code changes in this notebook. Can you try to revert the changes to this file? We usually try to keep the git history/diffs as minimal as possible :)


Reply via ReviewNB

@@ -54,19 +54,22 @@
"source": [
Copy link
Collaborator

@dennisbader dennisbader Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add the following lines in this cell at the bottom?

We want to avoid some info from PyTorch lightning (mostly the device information that get spammed when calling historical forecasts)

import warnings

warnings.filterwarnings("ignore")
import logging

logging.disable(logging.CRITICAL)



Reply via ReviewNB

@@ -54,19 +54,22 @@
"source": [
Copy link
Collaborator

@dennisbader dennisbader Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #16.    scaler_month = Scaler()

actually we can avoid scaling the month series, as it's one hot encoded anyways (encodes months as n months - 1 binary columns).

Can you remove the scaling here?


Reply via ReviewNB

@@ -54,19 +54,22 @@
"source": [
Copy link
Collaborator

@dennisbader dennisbader Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this here will definitely look better after upgrading ipywidgets (my comments in notebook 01)


Reply via ReviewNB

@@ -54,19 +54,22 @@
"source": [
Copy link
Collaborator

@dennisbader dennisbader Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #17.    series_en_transformed = scaler_en.fit_transform(series_en)

this line should be somewhere below fitting the transformer on train_en and using transform() 


Reply via ReviewNB

@@ -54,19 +54,22 @@
"source": [
Copy link
Collaborator

@dennisbader dennisbader Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #26.    scaler_day = Scaler()

here we can avoid scaling again as we use a one hot encoding for the days


Reply via ReviewNB

@@ -37,7 +37,8 @@
"from darts.models import NBEATSModel\n",
Copy link
Collaborator

@dennisbader dennisbader Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #10.    from darts.timeseries import concatenate

change to

from darts import concatenate



Reply via ReviewNB

@@ -11,7 +11,7 @@
},
Copy link
Collaborator

@dennisbader dennisbader Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you rerun this notebook? seems like there were some issues with too short series for validation


Reply via ReviewNB

@@ -11,7 +11,7 @@
},
Copy link
Collaborator

@dennisbader dennisbader Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #27.    train, val = day_series.split_after(pd.Timestamp("20170901"))

we can avoid scaling the day series as it's one-hot encoded


Reply via ReviewNB

@@ -11,7 +11,7 @@
},
Copy link
Collaborator

@dennisbader dennisbader Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #2.        series=day_scaled,

this should be

series=series_en_transformed,

Reply via ReviewNB

@@ -44,7 +44,7 @@
"from darts import TimeSeries\n",
Copy link
Collaborator

@dennisbader dennisbader Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #11.    from darts.timeseries import concatenate

from darts import concatenate


Reply via ReviewNB

@@ -44,7 +44,7 @@
"from darts import TimeSeries\n",
Copy link
Collaborator

@dennisbader dennisbader Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #23.    scaler_day = Scaler()

avoid scaling as one hot encoded


Reply via ReviewNB

@@ -44,7 +44,7 @@
"from darts import TimeSeries\n",
Copy link
Collaborator

@dennisbader dennisbader Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #2.        series=concatenate([train_en_transformed, val_en_transformed]),

Let's define series_en_transformed again before (where we apply the scaler), and then reuse it here and in the next cell. Like this we can avoid having to compute series_en_transformed twice, which can reduce time for longer series.


Reply via ReviewNB

@review-notebook-app
Copy link

review-notebook-app bot commented Sep 28, 2023

View / edit / reply to this conversation on ReviewNB

dennisbader commented on 2023-09-28T08:13:47Z
----------------------------------------------------------------

It looks like there weren't any code changes in this file. Can you try to revert the changes?


@SimTheGreat SimTheGreat mentioned this pull request Oct 5, 2023
@madtoinou madtoinou closed this Oct 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update scaling in notebook(s) to avoid fitting scaler on val set
4 participants