Fixes for large datasets #229
Conversation
Hey. Sorry for taking so long to review, I'll take a look now. The errors weren't exactly related to Python 3.7 but rather to scikit-learn >= 1.2, which was released in Dec 2022, recent enough that we shouldn't drop support for older versions yet. Can you please use something like the following instead?

```python
try:
    encoder = OneHotEncoder(categories=categories, sparse_output=sparse_s, dtype=np.float32)
except TypeError:  # sklearn < 1.2
    encoder = OneHotEncoder(categories=categories, sparse=sparse_s, dtype=np.float32)
```
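For reference, here is a minimal, self-contained version of this fallback pattern; `categories` and `sparse_s` are stand-ins for the values used inside `aggregate`, filled with toy values here:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

categories = [np.array(["a", "b", "c"])]  # one array of allowed categories per column
sparse_s = True                           # request a scipy sparse output matrix

try:
    # scikit-learn >= 1.2 renamed the argument to `sparse_output`
    encoder = OneHotEncoder(categories=categories, sparse_output=sparse_s, dtype=np.float32)
except TypeError:  # sklearn < 1.2 only knows the old `sparse` argument
    encoder = OneHotEncoder(categories=categories, sparse=sparse_s, dtype=np.float32)

encoded = encoder.fit_transform(np.array([["a"], ["c"], ["b"]]))
print(type(encoded))  # a scipy sparse matrix when sparse_s is True
```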
Hi Jose, yeah, thanks for the tip. I did it. Note: the currently biggest point of discussion is the last bullet point in the PR description. It seems that the change you (Fede) introduced in PR #190 is very unintuitive to users. I'm not sure what the right way to handle this is, but you have some issues open and I also had complaints from my collaborators. The warning I added is just a small plaster on a big wound. But this could also be taken out of this PR if desired.
Just some minor comments. Also, can you please revert the changes to Python 3.7? The library is still compatible with it and we want to support it for a little bit longer.
@jmoralez I think it's all done.
Thanks a lot! LGTM
Some minor but somewhat helpful changes:

- Removed the call to `S.values`, which creates the dense matrix from S when `sparse_s = True` in the `aggregate` function. This avoids OOM errors for very large datasets (I've used it [the `aggregate` function] successfully on a 400k time-series dataset after the change); see the sketch after this list.
- Updated the deprecated scikit-learn `sparse` argument. Backwards compatibility with old sklearn libraries is maintained.
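To make the OOM point concrete, here is a rough back-of-the-envelope sketch; the series counts are hypothetical except for the 400k figure taken from the description above, and `S` is assumed to be the usual 0/1 summing matrix:

```python
import numpy as np
import scipy.sparse as sp

n_bottom = 400_000  # bottom-level series, as in the PR description
n_total = 450_000   # bottom + aggregated series (hypothetical count)

# Densifying S via S.values would allocate every entry explicitly:
dense_bytes = n_total * n_bottom * np.dtype(np.float32).itemsize
print(f"dense float32 S: ~{dense_bytes / 1e9:,.0f} GB")  # ~720 GB -> OOM

# A sparse summing matrix only stores its nonzero entries:
S = sp.random(1_000, 1_000, density=0.001, format="csr", dtype=np.float32)
sparse_bytes = S.data.nbytes + S.indices.nbytes + S.indptr.nbytes
print(f"sparse demo matrix: {sparse_bytes} bytes for {S.nnz} nonzeros")
```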
Excluded from this PR:

- A fix for the zero-prepending behaviour of the `aggregate` function, as the current behaviour is quite unintuitive and leads to users' confusion. [EDIT: Aaah, ok, it seems that this behaviour was introduced as a result of this PR here: [FIX] Aggregate unbalanced datasets #190; see also this issue: Broken aggregate function in main branch #211. What was the reason again to move this line
  `Y_bottom_df = Y_bottom_df.groupby(['unique_id', 'ds'])['y'].sum().reset_index()`
  two lines up?] [EDIT 2: Perhaps there should be an additional argument with an option whether to prepend the shorter time series with zeros, or not? See the illustrative sketch below.]
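Since the zero-prepending behaviour keeps coming up, here is an illustrative pandas mock-up of what it amounts to for unbalanced datasets. This is not the actual `aggregate` implementation, and the `prepend_zeros` argument in the final comment is purely hypothetical (it is the EDIT 2 suggestion, not an existing parameter):

```python
import pandas as pd

df = pd.DataFrame({
    "unique_id": ["a", "a", "a", "b"],  # series "b" starts later than "a"
    "ds": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-03"]),
    "y": [1.0, 2.0, 3.0, 4.0],
})

# Balance every series over the full set of timestamps, filling gaps with 0,
# which effectively prepends zeros to the shorter series.
full_idx = pd.MultiIndex.from_product(
    [df["unique_id"].unique(), df["ds"].unique()], names=["unique_id", "ds"]
)
balanced = df.set_index(["unique_id", "ds"]).reindex(full_idx, fill_value=0.0).reset_index()
print(balanced)  # "b" now has y == 0 on 2023-01-01 and 2023-01-02

# Hypothetical opt-out, as suggested in EDIT 2 (not an existing argument):
# aggregate(df, spec, prepend_zeros=False)
```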