added SINDy library input option to WeakPDELibrary #433
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##           master     #433      +/-   ##
==========================================
+ Coverage   93.87%   93.90%   +0.02%
==========================================
  Files          37       37
  Lines        3626     3608      -18
==========================================
- Hits         3404     3388      -16
+ Misses        222      220       -2

View full report in Codecov by Sentry.
Thank you @znicolaou, this is a great feature! I'll review the code today or tomorrow. Do you have a simple test you could add for this? I would defer #222 to a later PR, as it would help keep this PR simple.
Thanks, @Jacob-Stevens-Haas! Yes, I will revisit the tests and the example notebooks in the next day or two!
This seems like such a great feature for a lot of reasons. It expands the ability to use WeakPDELibrary for nontrivial problems, it avoids using special dunder attributes like __code__, and it simplifies the logic considerably.
Let's go ahead and deprecate the old behavior! See the comment for a suggestion.
if self.library is not None:
    lib_names = self.library.get_feature_names()
else:
    for i, f in enumerate(self.functions):
        for c in self._combinations(
            n_features, f.__code__.co_argcount, self.interaction_only
        ):
            lib_names.append(
                self.function_names[i](*[input_features[j] for j in c])
            )
feature_names = feature_names + lib_names
What if, in initialization, we adapt the old way into the new way? Then we could get rid of all the self.library is not None conditionals and deprecate the old functionality. E.g.:

if library is None:
    self.library = CustomLibrary(
        library_functions=library_functions, function_names=function_names
    )
    warnings.warn(
        "Use the `library` argument instead of `library_functions` and `function_names`. "
        "Pass a `CustomLibrary` if you truly want custom functions.",
        DeprecationWarning,
    )
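As an illustration of the adapter pattern suggested above, here is a minimal self-contained sketch. The class names `OldLibrary` and `WeakLibrary` are hypothetical stand-ins, not the pysindy API:

```python
import warnings


class OldLibrary:
    """Hypothetical stand-in for CustomLibrary, for illustration only."""

    def __init__(self, functions):
        self.functions = functions


class WeakLibrary:
    """Sketch of the adapter: when only the legacy argument is given,
    wrap it in the new-style object once and emit a DeprecationWarning,
    so the rest of the class only ever touches self.library."""

    def __init__(self, library=None, library_functions=None):
        if library is None:
            warnings.warn(
                "Use the `library` argument instead of `library_functions`.",
                DeprecationWarning,
            )
            library = OldLibrary(library_functions or [])
        self.library = library
```

This keeps the old call signature working for one release cycle while the internals deal with only a single code path.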
I think that's a good idea. It'll take a bit to make the example notebooks reproduce the old behavior, which I think is necessary before we merge, but I think it's worth it.
Wait, is this why you added TrimmedLibrary?
Last point about deprecating--I think we should probably implement the same change in
…er from unnecessarily copying Theta to conserve memory for PDEs
…edLibrary; merged update from master
Hey, thanks for spreading this change to PDELibrary! But this adds a lot of new changes as well. I'm OK with skipping deprecation and going right to removal of library_functions - after all, it does remove code and tests, and that's a good thing. But why add a TrimmedLibrary class?
@@ -1,12 +1,12 @@
-# %% [markdown]
+#!/usr/bin/env python
Did you modify the notebooks, then save as .py? The preferred way is to modify the .py, then run publish_notebook.py. This keeps the metadata and formatting the same between different notebook users, which minimizes diffs, e.g. # %% [markdown] vs # # coding: utf-8. It also keeps the nbdiff smaller by limiting changes to the cell and notebook metadata.
Ah, good to know about the publish_notebook script, I didn't keep track of that before. I'll do it in the future.
These are minor enough changes that I'm happy to do it on my end, if you don't mind?
for x in x_full:
    xp = self.library.transform([x])[0]
    xp = np.delete(xp, self.drop_inds, axis=xp.ax_coord)
    xp = AxesArray(xp, comprehend_axes(xp))
Would it work to create a new AxesArray with xp = AxesArray(xp, axes=xp.axes), so we don't rely on the conventions in comprehend_axes?
Good question: I copied the format from the ConcatLibrary and made minimal modifications. I think it would work fine that way. As a side note, there's a lot of old code using AxesArray in inconsistent ways. It would be great to go through and ensure that we don't rely on the old conventions of axis=-1 for ax_coord and axis=-2 for time, but I haven't had time to look through it.
Agree it would be nice, but it's probably not something anyone needs to put time into until it causes an error.
@@ -103,12 +85,13 @@ class PDELibrary(BaseFeatureLibrary):
     def __init__(
         self,
-        library_functions=[],
+        function_library: BaseFeatureLibrary = PolynomialLibrary(
Mutable default values should be set in __init__(). Since PolynomialLibrary is mutable, this has surprising side effects. Check this out:

from pysindy import PDELibrary, SINDy
import numpy as np

data_1 = np.zeros((10, 5, 1))
data_2 = np.zeros((10, 5, 2))
model_1 = SINDy(feature_library=PDELibrary()).fit(data_1)
model_2 = SINDy(feature_library=PDELibrary()).fit(data_2)
assert model_2.feature_library.function_library is model_1.feature_library.function_library
# is true
model_2.print()
model_1.print()
# IndexError: the inner function library is shared, and has been resized to the second input

This is because the call to create default inputs like PolynomialLibrary() occurs only once, when the __init__() function signature is read.
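The pitfall generalizes beyond pysindy. A minimal self-contained sketch of the bug and the usual fix, using hypothetical `Pipeline` classes (not a real API):

```python
class Pipeline:
    """Buggy version: the default list is built once, at class-definition
    time, and then shared by every instance that relies on it."""

    def __init__(self, steps=[]):  # BUG: shared mutable default
        self.steps = steps


class SafePipeline:
    """The usual fix: default to None and build the object inside
    __init__ so each instance gets a fresh copy."""

    def __init__(self, steps=None):
        self.steps = [] if steps is None else steps


a, b = Pipeline(), Pipeline()
a.steps.append("scale")
shared = b.steps          # ["scale"] -- b sees a's mutation

c, d = SafePipeline(), SafePipeline()
c.steps.append("scale")
independent = d.steps     # [] -- d is unaffected
```

The same `None`-sentinel pattern is what the review suggests for the `function_library` default.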
Makes sense, thanks!
@@ -157,11 +135,11 @@ class WeakPDELibrary(BaseFeatureLibrary):
     def __init__(
         self,
-        library_functions=[],
-        library=None,
+        function_library: BaseFeatureLibrary = PolynomialLibrary(
Same as above - create it in __init__.

# This calls check_X_y
NIT: This is the kind of comment that is perhaps better in a commit message explaining the change from check_X_y to _preprocess_data.
Yeah, you're right, sorry about that.
@@ -93,7 +92,7 @@ def __init__(
     max_iter=20,
     normalize_columns=False,
     initial_guess=None,
-    copy_X=True,
+    copy_X=False,
Why change? False means that the optimizer mutates the arguments that get passed, which feels like surprising default behavior. E.g. if someone compares two models using the same numpy array, this will silently update the data in between fitting the two models. This is why the superclass sets the default to True.
The matrix being copied here is the output of the library.transform function (after passing through the SampleConcatter transform), which I don't think will ever be shared between two fits unless you hack it. For PDE data, this matrix can easily be more than 10GB in size, and copying it is very costly, especially when the copy does nothing. Looking at sklearn/linear_model/_base.py and __init__ here, this copy_X is only used in _preprocess_data. When fit_intercept=False, as is the default here, _preprocess_data does no centering and no rescaling, so the copying seems to do nothing. Nevertheless, this was causing spikes in memory usage that can crash examples when memory is low. If there's a good reason to copy the data here, I'm happy to note instead that you can set copy_X=False to conserve resources, but I'm pretty sure it's doing nothing.
I agree that it's helpful for copy_X to be false when fitting large data, but I think the docstring is the place to note that option. Changing the code would apply to all subclasses. Making the default False would silently change the behavior of any estimator that modifies the arguments to _reduce, moving away from the convention not just in scikit-learn, but across all of PyData (e.g. numpy functions where inplace=False is the default). It's general practice that public functions should not mutate arguments.
Our sparse estimators have a public API, and I don't think using one outside of a pysindy.SINDy model is really a hack. It's designed behavior.
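The trade-off under discussion can be shown with a toy preprocessing step. The `center_columns` helper below is hypothetical, not the sklearn or pysindy function: with copy=True the caller's array is untouched, while copy=False saves memory at the cost of mutating the caller's data in place.

```python
import numpy as np


def center_columns(X, copy=True):
    """Subtract each column's mean. With copy=True the input array is
    preserved; with copy=False the input is modified in place, which
    avoids a duplicate of a potentially huge matrix."""
    if copy:
        X = X.copy()
    X -= X.mean(axis=0)
    return X


data = np.array([[1.0, 2.0], [3.0, 4.0]])
safe = center_columns(data, copy=True)     # data is unchanged
unsafe = center_columns(data, copy=False)  # data is now centered in place
```

This is the surprise the reviewer describes: after the copy=False call, any other code holding a reference to `data` silently sees centered values.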
-    ) -> Tuple[cp.Variable, cp.Expression]:
+    ) -> Tuple["cp.Variable", "cp.Expression"]:
from __future__ import annotations should take care of this.
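For context, a minimal sketch of the string-annotation approach used in the diff, assuming `cp` refers to an optional dependency such as cvxpy that only needs to exist for static type checkers:

```python
from typing import TYPE_CHECKING, Tuple

if TYPE_CHECKING:
    import cvxpy as cp  # resolved by static analyzers only, never imported at runtime


def build_problem() -> "Tuple[cp.Variable, cp.Expression]":
    """The quoted annotation is a forward reference: Python stores it as
    a plain string instead of evaluating it, so cvxpy need not be
    installed for this module to import."""
    raise NotImplementedError
```

Adding `from __future__ import annotations` at the top of the module achieves the same effect for every annotation without the explicit quotes (PEP 563).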
I just merged @akaptano's fix that he pushed in master. Can do this instead.
Ah, makes sense. No worries - either way is fine, as SCA mostly reads quotes as forward references.
@@ -506,6 +506,117 @@ def calc_trajectory(self, diff_method, x, t):
     return self.libraries[0].calc_trajectory(diff_method, x, t)

+class TrimmedLibrary(BaseFeatureLibrary):
What is the use case for adding a TrimmedLibrary class?
When we made the EnsembleOptimizer, we lost some functionality in notebook 13, which I noticed when I went through the example notebooks now. It's also useful for implicit SINDy to be able to mask a feature library term if you want to use it in place of x_dot, as in SINDyPI. I added it here just to get the example notebook 13 running again, but it is useful to be able to construct a library and then remove specific terms from it. Maybe it would be better to make another pull request, but I only had a week where I could focus on pysindy, so I snuck it in here. I have to turn to other work now.
The quick fix, if you don't have time to focus on pysindy, is to either let example 13 be broken, or remove TrimmedLibrary from here and put it in notebook 13. The enterprise-level fix is to remove it from this PR, add an issue for what was broken in 13 and why a TrimmedLibrary is a useful solution, and handle it separately with a new test.
Broadly speaking, I don't think we need to maintain the notebooks unless they implement a change for fast testing in #203 (see README.rst). It's too much of a burden otherwise, since they take a huge amount of time and memory to run. When someone runs into an error, they can make an issue and we can build a test, a fix, and, if someone has the time, implement the maintainability fixes from #203.
The broken notebooks serve as an example of what pysindy can do, even if they don't show the modern syntax.
I think it is a terrible practice to break the example notebooks. That compromises recently published work that other groups may be trying to reproduce and can ruin the reputation of recent graduates.
I agree that reproducibility and supporting recent graduates are important. But "how to keep one piece of software perfectly reproducible while working on another version" is a solved problem: add a requirements file for each of the notebooks, and use a separate environment to satisfy those dependencies. If you don't have pinned requirements, you don't actually have something reproducible. At the bare minimum, just put the version of pysindy that the experiments require at the top of the file: 1.7.4 or 1.7.5.
I didn't think you or Alan were that attached to the notebooks all being able to run, since when I started contributing a couple years ago, some of them were already broken. Nevertheless, I do care about supporting your and Alan's research longevity, and that includes across major releases. So I built #203 to make it easier for you and Alan to protect the functionality of your notebooks across major versions (in particular, see the readme part).
I'd be happy to have a discussion about examples vs full research papers as notebooks in the repo, as well as whether we should always maintain backwards compatibility of a notebook across major versions even if they could have fast tests. When does reading a bit of updated documentation or troubleshooting an error become negligible compared to reading the notebook & paper in the first place? It's not necessarily germane to this PR - we can move that to discussions if it's better there.
Nevertheless, TrimmedLibrary doesn't make any code faster or cleaner inside the repository, and its only use case is in Example 13. As I said, the quick fix for this PR is to move it to examples/13_ensembling/utils.py.
PS What functionality had been lost from example 13?
I'm trying to be respectful, but discussions like this are not a good use of resources. I can spend time explaining decisions to you that are important, but that's not really my job.
You can open the notebooks, read the papers, understand what they are trying to do, and figure out why I made the changes if you want. You'll see that the features in the TrimmedLibrary are exactly those that were lost in notebook 13, but getting the notebook running is not the only reason to include it. I explained the use case with SINDyPI and ensemble SINDy already above, and I wanted to be able to test those things on research questions.
I know you want me to make another pull request for it and you will weigh in on its value, but I don't want to waste a bunch of time writing about trivial changes and discussing them. If you disagree with my decisions, then I will hear out your rationale, but the costs of contributing here are outweighing the benefits for me.
This repository is not an enterprise endeavor, and good scientific practices (particularly, not wasting time and computing resources) are at least as important as enterprise coding practices here in my opinion. We can have a difference of values, but we shouldn't let that become a barrier to collaboration.
I think this is done now, pending any other issues. I can take a look once more, but I have other work that I need to focus on now. If things look okay, I'm happy to merge any time.
If you need to move onto other things, would you like me to take this PR over the finish line?
With that PR and issue, are you ok if we close this, Zach? (But let's keep the branch around so that we can pull in
As a partial response to issue #351 and discussion #429 here, I'm adding an option to replace the library_functions and function_names input of WeakPDELibrary with library, which is an existing instance of BaseFeatureLibrary. I've only tested so far on the first example in the example notebook 12, which reproduces the previous result.
I'd also like to take some time to look through the PDE example notebooks to clean them up and make sure deprecated input has been removed, so I'm making this a draft for now. If anyone has other suggestions on modifications to the weak/pde input or functionality, this is a good chance. I may also look into #222 if it isn't a huge headache.