Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove research experiments from pysindy #461

Open
Jacob-Stevens-Haas opened this issue Jan 19, 2024 · 6 comments
Open

Remove research experiments from pysindy #461

Jacob-Stevens-Haas opened this issue Jan 19, 2024 · 6 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@Jacob-Stevens-Haas
Copy link
Member

Is your feature request related to a problem? Please describe.

Documenting the second part of @znicolaou and my discussion on #433, of which #449 was the first part.

This repository has served two purposes: one, as a central point for members of the research group to post jupyter notebooks of their published research papers, and two, to allow others to rapidly reuse innovations from SINDy research for comparison, method extension, or for actual system discovery. I refer to these desiderata as Extensibility, Reproducibility, and Reusability.

Ideally, there is no inherent conflict between these aspirations: as all packages from time to time need to be able to make breaking changes in order to keep the code maintainable, and one can achieve reproducibility for an actual research experiment by pinning all dependencies/using a lockfile. However, the experiments also serve as documentation about how to use the code in interesting ways (a mix of tutorials and how-to-guides). Further, they aspire to a more expansive definition of reproducibility, allowing someone to verify experimental results using the most up-to-date syntax. But because they are slow and not in CI, they're frequently broken by changes (most obviously by changing the contract between caller and callee - see #347, which removed the quiet argument from SINDy.fit(), but also occur more subtlely when a default value or return information changes (e.g. #458)).

As a side note, examples/ is huge, which makes any improvements to the build process or dependencies pointing to this github repo very slow, e.g.

"pysindy[misor] pysindy@git+https://github.com/dynamicslab/pysindy.git@trapping-plumes"

The situation is harder, as current notebooks do not have pinned, reproducible dependencies, so research isn't, strictly speaking, reproducible. It also becomes difficult to troubleshoot what change caused an error in a notebook as we do not CI the notebooks. Since running the notebooks takes ages, it's infeasible to promote small PRs and also run the notebooks every single PR. As a result, notebooks quietly break. That could hurt the careers of researchers who want reviewers and collaborators to be able to reproduce their research and don't expect version changes or are unfamiliar with software concepts like semantic versioning and refactoring.

Any long-term solution needs to be maintainable. It needs to allow researchers to keep their experiment up to date with pysindy major version bumps, if they want to put the work in. On the other hand, pysindy maintainers need to know whether a change breaks any calling conventions used in documentation.

Describe the solution you'd like

Notebooks that describe how to use a feature stay part of this repository, but under the README guidelines that ensure maintainability. Notebooks that are a complete research paper go to their own repo, but can still get linked in the documentation (or even built fully into the documentation). A good heuristic is whether an author is OK with maintainers changing the ODEs/PDEs, alternate methods, and parameters presented in the notebook. E.g. a demonstration of how to use a pysindy feature and where it should be used is a good how-to guide/tutorial that this package should maintain. An in-depth comparison of how different pysindy methods perform in bounding some property of some particular PDE system is a research notebook.

Implementation

  1. A template repository that lets people migrate their notebooks out of the pysindy repository. It should make it easy to guard against breaking changes in the pysindy API. When things break, notebook owners can either update their code, add an issue in pysindy if the break wasn't intended, or cap the version of pysindy for the notebook. The repo should also provide both a place to pin the honest-to-god reproducible requirements as well as a place to accept a range of versions. Finally, the template should include sphinx configuration to build the notebook into a web page.
  2. Once that's done, stop accepting PRs that add deeply involved "example-but-also-publishable-research" notebooks. Instead, people can fork the template repo, add their notebook there, and PR a link to the built documentation in our examples/README.rst. If they want to provide the expanded "reproducible in current major version", they're responsible for the maintenance of their own repo. Otherwise, it's ok to pin the dependency to a single pysindy version. If someone still wants to leave a true example notebook to make their contribution more reusable, they can PR one according to the guidelines in the README
  3. Create a github workflow here to run existing notebooks once per week/month/quarter/version-bump. This will alert us which notebooks are using syntax that falls current version, and we can alert the authors.
  4. After notification, authors have one year to implement the new pattern. Any broken notebooks will be removed from this repo and placed in a fork of the template repo, pinned to the version of pysindy in existence when they were first introduced and not touched. Original authors would be given full privileges on that repo.

Describe alternatives you've considered

Keep the notebooks but let them break. Checking compatibility of any patch to an ever-growing list of notebooks is infeasible, and maintaining the advertisement of someone else's research is the job of only the author. At the same time, setting up the infrastructure for researchers to plug into pysindy development is a bit of work as well.

Fix it fast. I am also ok jumping from 3 to 4 without any waiting period, leaving the addition of how-to notebooks for future documentation PRs.

Additional context

CCing everyone who's been involved @akaptano @znicolaou @eigensteve @briandesilva @jcallaham @MPeng5 @kpchamp @OliviaZ0826 @yb6599 @Ohjeah @nathankutz

I'm not sure the propsed solution is correct, and I want to get people's input.

@Jacob-Stevens-Haas Jacob-Stevens-Haas added enhancement New feature or request question Further information is requested labels Jan 19, 2024
@akaptano
Copy link
Collaborator

akaptano commented Jan 19, 2024

@Jacob-Stevens-Haas Thanks for the very thoughtful post -- I agree with essentially everything here. I wonder if there is a way to have a pysindy repository that is stable (we can strip out all the researchy-topics like the trapping functionality) and a pysindy repository that is for development by researchers. My worry with this is that if every person who makes a new change to some SINDy method creates a new fork/repo of PySINDy, it will be hard to keep PySINDy up-to-date with the latest and best variants. But you're right that we also don't want every person with a half-baked idea and okay coding skills (talking about myself here :)) making updates that break a bunch of other things in the code.

For a historical perspective, the state of the Kutz/Brunton lab before PySINDy was that every new grad student and post-doc wrote their own code full of bugs and conflicts with other codes, so it has been a real boon to getting these SINDy-based methods working without reinventing the wheel (and to me, the "wheel" nowadays should be using weak form SINDy with ensembling...).

On a related noted, I have been wanting to get a version of PySINDy implemented that uses the multi-step prediction error as the loss term, but this requires totally circumnavigating the whole sklearn pipeline, so I have been considering writing a repo based on PySINDy but with all the sklearn stuff stripped out (as well as some of the other "extras" that I think are unnecessary... like multiple trajectory concatenation or non-PDE libraries since the PDELibrary does everything now anyways). This is anyways what @KKahirman and Joe Bakarji had to do for their projects with the multi-step prediction error, but their codes are not compatible currently with PySINDy functionalities. I would really like to do multi-step prediction error with ensembling and weak form and all the other advances. Happy to discuss these topics more here if anyone has more thoughts on this!

@jcallaham
Copy link
Collaborator

+1 from me on the proposal. From my end, the main notebook I contributed was the cavity flow one. I'm not sure if a (stripped down) version of that would be good to keep as a "neato" fluid dynamics case (no fancy features or dependencies, just basic SINDy). Should I open a new ticket to discuss that?

In brief, that notebook is not really serving any of the three main purposes you highlighted, so I'd propose either

  1. Get rid of it
  2. I'll revise and significantly simplify it so it would be fast to run in CI and easy to maintain

@Jacob-Stevens-Haas
Copy link
Member Author

Jacob-Stevens-Haas commented Jan 21, 2024

Thanks for your comments guys. To clarify, @akaptano I do think the advanced research methods like Trapping SINDy have value in the repo... best case in point is that I only discovered its issues because it was an attractive method for me as a user in another project 😅 I think we can provide better stability by bumping versions more frequently and faithfully to SemVer. Personally, I'm wary of bumping and releasing a version because I feel uncomfortable with a version release that breaks notebooks, which is part of the reason for this post.

As for multi-step prediction error: I also would love to have that too, and currently feel awkward writing about the heuristic score of "how the discovered system looks". We could have an issue dedicated to that so we can link to use cases and people who have rolled their own implementations. I couldn't see what @KKahirman did, if it's in his public repo.

@jcallaham I think that's a great example in its full glory. It might have been confusing in the issue post, but I can handle a more robust version of (1) that still builds the notebook into the documentation, but as part of a separate repo that depends upon a pinned pysindy version. (2) has a simpler option:

Save it as examples/14_cavity_flow/example.ipynb, save it again as a python file examples/14_cavity_flow/example.py. Then, have a cell up front that determines input data and important constants, e.g. integration tolerances, integration times, data source.

if __name__ == "testing":
    data = {
        "t": np.arange(0, 1, .1),
        "x": np.random.normal(size=(<small array of correct ndim>)
    }
   t_sim = np.arange(0, .5, .1)
    ...
else:
    data = loadmat("data/cavityPOD.mat")
    t_sim = np.arange(0, 300, dt_rom)

The repo currently configures pytest to skip notebook scripts locally, but include them in CI.

@Ohjeah
Copy link
Collaborator

Ohjeah commented Jan 24, 2024

Hi, thanks for the pin.

I agree with @Jacob-Stevens-Haas: complex examples should be move to a different repository and each example should have a set of pinned dependencies.

@Jacob-Stevens-Haas
Copy link
Member Author

An aside - for those who have multiple notebooks in a single directory, sphinx chooses a notebook to build into the documentation randomly (well, sphinx's sorting of filenames).

@briandesilva
Copy link
Member

I'm very late to the party, but I also agree with the solution that @Jacob-Stevens-Haas has proposed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request question Further information is requested
Projects
None yet
Development

No branches or pull requests

5 participants