Package maintenance #61

Open
ejolly opened this issue May 6, 2020 · 3 comments

@ejolly
Owner

ejolly commented May 6, 2020

Looking towards the future

Please contribute if you can!

Developing and maintaining this package has so far been performed primarily by myself. I've had some helpful contributions from various folks, but currently any new changes, issues, bugs, etc. ultimately fall to me.

While this is to be expected when starting an open-source project, it has become increasingly difficult to maintain. This is primarily because subtle changes in new versions of rpy2, the library that pymer4 depends upon, result in breaking changes more often than not.

Currently the issue list for rpy2 is 200+ between their GitHub repo and their Bitbucket repo. Many of these are installation issues, but many are not the fault of the rpy2 developers; rather, they stem from numerous changes to R packages (e.g. lme4) and the R language itself. While R itself doesn't change that often, version updates and how attributes, etc. are stored in various packages do (e.g. switching from lsmeans to emmeans). I don't envy their job keeping up with these changes, but I can't help but feel like I'm getting a taste of it too. The way many issues manifest for pymer4 users often entails hunting down how some data structure or function call has been updated on the R side of things (e.g. #60).

I'd love to keep maintaining this package, but I'm trying to think about the most sustainable way to do so moving forward. If you have suggestions, even ones that don't involve direct contributions, I'd love to hear them! In the meantime, an updated release will require dealing with the compatibility issues noted in #58 at minimum; some of that work has already been started in PR #62.

A few options I've considered (I'll add more as I think of them and receive suggestions):

Option 1

Create a Docker container with a known working version of pymer4, Python, R, and all their dependencies (a rough sketch follows the pros/cons below)

Pros

  • It would be relatively easy to install without any unexpected breaking changes since everything would exist in a frozen isolated environment
  • Bug fixes and additions to pymer4 would still be possible

Cons

  • There would be infrequent or no new updates from R, lme4, Python, etc.
  • Being isolated from the rest of one's system install means that any other packages or tools (e.g. Jupyter notebooks) would either have to be included in the container or linked to from outside (if possible), adding complications and preventing pymer4 from being an easy drop-in addition to one's existing analysis workflow.
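
As a rough illustration, the frozen image might look something like the following Dockerfile sketch. The base image tag and every version pin here are hypothetical and would need to be replaced with an actual known-good combination:

```dockerfile
# Hypothetical frozen environment; all pins below are illustrative, not tested.
FROM continuumio/miniconda3:4.8.2

# Pin R, lme4, and the Python scientific stack at one known-good combination
# so that upstream releases can never break the image.
RUN conda install -y -c conda-forge \
        python=3.7 r-base=3.6 r-lme4 r-lmertest \
        rpy2=3.3 pandas=1.0 numpy=1.18 && \
    pip install pymer4==0.7.0

CMD ["python"]
```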

Option 2

Rewrite the code base in a way that is robust to future changes. The current code base could already use some improvement and is very reflective of my learning trajectory in Python package development. The primary way I can see this happening now is to make direct calls to the R language by writing R code from within Python, rather than relying on rpy2 objects, methods, and classes (see the sketch after the pros/cons below).

Pros

  • This could be relatively robust, assuming there are no syntax changes in how to perform certain operations in R itself. Even if there were, those would be relatively easy to change.

Cons

  • This would take a significant amount of time for me alone, but could go faster with community contributions
  • We would still need to figure out the optimal way of extracting the required outputs from the R calls, in a manner that is similarly robust to version updates
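
To make the idea concrete, here is a minimal sketch of the string-evaluation approach, assuming rpy2 3.x with its pandas converter; the function name and the choice of what to extract are hypothetical:

```python
# Hypothetical sketch of Option 2; assumes rpy2 3.x with pandas2ri available.
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

def fit_lmer(formula, df):
    """Fit an lme4 model by evaluating literal R source instead of rpy2 wrappers."""
    converter = ro.default_converter + pandas2ri.converter
    with localconverter(converter):
        ro.globalenv["dat"] = df  # push the pandas DataFrame into R
    # All lme4-facing logic lives in plain R strings, so only this thin
    # evaluation layer touches rpy2; R-side API changes mean editing R
    # code, not hunting through rpy2 classes and methods.
    ro.r(f"model <- lme4::lmer({formula}, data=dat)")
    with localconverter(converter):
        # Coerce to an R data.frame so the trip back to pandas is simple.
        return ro.conversion.rpy2py(
            ro.r("as.data.frame(summary(model)$coefficients)")
        )
```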

Option 3

Freeze the rpy2 dependency only, and sync major pymer4 updates with rpy2 updates (see the example pin after the pros/cons below).

Pros

  • This would make maintenance much easier

Cons

  • It kind of kicks the can down the road, as eventually big updates will have to occur to account for changes in rpy2
  • At the same time, it could delay critical updates in cases where rpy2 pushes changes because something in core R has changed
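
In practice this could be as small as a compatible-release pin in setup.py; the version numbers below are purely illustrative:

```python
# setup.py (excerpt): hypothetical pin tying each pymer4 release to one rpy2 series.
from setuptools import setup

setup(
    name="pymer4",
    install_requires=[
        "rpy2>=3.3.0,<3.4",  # bump only alongside a major pymer4 release
        "pandas",
        "numpy",
    ],
)
```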

Option 4

Archive the project in its current state 😢

Pros

  • Similar to Option 1, there would at least be a fully functional version on GitHub for the community, but it would be left up to users to figure out the best way to perform a conflict-free install on their own machines
  • Time cost is minimal

Cons

  • Clearly my least favorite option and one that essentially abandons the project
@turbach
Contributor

turbach commented May 30, 2020

Hi all, first off thanks @ejolly for all the hard work; I would like to help out.

I recently started unpinning from Py 3.6 and numpy 1.18 and moving up to pandas 1.0+ and rpy2 3+. While troubleshooting a pymer4.Lmer multiprocessing.Pool issue in 0.7.0, I may have come across the cause (robjects.NA_Character) and a simple fix related to #57; I will test further in the coming days and post about those issues separately from this.

edit: see PR #63, #64

As for general maintenance, our lab runs a (very) mixed Python and R stack on CentOS Linux that includes an open-source project, fitgrid, that depends on pymer4. So we are quite familiar with dependency whack-a-mole. Our approach is a bit different, though: we have gradually migrated away from pip and now use conda (miniconda3, conda-build) for virtual envs, installation, and packaging (and are not the only ones ... NVIDIA Rapids).

The pymer4 installation docs recommend conda installs for r-base and friends and then switch to pip for pymer4. Our past experience using conda and pip installs in the same env has not been good, and we no longer mix the two. In practice this means we only conda install into conda envs. Conda envs have some of the virtues of containerization, but the main selling point for us has been the conda install dependency solver, precisely because it helps negotiate consistent versions of Python (pandas, numpy, scipy, rpy2, jupyter, ...) and R (lme4, lmerTest, tidyverse, rstudio, ...) packages in one env.
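
For example (the package list and versions are illustrative, not a tested recipe), a single solver call can negotiate both stacks at once:

```bash
# Hypothetical single-solve env: conda resolves the Python and R sides together.
conda create -n pymer4-env -c conda-forge \
    python=3.7 pandas numpy scipy rpy2 jupyter \
    r-base r-lme4 r-lmertest
conda activate pymer4-env
```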

If you're on board with conda for R, what about conda packaging pymer4 as well? We packaged a version in-house to get our stack to work (pymer4; details on request).

We also carry the conda-ification through the CI. Instead of having Travis pip install and run pytests, we have Travis conda-build the package and then conda install it into a fresh conda environment and run the pytests there. So the CI checks the pytests in an env populated with the latest, greatest versions of whatever compatible dependencies the conda dependency solver comes up with that night, with the package installed from the same .tar.bz2 binary a user would conda install from Anaconda Cloud. For releasing, Travis deploy is set up to trigger on v.N.N.N tags, rebuild the sphinx docs, and upload the freshly built and tested packages to Anaconda Cloud and PyPI. So dropping a semantically versioned release on GitHub automagically updates and syncs the docs and repo packages with the right version number.
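
Schematically, that flow looks something like the following hypothetical .travis.yml excerpt (not our actual config; the recipe path and deploy script are assumptions):

```yaml
# Hypothetical .travis.yml excerpt sketching the build -> install -> test -> deploy flow.
install:
  - conda build ./conda -c conda-forge          # build the recipe into a .tar.bz2
  - conda create -n testenv -c local -c conda-forge pymer4
script:
  - source activate testenv                     # test the installed binary, not the source tree
  - pytest --pyargs pymer4
deploy:
  provider: script
  script: bash ./ci/deploy.sh                   # docs + Anaconda Cloud + PyPI uploads
  on:
    tags: true                                  # fires only on v.N.N.N release tags
```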

The CI also helps with dependency whack-a-mole. Travis sends an email if there is a breaking change somewhere out there in the unpinned dependency world, which is what we really want to know. We hope this doesn't happen too often. When it does, diagnosis is to check the conda list dump in the Travis log to see what has changed since the last good run. Triage the problem. If it is obvious and easily fixed in the project code, fix it. If it is hard or in some dependency we don't control, do a minimal pin to the last working version of the offending dependency(ies) in conda/meta.yaml. Either way, pull the hotfix to master and do a GitHub v.N.N.N+1 patch release to refresh the repos.
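
A minimal pin of that sort might look like this (hypothetical conda/meta.yaml excerpt; the rpy2 bound is illustrative):

```yaml
# conda/meta.yaml (excerpt): hypothetical minimal pin after an upstream break.
requirements:
  run:
    - python >=3.6
    - rpy2 <3.4     # pinned to the last known-good series; everything else floats
    - pandas
    - numpy
    - r-lme4
```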

This keeps pytests passing in Travis and in the latest release in the public repos while buying time to defer major overhauls which might be tackled better down the road, say after a big dependency like pandas does a major release. A potential advantage of minimal pinning vs. freezing and containers is that the many dependencies with non-breaking changes are allowed to evolve so the rest of the environment stays as current as possible.

Obviously none of this fixes bugs or solves the problem of keeping up with breaking changes in rpy2 and R. And I should add that this is only for Py 3.6+ and linux64. Perhaps OSX is in reach, but I don't have time for that, let alone Windows or 64-bit exotics or 32-bit anything. And there's no free lunch: conda packaging brings new problems like wrangling channel priority, and official conda packaged versions tend to lag PyPI. But my take is that conda does offload some of the dependency headaches to the conda solver, and it makes the package much more likely to mix and match and install smoothly and reliably with other packages, at least in conda envs.

My 2c. Thanks again.

Tom

@ejolly
Owner Author

ejolly commented May 30, 2020

First off @turbach WOW. I can't express enough how much I appreciate your offer to help and this detailed comment!

I haven't run across many labs that actively maintain a mixed Python and R stack, so your experience here is immensely helpful. A conda package for pymer4 was actually a long-term goal for me! However, since I don't have much experience with building a conda package, my original plan was to essentially make one by converting the pip setup I already had, based on the documentation here. After thinking about how to incorporate the R side of things, though, I realized that a quick conversion wasn't really going to be quick at all, so I didn't pursue it any further and haven't had a chance to revisit it since. The fact that your lab already has a conda package for pymer4 in house blows me away! (Side note: apologies if I'm gushing a bit here, but I'm always a little amazed when pymer4 is so actively incorporated into other libraries/setups, since I'm mostly only aware of end-users of the library. Seeing it in software stacks like yours is so galvanizing as an independent OSS developer.)

Suffice it to say I'm very much on board with conda packaging for pymer4 as well. I know many users already rely on conda for their scientific computing stacks, so this would give them the flexibility to install it into their existing environments, or create separate ones if desired. Your CI workflow sounds amazing as well! My current flow is very inefficient, especially with regard to documentation updates. For other projects I've typically used sphinx + readthedocs, but again the mixed-language nature of the project means that the RTD config now requires an R installation to build the tutorials, etc., which was a huge pain. Instead I just build the docs locally and push them to a gh-pages branch on pymer4's project repo, which redirects to my personal website URL.

I'd love to see the details of the config files and any notification hooks from Travis or receive any assistance on this. Offloading as much of this as possible, as you described, to Anaconda's solver + Travis would make things a heck of a lot easier.

On an unrelated note, I was also completely unaware of fitgrid, which looks fantastic, as does the general "mass-univariate" approach applied to EEG in Smith & Kutas, 2015. The Cosan Lab, in which I work, primarily focuses on fMRI but also does some intracranial EEG, so I'm going to share this with some lab mates who may be interested in using multi-level modeling across sensor channels. We maintain a separate library primarily for fMRI analyses called nltools, but we've had discussions about extending it in different ways, and multi-level modeling always comes up. It's exceedingly rare in fMRI settings but definitely seems important, particularly with respect to incorporating item/trial-level random-effects terms, e.g. Westfall, Nichols, & Yarkoni, 2016.

@turbach
Contributor

turbach commented May 31, 2020 via email
