add R packages from the Galaxy TS #574
Hi Björn - We've been separating bioconductor packages from CRAN packages using different name prefixes; check out the existing recipes for the naming scheme.
@daler thanks for the info. I have seen this and will use it for new versions. What I'm trying here is to migrate Galaxy versions to bioconda. This means that most of the packages are old and have old dependencies. The aim is to keep the installation identical to the Galaxy one, so every single package in the exact version. It will get crazier as I migrate more packages. This was only a test - more to come :) Do you think this is a workable solution as a migration path? I will change the prefix in my migration scripts to fit the naming schema. Talking about the deseq2 package: it has some additional dependencies on rjson that are needed for Galaxy integration. Do you have an idea how we can solve this? A new namespace, maybe?
@daler I changed the namespace and added a […]
Ah, I see. Is the DESeq2 package in Galaxy a patched version, or just an older release? If the latter, can the recipe go in older-version subdirectories, as you've done in the other parts of the migration? That way others could easily find the older DESeq2 versions without having to know about a separate namespace or channel. I guess it boils down to either wgetting all specific dependencies in the build.sh, or adding them as pinned recipes. But, as you mention, you need to be able to pin versions in the bioconductor skeleton scripts to make this work easily. This would be a really helpful feature and it should be simple to add. If a version is specified on the command line, the skeleton script can grab it directly from bioaRchive (as long as that version is available).
I deleted my previous comments because I misunderstood something.
No, no patched version. Everything is the standard package, mirrored to various places. Because of this we created bioaRchive and the cargo-port. It's just a different mix of packages that we would like to keep available forever.
It already is in a subdir. Should I override the version you currently have with this one, increasing the build number?
I totally agree; it's just a little bit hard to do this for all the old packages we currently have, for little gain, because most of them are old.
Sure, I will add this. Especially if we create new packages primarily with conda and not with our own build system. The real problem comes into play with our strict versioning. The current R packages don't have this, which makes it easy to update them. But if you add a strict version requirement, you need to update all DESeq2 dependencies for a new version, and the dependencies thereof. In practice this means that with every update of DESeq2, all packages need to be updated (if you want the latest version of all dependencies). This is hard to do; that's the reason we are shipping such bundles in Galaxy. The entire bioaRchive project is related to this issue, but also this PR: bioarchive/aRchive_source_code#20
Yes, I think so. But do we want to have so many versions? My guess is that most of the small dependencies (+version) are only used once, for this specific package. Btw., I have the same problem with Python packages. If I depend on pbr ==1.8.0 and there is no package, should I add it to bioconda, wait until conda accepts a PR, or include it in the build.sh script? Requiring strict versioning is such a pain :(
Please see my other post. This is hard to achieve if you want strict versioning, and it will create a massive number of new packages. Moreover, it means we will have packages in bioconda that are already in conda with a different version. That said, it can be done - I'm just not sure it's worth the effort. Regarding your comments about the package manager: this is not so uncommon; remember your LaTeX packages. Debian also ships them in larger units, and when Fedora changed this there was a huge discussion. It just makes things unnecessarily hard, especially if you require strict versioning.
Sorry, I missed that the recipe is already in a subdirectory.
Metapackages might solve this. A metapackage could pin the exact versions of all required dependencies without touching the individual recipes.
I'm not familiar with the dependencies in Galaxy. To use the IRanges example, does this mean that different Galaxy bioconductor packages need different IRanges versions? That is, Galaxy is not frozen to a specific bioconductor release, but rather each package uses a different custom set of version dependencies?
The model we've been using is to add it to bioconda.
No kidding!
> Metapackages might solve this.

Yes, meta-package is probably the correct name.
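For concreteness, here is a hedged sketch of how such a meta-package could be produced with conda-build's `conda metapackage` command; the package name, version, and pins are illustrative, not taken from an actual recipe:

```sh
# Build an empty "pin-only" package whose sole purpose is to fix
# the versions of its dependencies (names and versions are examples).
conda metapackage ballgown-gx 1.0.3 \
    --dependencies "bioconductor-ballgown ==1.0.3" \
                   "bioconductor-genomicalignments ==1.2.1" \
                   "bioconductor-rtracklayer ==1.26.2"
```

Installing `ballgown-gx` would then force conda to resolve exactly those pinned versions.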
Galaxy has no concept of a BioC version. Every tool should be reproducible, so the DESeq2 Galaxy tool depends on a specific R version, with a specific BioC version, with ... hence the bioaRchive idea :)
Great, then I will add basic Python packages as well.
@daler @johanneskoester to exaggerate a little bit: you are voting to have every R package in every version in bioconda, is this correct? Would this even be possible from a technical point of view? Let's assume we have a package:
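As restated later in the thread (with `foo` and `bar` as placeholder package names), the dependency tree looks roughly like this:

```yaml
# deseq ==1.8.2 depends on:
- foo ==3.4    # which itself requires Rcurl ==2.0
- bar ==1.0    # which itself requires Rcurl ==2.1
```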
If conda can handle such cases, I guess it would be easier for us to integrate this directly into bioaRchive instead of doing it on demand and pushing regularly to bioconda. To defend this PR a little bit: this does not mean we will not have a recipe for the latest version as well. Thoughts?
So, in the example dependency conflict: how do you currently resolve this in Galaxy / bioaRchive? As for all versions of all packages… I believe the limiting factor is the storage quota of the bioconda channel. But it also depends on Johannes' vision of bioconda (storage depot/archive vs. latest packages).
BioaRchive is just storage and does not have this problem; it's a tarball archive. But the limit is also a technical one, if the example above really results in a conflict.
Yes :) I guess it boils down to whether we envision complete reproducibility here.
So, our original idea was that Bioconda should contain the latest versions and keep old versions unless we run out of space. I don't think we should now start adding all old versions of all R/BioC packages. Regarding the strict versioning: is there really a need for this? Anaconda does not seem to do it for its own R packages, and they work totally fine. I think conda won't be able to deal with the conflicting situation you describe. I am pretty sure even R itself is not capable of something like that, is it? I would prefer to add specific fixed version dependencies only if they are really needed (e.g. known API incompatibilities). Regarding your need for specific versions in Galaxy, meta-packages or uploading anaconda environments with fixed versions are the way to go. In both cases I would think a separate galaxy channel would be a good idea (only for the meta-packages or the environments, not for the packages themselves).
Then you can install from that channel to get the combination of versions you need from Bioconda.
As I said this was exaggerated :) and will not really happen, we would just need to add all packages that we currently maintain in the Galaxy community.
Oops, I thought this was OK because it's described in the README. I have added a lot of packages into subdirs over the last few days.
Depends on your needs. If you want reproducibility, yes. The question is usually how far you want to go, but I guess it's always nice to be able to reproduce a manuscript that used some old R 2.x packages, or old tools in general.
I thought so :(
Oh no, this is the reason we do all this stuff. R is a real nightmare in this regard, since even the tarballs disappear after some time.
You never know beforehand, and it's not really about API incompatibilities; it's about reproducible results.
Meta-packages only work if I put every single R package I need, in the specific version, into the channel, and we need the guarantee that they stay available.
See above; in this case we need more than the most recent version.
Is this only regarding R packages, or all software? I'm happy to open a Galaxy channel for R or Python, because these are libraries and what we do is a bit peculiar. How about general bioinformatics applications, though? The dependency hell you land in when you maintain older versions of software is not as bad in this case. There will be a lot more stuff added to your channel, and you may be uncomfortable with that in some ways - but it is a large community of developers who work hard to maintain recipes, many new users, and stronger claims of reproducibility that you will gain in the process.
Björn, John, Johannes and Ryan: could we solve the issue of reproducibility by having Galaxy install from conda dependency files? So for ballgown, instead of that giant recipe, you'd have the old packages built separately and do:
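A sketch of what such a dependency file and install step could look like; the file name and pinned versions are illustrative, not from an actual Galaxy setup:

```sh
# ballgown-galaxy.txt -- a flat, fully pinned dependency list (hypothetical):
#   bioconductor-ballgown==1.0.3
#   bioconductor-genomicalignments==1.2.1
#   bioconductor-rtracklayer==1.26.2

# create an isolated environment from exactly that pin set
conda create -y --name ballgown-gx -c bioconda --file ballgown-galaxy.txt
```

`conda create --file` takes one package spec per line, so the pin file itself carries the reproducibility guarantee rather than the recipes.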
and install via those, without needing to add the specific dependency versions to the meta.yaml file (which I agree would probably make conda start on fire). This way we don't need to maintain package descriptions for all versions for all time, since the pre-built old versions will remain available on anaconda.org for reproduction and we never need to build them again. We might need to work out storage with Continuum as we add more, but I assume they will accept money, understanding the noble purposes of Galaxy. You'll also need to do an insane amount of back-porting to get in all the old versions you want, but I know y'all like to torture yourselves, so that should be fun.
Hi Brad! I'm not sure I understand your proposal. Assuming that DESeq2 requires R 3.1, the following does not work, because of conflicting dependencies:
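The command in question was presumably something along these lines (all versions illustrative): a newer pinned R cannot be resolved together with a DESeq2 build made against R 3.1:

```sh
# hypothetical pins: deseq2 1.6.x needs R 3.1, so pinning R 3.2 at the
# same time gives the solver an unsatisfiable set of constraints
conda create -y --name deseq2-gx -c bioconda r==3.2.2 bioconductor-deseq2==1.6.3
```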
This was one of the reasons we decided to create such collection packages in Galaxy: we can be sure they work in the composition we ship them in (R version + package versions ...). In this regard it is not different from the Anaconda installer. Admittedly, this is made for users that want to use ballgown/deseq as a tool and not as a library, and I understand if this does not fit the bioconda philosophy. @daler @johanneskoester @chapmanb thanks for this discussion! I really appreciate it, and I see that both communities have a slightly different focus. It would be awesome if we could find some middle ground we are all happy with.
Depends on how dependencies in the recipe are specified. Say the ballgown recipe lists:

```yaml
# bioconductor-ballgown
- bioconductor-genomicalignments >=1.0
- bioconductor-rtracklayer
```

Then this metapackage would work:

```yaml
# ballgown-gx
- bioconductor-ballgown ==1.0.3
- bioconductor-genomicalignments ==1.2.1
- bioconductor-rtracklayer ==1.26.2
```

and so will this one:

```yaml
# ballgown-gx
- bioconductor-ballgown ==1.0.3
- bioconductor-genomicalignments ==1.2.0
- bioconductor-rtracklayer ==1.26.2
```

but not this one:

```yaml
# ballgown-gx
- bioconductor-ballgown ==1.0.3
- bioconductor-genomicalignments ==0.9
- bioconductor-rtracklayer ==1.26.2
```

Currently, the existing bioconductor recipes specify minimum dependency versions only if the original bioconductor package does. Since the author specified it, we assume the package won't work correctly if the version is too low. But as long as the pinned versions in the metapackage satisfy those minimum versions, you're good.

Hopefully that clarifies how things are currently set up. But all of this doesn't solve the original example conflict, where we want both Rcurl versions to be installed simultaneously so that each of deseq's dependencies gets the version it needs.
I think I'm missing something in this case, though - probably because I don't understand the internals of R library loading and how Galaxy is handling it. This seems like an unsolvable dependency. That is, in the Galaxy environment/repository capsule for the above environment, what Rcurl version am I using if I open R and load the library?
Yes, what Ryan and Brad describe is what I meant. There is no need to specify versions of R packages in dependency chains. A flat list of specific versions for all R packages will tell conda what to do. This flat list can be either a file (as seen above), an environment (as I proposed) or a meta-package. This is 100% reproducible, without making individual recipes less generic. I second Ryan's last question; that seems not to be possible in R.
@bgruening, @jmchilton, Galaxy having the need for a particular older version is totally fine as motivation for adding them as subdirectories. What I meant was that, in general, when a package is updated, we don't move the old version to a subdirectory, but rather rely on it staying available in our anaconda channel. That way, it stays installable without additional maintenance effort.
Ha, OK, I missed that - this makes sense, thanks for the explanation :) Actually this is what we tried initially in Galaxy, but it was not practical because you end up with a lot of packages you will never need; maybe this is different here. The maintenance overhead scares me a little bit in comparison to this simple PR. Before I try this: is there any way to offer packages for multiple R versions? This PR explicitly targets R 3.1, but I guess you want all these dependencies for the latest R as well. This is not super important, as it will only happen for migration packages, and if it is not possible we can have a galaxy-migrate channel or something similar. You don't want these meta-packages hosted here, right?

Leaving the R discussion for a moment, as this is really the worst case to discuss :) I really want to use bioconda for classical binaries. Can we keep the subdirectories, so that it is more obvious what is available, and make it possible to fix old recipes easily?

P.S.

```yaml
# deseq ==1.8.2 depends on:
- foo ==3.4
  - Rcurl ==2.0
- bar ==1.0
  - Rcurl ==2.1
```

This was an example showing that if you make a small mistake in setting the version pins, you end up with conflicting requirements.
Björn;
Thanks again for all the discussion.
Brad; Sorry for causing you all so much trouble. Most of my concerns are about long-term sustainability and migrating old packages.
I just thought that if I add packages for older versions, they should in addition also target newer versions of R, shouldn't they? This is not needed by us; it's more a general question of how to deal with this to support the community.
There are a lot of different reasons, also mentioned here: #612 (comment). From my experience a recipe is rarely perfect from the beginning, e.g.:
I know this is very specific to Galaxy, or to reproducibility in general, but if we care about it we should be able to fix old recipes easily.
Yes! But only for new packages, and only if nothing gets deleted and we can rely on anaconda.org.

```sh
conda create -y -c bioconda --name bx bx-python==0.7.3 numpy==1.9.2 pyyaml==3.11
```

This will install fine but creates an unusable bx-python. The reason is that bx-python was compiled against numpy 1.10, as this was the latest version in the conda channel at the time the bx-python package was created. If you run bx-python, Python will complain that this version was compiled against a different numpy version, which is true. It seems that even if you haven't specified a strict version in the meta.yaml definition, you have implicitly defined one, which prevents you from recombining different versions of packages. Even worse, if this is correct, we will not notice the error unless we run the tool.
I admit that this porting effort came too early and I caused you all too much pain. Maybe it's easier if we create our own repo for the old packages - something like an IUC-migrate channel - and store all this old cruft there. It's your call. What really worries me is the example above; I really hope I'm doing something silly here. If this is true, it means the order of submitting packages to conda determines the implicit dependencies, and if anaconda.org goes down we cannot replicate the packages. Moreover, the flat-file idea only works for a specific set of packages, not all. Thanks!
Actually, conda has a way of handling this properly. When you specify a build dependency with an "x.x" version (currently supported for numpy), conda-build pins the exact version present at build time into the package's metadata. Hence, we can easily fix the bx-python recipe to properly handle your example, @bgruening.
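A sketch of the relevant recipe fragment; this is an illustrative excerpt, not the actual bioconda bx-python recipe:

```yaml
# meta.yaml fragment (illustrative): "numpy x.x" tells conda-build to
# record the numpy version present at build time into the run requirements
requirements:
  build:
    - python
    - numpy x.x
  run:
    - python
    - numpy x.x
```

With this, a package built against numpy 1.10 requires numpy 1.10 at run time, so the solver refuses the broken combination instead of installing it silently.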
I have been looking into this. I think the place to start would be here. Currently, the mechanism for fixing a version from the build is limited here. A generalization would take the currently available version of each dependency, and add it to the metadata of the package if the dependency was specified with "x.x". |
@johanneskoester
This PR contains the needed changes in conda-build to generalize the mechanism. I hope it will get the attention of Continuum soon. |
I will close this. Thanks for the discussion :) |