Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestion: Package scripts in a Conda repository #2

Open
bebatut opened this issue Oct 9, 2024 · 20 comments
Open

Suggestion: Package scripts in a Conda repository #2

bebatut opened this issue Oct 9, 2024 · 20 comments

Comments

@bebatut
Copy link

bebatut commented Oct 9, 2024

Hi there!

I’ve been using the scripts in this repository, and they’ve been very helpful. I wanted to suggest packaging the scripts into a Bioconda repository to make them easier to install and manage, especially for users who want a consistent and reproducible environment.

By providing a Conda package, it would be much easier for us to install the scripts and keep them up to date. I'd be happy to help if needed!

Thanks for considering this suggestion!

Bérénice

@bluenote-1577
Copy link
Owner

hi @bebatut

I have had this request before, but it's a bit weird of a packaging situation.

The main contribution in this repo is (1) scripts, which can be conda packaged easily, and (2) a set of metadata files. It's a bit awkward to package metadata files in a conda repo, because you'll have to access them e.g. python script.py METADATA.TSV, but conda obfuscates where your files are located.

To me, it's simpler to have it all in a repo, where it's obvious where your metadata files are located.

Let mek now what you think.

@bgruening
Copy link

I agree, the metadata files are large and should not be shipped with a package. There are a few possibilities other packages with a similar problem are doing:

  • using a post-processing hock in conda to download the "data" during tool installation
  • offering a download option as part of the tool, e.g. python script.py --remote-metadata URL
  • using an ENV variable to point to the "data" and let the user deal with downloading the "data" on its own

Do you think anyone of those will work?

@bluenote-1577
Copy link
Owner

@bgruening

These sound reasonable, but I'm trying to understand the motivation for this. Why is not cloning the repo sufficient, or something like wget https://github.com/bluenote-1577/sylph-utils/raw/refs/heads/main/virus/IMGVR_4.1_metadata.tsv.gz for downloading single metadata files?

@jfy133
Copy link

jfy133 commented Nov 28, 2024

@bluenote-1577 sorry to chime in, but it's a good question and a colleague of mine actually already started trying to make a conda recipe before seeing this issue as we are also looking to make a wrapper (but for nextflow).

@bgruening is by far the expert here when it comes to conda, but from my experience forcing data to come with software in a conda recipe can be really annoying and a blocker for some people (even if it is small!). When it comes to (large) data people often want to place this in a cache location to allow re-use across different contexts, and/or allows you to delete a conda environment and start again if something breaks.

Also having to include the data within the conda package means this is much larger and bloats conda's own cache on a persons machine, and can sometimes for people with slow internet connection can take much longer (where e.g. one person could download the data to a cache, and then everyone else can quickly install the software in their own environments as required, and not have to download the data every single time as they can share)

Another benefit of allowing placing data outside a conda environment itself is also for developers, as during CI testing they can download the data once and then for each test create an environment of just the software (which is fast if it doesn't come with data), thus speeding up the tests.

In that regard, if I would have a say in the matter, I would vote for option two or three of @bgruening 's suggestions :)

@jfy133
Copy link

jfy133 commented Nov 28, 2024

In this case, as your metadata files are small, the ENV variable would be easiest, then you could include the metadata files during the clone, and maybe pre-set the variable within the conda environment to point to wherever the files are in the conda environment

@bluenote-1577
Copy link
Owner

@jfy133 Thanks for helping out.

I'm not an expert that conda, so from trying to understand your comments, what seems to be a good approach is to

  1. Set up a conda recipe for this repository that:

  2. Pre-set the some sort of ENV variable pointing to some conda environment directory e.g. /home/user/miniforge3/envs/my_env/sylph_metadata_files in build.sh (we can use the CONDA_PREFIX env variable too?)

  3. Copy the metadata to the sylph_metadata_files for the build.sh

  4. Modify the scripts to look at the ENV variable directory so the user can just type sylph_to_taxprof.py -s sylph_output.tsv -m gtdb_r220

I know that @bgruening suggested not packing in the metadata files into the conda package, but they are relatively small ( ~50MB for all, I think)... let me know what you think.

@bgruening
Copy link

I don't think 50MB is not small. Those packages will end up in all versions of the package, for all different architectures in all containers etc.
You could put your metadata on zenodo, get a versioned DOI for it. Your program would need to be extended to support a arbitrary location of the metadata, but I hope this is easy to implement?

@bluenote-1577
Copy link
Owner

bluenote-1577 commented Dec 2, 2024

@bgruening Fair enough.

Here is my plan for the future, that I will get around to in the next two months:

  1. Repack the scripts in this repository into a single python command with subcommands

e.g. sylph-tax taxonomy, sylph-tax download, sylph-tax merge for getting taxnonomies, merging, and downloading databases.

  1. sylph-tax download will download data index to ~/.local/share (or the corresponding folders for Mac/Windows) or a user specified location.

  2. Put DBs into some hosting service. Perhaps zenodo may work, it's a bit annoying to update though.

  3. Pack the scripts (without DBs) into a conda recipe.

I will probably create a new repository called sylph-tax since alll of these scripts relate to taxonomy -- also, I want to leave this repository as-is for backwards compatibility.

@bluenote-1577
Copy link
Owner

@jfy133 @bgruening just an update on this

I had some time and managed to get a working version. My new repository is https://github.com/bluenote-1577/sylph-tax.

I used ~/.config/sylph-tax/config.json to "point" the data to the downloaded location. The same metadata files can still be manually used. If you have any issues, let me know.

My conda pull request is bioconda/bioconda-recipes#52683. Il'l update documentation after it gets approved.

@bgruening
Copy link

Super cool! I will have a more detailed look tomorrow.

Do you see any way to support the database path via CLI argument? Or how are users supposed to create the config.json. Please note that is should be possible for workflows to influence this, without manual intervention. And to make things more complicated there are HPC environments without a HOME dir ... (https://academic.oup.com/gigascience/article/8/5/giz054/5497810)

Thanks a lot!

@bluenote-1577
Copy link
Owner

bluenote-1577 commented Dec 9, 2024

@bgruening

Hmmm, I wasn't aware that some HPC environments don't have a HOME...

Currently:

  • sylph-tax would create ~/.config/sylph-tax/config.json automatically.
  • If it can not, then the program proceeds without a config file
  • The "download" command will download files to a folder and populate the config.json.

For the main command, sylph-tax taxprof looks at config.json's downloaded path. If config.json can not be created, the user can manually input the downloaded files. But this would require slightly different arguments. E.g. instead of sylph-tax taxprof -t GTDB_r220 (with config) it would be sylph-tax taxprof files/gtdb_r220_metadata_files.tsv.gz.

  • Future: I will add an argument that points to downloaded files if a .json can not be created in $HOME (since I agree this would be the correct thing to do). Edit: Maybe not see comment below.

@jfy133
Copy link

jfy133 commented Dec 10, 2024

I think this might be another case for having an environment variable as argued about which is what sylph-tax uses for locating a config.json. By default this could pointing in the home directory, but a user could override this location before it's even generated.

But the second example of manually specifying the TSV would equally work for me.

@bluenote-1577
Copy link
Owner

bluenote-1577 commented Dec 11, 2024

I think this might be another case for having an environment variable as argued about which is what sylph-tax uses for locating a config.json. By default this could pointing in the home directory, but a user could override this location before it's even generated.

But the second example of manually specifying the TSV would equally work for me.

yes this is a good idea. I have now implemented the ability to set the config to an ENV var called SYLPH_TAXONOMY_CONFIG.

To summarize:

  1. sylph-tax looks at $SYLPH_TAXONOMY_CONFIG on startup. If this is not set, sylph-tax uses ~/.config/sylph-tax/config.json.
  2. sylph-tax will create SYLPH_TAXONOMY_CONFIG (or ~/.config...) if not present.
  3. sylph-tax download will modify SYLPH_TAXONOMY_CONFIG.
  4. sylph-tax taxprof will look at SYLPH_TAXONOMY_CONFIG to see where to look for downloaded files if fed an argument like GTDB_r220.
  5. Worst-case: the user can feed in the downloaded file as downloaded_files/gtdb_r220_metadata.tsv.gz. This requires no SYLPH_TAXONOMY_CONFIG but breaks abstraction.

LMK what you think; specifically, if you still want ability to do sylph-tax taxprof --MY_DOWNLOADED_FOLDER folder ... instead of setting SYLPH_TAXONOMY_CONFIG to somewhere you have write access.

@jfy133
Copy link

jfy133 commented Dec 11, 2024

I think that works for me!

With the variable you just need to set it once and your done, I think it unlikely you would need to move the config file around so no need to specify via an explicit argument (given you can just supply the tar directly if you really need).

@bluenote-1577
Copy link
Owner

bluenote-1577 commented Dec 12, 2024

@jfy133 bioconda/bioconda-recipes#52746 is done with the environment variable capabilities. I tested and it seems to be working on conda now.

@jfy133
Copy link

jfy133 commented Jan 6, 2025

Wonderful, thank you very much @bluenote-1577 !

@sofstam, you can try it out now for the nf-core module :D, and if it works please report back :)

@sofstam
Copy link

sofstam commented Jan 6, 2025

Thank you very much @bluenote-1577!

@sofstam
Copy link

sofstam commented Jan 30, 2025

Hi @bluenote-1577, I have started creating the nf-core module for sylph-tax and I noticed there is no flag for the version. Is it possible to add it?

@bluenote-1577
Copy link
Owner

@sofstam BTW, would it be possible to redirect this issue to https://github.com/bluenote-1577/sylph-tax? That is the new repo.

I just added a version flag to sylph-tax, so you can run sylph-tax --version (if that's what you meant). The release should be also associated with a version.

I updated to v1.1.1 for sylph-tax at https://github.com/bluenote-1577/sylph-tax. Conda may take a little bit to update.

@sofstam
Copy link

sofstam commented Jan 31, 2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants