Helixer official docker not finding its own python modules when running in AWS through nextflow #154

Closed
MatteoSchiavinato opened this issue Nov 11, 2024 · 6 comments

MatteoSchiavinato commented Nov 11, 2024

Describe the bug
I have successfully run Helixer locally using the prebuilt Docker image, without a GPU. When I run the same image on AWS through Nextflow, it fails with:

  Traceback (most recent call last):
    File "/home/helixer_user/Helixer/scripts/fetch_helixer_models.py", line 5, in <module>
      from helixer.core.data import prioritized_models, fetch_and_organize_models
  ModuleNotFoundError: No module named 'helixer'

To Reproduce
This is my nextflow process:

process HELIXER {

    input:
    tuple \
    val(alias),
    path(fasta)

    val species
    val lineage

    output:
    tuple \
    val(alias),
    path(fasta),
    path("${alias}.gff3"), emit: models

    script:
    """
    set -e 
    set -o pipefail 

    # change directory so that modules can be found properly 
    cd /home/helixer_user/Helixer

    # download models 
    /home/helixer_user/Helixer/scripts/fetch_helixer_models.py

    # add helixer path to pythonpath 
    export PYTHONPATH="/home/helixer_user/Helixer:\${PYTHONPATH}"

    # run helixer 
    /home/helixer_user/Helixer/Helixer.py \
    --compression lzf \
    --fasta-path ${fasta} \
    --gff-output-path ${alias}.gff3 \
    --lineage ${lineage} \
    --min-coding-length 60 \
    --species ${species}
    """
}

To reproduce the behavior, run this process with the unmodified prebuilt image as its container.

Error
The error is the same traceback as shown under "Describe the bug" above, ending in:

  ModuleNotFoundError: No module named 'helixer'

The environment apparently cannot find the helixer directory, from which fetch_helixer_models.py imports prioritized_models and fetch_and_organize_models (via helixer.core.data). This happens even if I change into the directory that contains this folder, or add that directory to PYTHONPATH (see the Nextflow process above).
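One way to narrow this down is to run the same check inside the container locally and on AWS, and compare which HOME the job sees and where Python looks for the helixer package. The following is a minimal sketch using only the standard library; it is not part of Helixer, and the function name describe_module is hypothetical.

```python
import importlib.util
import os
import site
import sys


def describe_module(name):
    """Collect facts about how this interpreter would resolve a top-level module."""
    spec = importlib.util.find_spec(name)  # None if the module cannot be found
    return {
        "module": name,
        "found": spec is not None,
        "origin": getattr(spec, "origin", None),
        "home": os.environ.get("HOME"),
        "pythonpath": os.environ.get("PYTHONPATH"),
        "user_site": site.getusersitepackages(),
        "user_site_enabled": site.ENABLE_USER_SITE,
        "sys_path": list(sys.path),
    }


if __name__ == "__main__":
    # "helixer" is the module named in the traceback.
    for key, value in describe_module("helixer").items():
        print(f"{key}: {value}")
```

Differences between the two environments in home, user_site, or sys_path would point to where the import breaks.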

Note that this behavior is not observed when running locally, only on AWS. The combination of AWS and Nextflow is known to cause problems when something is installed in /usr: chances are it won't be found when running in the cloud (not always, but frequently).

Environment
The official Helixer Docker image, run on AWS through Nextflow, without a GPU.

Suggested fix
I am adding a suggestion for a fix. I haven't fully tested it, but it should work if everything (including the CUDA drivers) is installed outside of /usr, preferably in /opt.

Here are the changes I made to the Dockerfile:

FROM gglyptodon/helixer-docker:helixer_v0.3.3_cuda_11.8.0-cudnn8
USER root 
WORKDIR /opt
RUN mkdir -p /opt/Helixer_inst \
    && mv /home/helixer_user/* /opt/Helixer_inst/
WORKDIR /opt/Helixer_inst/Helixer
ENV PYTHONPATH="/opt/Helixer_inst/Helixer:${PYTHONPATH}"
CMD ["bash"]

The same goes for the Python packages: it is important to install them outside of /usr/local, e.g. directly under /usr.

I have solved at least that issue by setting this in the Dockerfile:

ENV PYTHONUSERBASE=/usr 
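
For context, PYTHONUSERBASE changes the base directory that Python uses for the per-user site-packages, which is also where pip install --user writes. The sketch below assumes nothing about the Helixer image; the helper name user_base_with is hypothetical. It starts a child interpreter with the variable set and reports the user base and user site-packages directory that result.

```python
import os
import subprocess
import sys


def user_base_with(base):
    """Return (user base, user site-packages) as seen by a child interpreter
    started with PYTHONUSERBASE set to `base`."""
    env = dict(os.environ, PYTHONUSERBASE=base)
    code = "import site; print(site.getuserbase()); print(site.getusersitepackages())"
    out = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return out[0], out[1]


if __name__ == "__main__":
    user_base, user_site = user_base_with("/usr")
    print("user base:", user_base)
    print("user site-packages:", user_site)
```

Running this inside the image shows which directory the interpreter would treat as the user site for a given setting.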
@felicitas215
Collaborator

Hi,
thank you for reporting this issue. Can you try setting the Python path before invoking the first Python script?
Like this:

# add helixer path to pythonpath 
export PYTHONPATH="/home/helixer_user/Helixer:\${PYTHONPATH}"

# download models 
/home/helixer_user/Helixer/scripts/fetch_helixer_models.py
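
The order matters because PYTHONPATH is read when the interpreter starts, so it has to be exported before the script is launched. Changing directory alone does not help either, since a script's own directory, not the current working directory, is placed first on sys.path. Here is a self-contained sketch of the first point, using a throwaway package (helixer_demo_pkg, a hypothetical name) instead of Helixer:

```python
import os
import subprocess
import sys
import tempfile


def can_import(module, extra_path=None):
    """Start a fresh interpreter and report whether `module` can be imported,
    optionally with `extra_path` as PYTHONPATH."""
    env = dict(os.environ)
    env.pop("PYTHONPATH", None)  # start from a clean slate
    if extra_path is not None:
        env["PYTHONPATH"] = extra_path
    result = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        env=env, capture_output=True, cwd=tempfile.gettempdir(),
    )
    return result.returncode == 0


def demo():
    with tempfile.TemporaryDirectory() as root:
        # Create root/helixer_demo_pkg/__init__.py, mimicking a source checkout.
        pkg_dir = os.path.join(root, "helixer_demo_pkg")
        os.makedirs(pkg_dir)
        open(os.path.join(pkg_dir, "__init__.py"), "w").close()
        return can_import("helixer_demo_pkg"), can_import("helixer_demo_pkg", root)


if __name__ == "__main__":
    without_path, with_path = demo()
    print("importable without PYTHONPATH:", without_path)
    print("importable with PYTHONPATH:", with_path)
```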

@MatteoSchiavinato
Author

That was my first attempt, but it wasn't successful.

@MatteoSchiavinato
Author

Here's a Dockerfile I used to build an image that meets all the criteria above. I'm attaching it in case any of it is useful to you. It produces a very heavy image (21.9 GB; your original is 7.74 GB), so it would need a lot of improvement, perhaps a multi-stage build, to reduce its size. Still, I thought it might be useful.

Dockerfile.txt

@MatteoSchiavinato
Author

UPDATE: I've managed to reduce the image to about 12 GB, based on your new commit and some other changes. Here it is (attached).

Dockerfile_v2.txt

@felicitas215
Collaborator

Could you try adding RUN pip install --upgrade pip to your Dockerfile? There seems to be an issue on Ubuntu 22.04 where the package name and version are not recognized, so the package is installed as UNKNOWN 0.0.0 instead of, for example, helixer-0.3.4. (Link to the bug report: pypa/setuptools#3269).
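
To check whether an existing image is affected, one can look at the installed distribution metadata. This is a sketch under the assumption that the package is registered under the distribution name helixer (inferred from helixer-0.3.4 above); installed_version is a hypothetical helper, and importlib.metadata requires Python 3.8 or newer.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # If the project metadata was not read correctly at install time, the
    # package shows up as "UNKNOWN" (version 0.0.0) instead of its real name.
    for name in ("helixer", "UNKNOWN"):
        print(f"{name}: {installed_version(name)}")
```

If helixer reports None while UNKNOWN reports 0.0.0, the image was most likely built with the affected tooling.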

@felicitas215
Collaborator

I'm closing this issue because there has been no activity for one month. Feel free to reopen it if necessary.
