Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: update local dev setup #253

Closed
wants to merge 2 commits into from
Closed

Conversation

ivan-aksamentov
Copy link
Member

@ivan-aksamentov ivan-aksamentov commented Dec 14, 2021

This refreshes the local dev setup now that we use snakemake (#231)

  • Revamp the Dockerfile. It is no longer needed for production, so I think it is dev-only. I could not find the source for the base image and we don't need it anyways.
  • Add helper script to run the container
  • Add missing snakemake package to Pipfile

This setups is what I used for #237

I cancelled GHA because this PR only affects local dev flow.

bin/docker-run Outdated
Comment on lines 11 to 19
declare -a config
config+=(
database_name="gisaid"
s3_src="s3://nextstrain-ncov-private/branch/${GITHUB_BRANCH}"
s3_dst="s3://nextstrain-ncov-private/branch/${GITHUB_BRANCH}"
fetch_from_database=false
trigger_preprocessing=false
keep_all_files: true
)
Copy link
Member Author

@ivan-aksamentov ivan-aksamentov Dec 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that, as before, because workflow interacts with prod data, it's extremely easy to mess up the daily ingest by just setting a wrong S3 url here. So be careful.

@ivan-aksamentov ivan-aksamentov requested a review from a team December 15, 2021 16:26
Comment on lines 28 to +34
# version becomes the new default, if this ncov-ingest image itself is still
# relevant at that time.
# -trs, 19 Jan 2020
FROM nextstrain/base:branch-python-base

# Install Python package for which Python 3.7 wheels do not yet exist on PyPI.
RUN apt-get update && apt-get install -y --no-install-recommends \
python3-netifaces \
time\
xz-utils
# Let's drop dependency on an image of unknown origin and start over.
# -ivan.aksamentov 2021-12-13
FROM python:3.7-buster
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused on historical context here, but maybe @tsibley's TODO comment can be removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The python-base has been merged, so I think we can just use nextstrain/base:latest.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A-ha! The mystery of a missing container image solved!
We don't need anything from the base image though, and pretty much anything other than python and pipenv.

Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for prompting this update Ivan! It's been a ~while since anyone has touched this Dockerfile.

Revamp the Dockerfile. It is no longer needed for production, so I think it is dev-only. I could not find the source for the base image and we don't need it anyways.

The GitHub actions use this docker image so it's not just dev-only, but I definitely agree it should be updated. I think we can use nextstrain/base:latest with additional dependencies on top.

Add helper script to run the container

For the bin/docker-run script, I think it would be best to mimic how the pipeline runs in production by using the nextstrain cli (this might also just be me not wanting to deal directly with docker 😅 )

Add missing snakemake package to Pipfile

The pipeline has been running fine without explicitly adding snakemake here...(Aha, snakemake is included in the nextstrain base image)

@ivan-aksamentov
Copy link
Member Author

@joverlee521 Oh no! I overlooked that it's used in prod. In this case, nothing here makes sense.

I deal with ingest a lot lately and it's not a great dev experience, so I sketched this minimal thing that runs and a more convenient dev flow, where I can modify things quickly. I could not even build the old image, because it uses apt repos from Debian version that is too old. I also don't know how the docker option of nextstrain cli works. Only used the AWS Batch path so far. I might try it sometimes and see how it goes.

@victorlin
Copy link
Member

Still worthwhile to change nextstrain/base:branch-python-base to nextstrain/base:latest though?

@ivan-aksamentov
Copy link
Member Author

ivan-aksamentov commented Dec 16, 2021

@victorlin Probably. Might also make dockerignore more restrictive like here. It is currently copying the stuff from the repo into the build context, including source-data/ which is heavy, while only 2 pipenv files are needed during build.

ivan-aksamentov added a commit that referenced this pull request Dec 17, 2021
This:

 - Removes duplicated nextclade downloads. Nobody knows wha version they are and when they are updated. Latest version of nextclade is already downloaded in the `./bin/run-nextclade` and `./bin/run-nextclade-fill` scripts. We want it to always be cutting edge. Nextclade releases are frequent. We may consider bringing it back into the image later if we setup automated rebuild of the image on Nextclade release. However I also often play with different custom versions of Nextclade in full runs and having other versions duplicated versions may cause mistakes. Mixed feelings here.

 - Adds various tools we might want to play with on branches (related to #247 #256):
    - aria2 - for downloading on multiple connections
    - lbzip2  - for parallel unbzipping
    - pigz - for parallel gz/ungz
    - pixz - for parallel xz/unxz

 - Changes base image to as suggested in #253. We don't use much out of it, so I am not super convinced it's needed, but not a big deal.

Current inconvenience is that for any changes in deps on branches (e.g. if I want to play with aria2c and lbzip2) we need to change the container on master and/or publish an image and change the tag in the GitHub Action on master. Which both disrupt production. We might consider installing deps dynamically on AWS batch instead. I tired and it failed because currently the debian repo cache in the image is too old. Or maybe there's a better way to play with new tools right now? In any case, mixing prod and dev as it is now seems weird and unproductive, we might think about a dedicated, simpler dev flow.
@victorlin victorlin deleted the chore/local-dev branch February 20, 2024 19:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants