Add script to run full pipeline locally #2799
Conversation
@@ -0,0 +1,22 @@
#!/bin/bash

# Reinitializes the database so there's no data or migrations run against it.
Fancy, I like it
A few small things
If there are no unstarted ProcessorJobs, then the most
recently created unstarted DownloaderJob will be prioritized.

The dict will have two top level keys: downloader_jobs and processor_jobs.
I think this line is outdated
def parse_args():
    description = """This script can be used to run the full pipeline.
🔥
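The rest of parse_args isn't shown in this excerpt; a minimal sketch of how it might look with argparse, using a hypothetical positional accession_code argument (the real script's options may differ):

```python
# Sketch only: the argument names below are assumptions for illustration,
# not the script's actual interface.
import argparse

def parse_args():
    description = """This script can be used to run the full pipeline."""
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument(
        "accession_code",
        help="Experiment accession (or organism/Ensembl division for transcriptome indices).",
    )
    return parser.parse_args()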
scripts/run_full_pipeline.py
Outdated
elif job_name == "NO_OP":
    image_name = "no_op"

completed_command = subprocess.check_call(
From here, it looks like check_call throws if the return code is nonzero, rather than returning the nonzero error code.
Also, does the output of this get sent to the terminal? That would be preferable, or at least somewhere that the output can be inspected.
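For reference, the difference looks like this (standard library behavior, not code from this PR):

```python
# check_call raises CalledProcessError on a nonzero exit code, while call
# returns the exit code and leaves it to the caller to check.
import subprocess

try:
    subprocess.check_call(["false"])          # exits with status 1, so this raises
except subprocess.CalledProcessError as err:
    print("check_call raised, returncode =", err.returncode)

print("call returned", subprocess.call(["false"]))  # prints 1 instead of raising
```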
Oh good catch. I guess I don't gotta check the return codes.
The output does get sent to the terminal. It's a little bit chatty at the moment because the run_job.sh and run_surveyor.sh scripts rebuild the docker images every time, but it is nice having them take care of that for you as you iterate.
I think that maybe in the future we could have the pipeline script look at what docker images would need to be built and build them once before kicking off work, but that's a larger refactor to the way the scripts work. In this PR I just wanted to get the flow working with the existing scripts and it already ended up being nontrivial.
scripts/run_full_pipeline.py
Outdated
image_name,
subcommand,
f"--job-name={job_name}",
# job_name,
I think you can get rid of these commented-out elements
scripts/run_full_pipeline.py
Outdated
def survey_accession(accession_code):
    completed_command = subprocess.check_call(
Same comment as above
job_to_run = get_job_to_run()

while job_to_run:
You might want to make this while job_to_run is not None: and then explicitly return None in get_job_to_run() if we fall through all the if statements.
There's only one if statement in get_job_to_run. The get_job_to_be_run management command has several, but it has to return JSON, and None isn't valid JSON. Instead get_job_to_be_run returns a dict representing a processor job, a downloader job, or an empty dict. An empty dict {} is both valid JSON and falsey in Python, so it works here I think. Is there any advantage to using an explicit None here?
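A quick illustration of the point about {} (standard Python/JSON behavior, not code from this PR):

```python
# {} is valid JSON and falsey in Python, so the management command can return it
# to mean "nothing to run" and the while loop above terminates naturally.
import json

print(json.loads("{}"))              # {} round-trips through JSON just fine
print(bool({}))                      # False: an empty dict is falsey
print(bool({"job_name": "NO_OP"}))   # True: a real job dict keeps the loop going
```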
Good point, I didn't think of that. It's good as is then.
Issue Number
#2775
Purpose/Implementation Notes
This adds a Python script that invokes bash scripts to run surveyor, downloader, and processor jobs for a given experiment accession (or organism/Ensembl division in the case of transcriptome indices).
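For reference, a minimal self-contained sketch of that flow, pieced together from the snippets in the diff above (the helper bodies and job names here are stand-ins, not the PR's actual code):

```python
# Hypothetical, self-contained sketch of the flow; the real script shells out
# to run_surveyor.sh and run_job.sh via subprocess rather than using these stubs.
def survey_accession(accession_code):
    print(f"surveying {accession_code}")  # stand-in for the surveyor bash script


_fake_queue = [{"job_name": "DOWNLOADER"}, {"job_name": "NO_OP"}]  # illustrative job names


def get_job_to_run():
    # The real helper asks the get_job_to_be_run management command for JSON;
    # an empty dict means there is nothing left to do.
    return _fake_queue.pop(0) if _fake_queue else {}


def run_job(job):
    print(f"running {job['job_name']}")  # stand-in for run_job.sh


survey_accession("GSE12345")  # hypothetical accession code
job_to_run = get_job_to_run()
while job_to_run:
    run_job(job_to_run)
    job_to_run = get_job_to_run()
```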
I also ended up doing some refactoring because the transcriptome jobs were failing due to an issue in needs_processing. Fixing that issue ended up creating a cyclical import, and the right way to fix that seemed to be to split the job_lookup namespace because it was really doing two things.
Types of changes
Functional tests
I've run transcriptome and no_op smoothly and I'm running others still.
Checklist