
Add script to run full pipeline locally #2799

Merged: 4 commits from kurtwheeler/run-locally into dev on Jun 25, 2021

Conversation

kurtwheeler (Contributor)

Issue Number

#2775

Purpose/Implementation Notes

This adds a Python script that invokes bash scripts to run surveyor, downloader, and processor jobs for a given experiment accession (or an organism/Ensembl division in the case of transcriptome indices).

I also ended up doing some refactoring: the transcriptome jobs were failing because of an issue in needs_processing, and fixing that issue created a cyclical import. The right way to resolve it seemed to be splitting the job_lookup namespace, since it was really doing two things.
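
For illustration, here is a minimal sketch of how such a driver could be wired together. It is based on the function names visible in the diff below (parse_args, survey_accession, get_job_to_run) and the run_surveyor.sh/run_job.sh scripts discussed in the review; the wrapper script names, arguments, and the shape of the job dict are assumptions, not the merged implementation.

# Illustrative sketch only: script/argument names and the JSON shape returned
# by the management command are assumptions, not the merged code.
import json
import subprocess

def survey_accession(accession_code):
    # run_surveyor.sh rebuilds the surveyor image and queues surveyor jobs
    # for the given experiment accession (arguments here are hypothetical).
    subprocess.check_call(["./run_surveyor.sh", accession_code])

def get_job_to_run():
    # Ask a (hypothetical) wrapper around the get_job_to_be_run management
    # command for the next unstarted job; it prints JSON describing a
    # downloader job, a processor job, or {} when nothing is left.
    output = subprocess.check_output(["./run_manage.sh", "get_job_to_be_run"])
    return json.loads(output)

def run_job(job):
    # run_job.sh rebuilds the relevant image and runs one job to completion;
    # check_call raises CalledProcessError on a nonzero exit code.
    subprocess.check_call(
        ["./run_job.sh", job["image_name"], f"--job-name={job['job_name']}", str(job["id"])]
    )

if __name__ == "__main__":
    survey_accession("SRP012345")  # example accession
    job_to_run = get_job_to_run()
    while job_to_run:  # an empty dict means there is nothing left to run
        run_job(job_to_run)
        job_to_run = get_job_to_run()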

Types of changes

  • New feature (non-breaking change which adds functionality)

Functional tests

I've run the transcriptome and no_op jobs successfully, and I'm still running the others.

Checklist

  • Lint and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in downstream modules

@kurtwheeler requested a review from wvauclain on June 24, 2021 at 15:34
@@ -0,0 +1,22 @@
#!/bin/bash

# Reinitializes the database so there's no data or migrations run against it.
wvauclain (Contributor):
Fancy, I like it

wvauclain (Contributor) left a review:
Few small things

If there are no unstarted ProcessorJobs, then the most
recently created unstarted DownloaderJob will be prioritized.

The dict will have two top level keys: downloader_jobs and processor_jobs.
wvauclain (Contributor):
I think this line is outdated



def parse_args():
description = """This script can be used to run the full pipeline.
wvauclain (Contributor):
🔥
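
As a rough illustration of where that might go (the flag names here are guesses based on the PR description, not the actual ones):

import argparse

def parse_args():
    description = """This script can be used to run the full pipeline."""
    parser = argparse.ArgumentParser(description=description)
    # Hypothetical flags: per the PR description, the script takes an
    # experiment accession, or an organism and Ensembl division for
    # transcriptome index jobs.
    parser.add_argument("--accession-code", help="Experiment accession to survey and process.")
    parser.add_argument("--organism-name", help="Organism for a transcriptome index job.")
    parser.add_argument("--ensembl-division", help="Ensembl division for a transcriptome index job.")
    return parser.parse_args()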

elif job_name == "NO_OP":
image_name = "no_op"

completed_command = subprocess.check_call(
wvauclain (Contributor):
From here, it looks like check_call throws if the return code is nonzero, rather than returning the nonzero error code.

Also, does the output of this get sent to the terminal? That would be preferable, or at least somewhere that the output can be inspected.

kurtwheeler (Contributor, Author):
Oh, good catch. I guess I don't need to check the return codes.

The output does get sent to the terminal. It's a little chatty at the moment because the run_job.sh and run_surveyor.sh scripts rebuild the docker images every time, but it is nice having them take care of that for you as you iterate.

I think that maybe in the future we could have the pipeline script look at what docker images would need to be built and build them once before kicking off work, but that's a larger refactor to the way the scripts work. In this PR I just wanted to get the flow working with the existing scripts and it already ended up being nontrivial.
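
For reference, check_call's behavior in isolation: by default the child inherits stdout/stderr (so its output goes to the terminal), and a nonzero exit code raises CalledProcessError rather than being returned.

import subprocess

try:
    # "false" always exits with status 1; check_call raises instead of
    # returning the nonzero code, and any child output goes to the terminal.
    subprocess.check_call(["false"])
except subprocess.CalledProcessError as e:
    print(f"Command exited with return code {e.returncode}")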

image_name,
subcommand,
f"--job-name={job_name}",
# job_name,
wvauclain (Contributor):
I think you can get rid of these commented-out elements



def survey_accession(accession_code):
completed_command = subprocess.check_call(
wvauclain (Contributor):
Same comment as above


job_to_run = get_job_to_run()

while job_to_run:
wvauclain (Contributor):
You might want to make this while job_to_run is not None: and then explicitly return None in get_job_to_run() if we fall through all the if statements

kurtwheeler (Contributor, Author):
There's only one if statement in get_job_to_run. The get_job_to_be_run management command has several, but it has to return JSON, and None isn't valid JSON. Instead, get_job_to_be_run returns a dict representing a processor job, a downloader job, or an empty dict. An empty dict {} is both valid JSON and falsy in Python, so I think it works here.

Is there any advantage to using an explicit None here?

wvauclain (Contributor):
Good point, I didn't think of that. It's good as is then.
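
A small standalone sketch of the pattern settled on here, with a stubbed get_job_to_run standing in for the management command (the job dicts are placeholders):

import json

# Stand-in for the get_job_to_be_run management command, which prints JSON.
_pending = ['{"job_type": "downloader", "id": 1}', '{"job_type": "processor", "id": 2}']

def get_job_to_run():
    # Returns a job dict, or {} when nothing is left. {} is valid JSON and
    # falsy in Python, so the caller's `while job_to_run:` loop terminates
    # without needing an explicit None.
    return json.loads(_pending.pop(0)) if _pending else {}

job_to_run = get_job_to_run()
while job_to_run:
    print("running", job_to_run)
    job_to_run = get_job_to_run()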

@kurtwheeler merged commit 65b724c into dev on Jun 25, 2021
@kurtwheeler deleted the kurtwheeler/run-locally branch on June 25, 2021 at 14:46