Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor gatk germline (resolves #280) #335

Merged
merged 29 commits into from
Sep 24, 2016

Conversation

jpfeil
Copy link
Contributor

@jpfeil jpfeil commented Jun 30, 2016

@jpfeil jpfeil force-pushed the issues/280-refactor-germline-pipeline branch 6 times, most recently from e86be03 to 16bdfaf Compare July 1, 2016 22:57
@jpfeil jpfeil force-pushed the issues/280-refactor-germline-pipeline branch from b8bcde7 to b8fa460 Compare July 15, 2016 04:52
@jvivian
Copy link
Collaborator

jvivian commented Jul 15, 2016

@jpfeil — Ping me when tests pass and I'll review

@jpfeil jpfeil force-pushed the issues/280-refactor-germline-pipeline branch from dc0bb84 to 173d82d Compare July 16, 2016 22:17
@hannes-ucsc
Copy link
Contributor

hannes-ucsc commented Jul 18, 2016

@jpfeil do you want us to review this? If not, please set the "needs work" label.

Come review time, I will also ask for the commit titles to all be prefixed in such a way that they connote the pipeline they refer to.

@jpfeil jpfeil force-pushed the issues/280-refactor-germline-pipeline branch from 2b2cca5 to 7cc21a9 Compare July 19, 2016 02:12
@jpfeil jpfeil force-pushed the issues/280-refactor-germline-pipeline branch from 14a268a to d8c80ab Compare August 2, 2016 23:59
@jpfeil jpfeil force-pushed the issues/280-refactor-germline-pipeline branch 3 times, most recently from af3b95e to da423fa Compare August 3, 2016 21:48
@fnothaft
Copy link
Contributor

fnothaft commented Aug 4, 2016

@jpfeil Just dropping a reminder to add a list with the required/complete features to this PR.

@jpfeil
Copy link
Contributor Author

jpfeil commented Aug 4, 2016

  • Support downloading files from synapse
  • Add BWA alignment
  • Support preprocessing only
  • Refactor pre-processing using common library functions
  • Add hard filtering for runs with fewer than 30 samples
  • Support cohort-level VQSR
  • Annotate clinical variants with Oncotator

@fnothaft
Copy link
Contributor

fnothaft commented Aug 4, 2016

@jpfeil please make the list more specific. What are the changes you are making? What features were requested to be added, removed, changed, etc. E.g., instead of: "Hard filtering", say "Hard filter for sites covered by fewer than 10 reads". Specifics!

@fnothaft
Copy link
Contributor

fnothaft commented Aug 5, 2016

So this PR greatly expanded in scope from what we discussed at the start: eliminate dupe code between germline and somatic pipelines, focus on validation. We need this to merge for the ADAM recompute, so I'd like to triage the Oncotator/annotation and Synapse bits out to a follow-on PR, and set a deadline for validating this code and merging the PR. Specifically, I'd like to propose freezing the features, wrapping up development by Tuesday of next week, and shooting to validate the pipeline on a single sample by the Tuesday after.

I think right now, if we triage the annotation/Synapse work out to another PR, that leaves joint calling as the last feature. At a cursory glance, that code is implemented already in this PR. I am concerned about the implementation—the GVCFs we are using for SGDP are O(25GB), so joint calling on a single node is out of the question—@jpfeil, what is the plan for addressing that issue?

I'm proposing the following action items:

  • @fnothaft: make review pass by end-of-day today
  • @jpfeil: open ticket to triage out annotation/Synapse work
  • @jpfeil: write up validation strategy and timeline on this PR (can you do this in time for Monday scrum?)
  • @jpfeil: explain joint calling architecture and how the joint calling process will get around excessive data volumes
  • @fnothaft @jvivian @jpfeil: go through together and audit resource requirements passed to GATK stages
  • @jpfeil: factor out variant filtration calls to toil_scripts.tools library

Does this proposal seem reasonable? CC @briandoconnor for review/work planning.

@jpfeil
Copy link
Contributor Author

jpfeil commented Aug 5, 2016

Okay that sounds like a good plan. I will add a modified job function that handles running a single sample through the pipeline with VQSR.

@fnothaft
Copy link
Contributor

fnothaft commented Aug 5, 2016

Shouldn't the single sample case fall under the hard filtering? Also, I thought we had an override that allows running <30 samples through VQSR.

@@ -108,10 +108,10 @@ def preprocessing_declaration(job, config):
job.fileStore.logToMaster('Ran preprocessing: ' + config.uuid)
disk = '1G' if config.ci_test else '20G'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a change here, but I've seen OOD failures with the preprocessing pipeline in the wild. I'll create a note for you me and @jvivian to go through this and audit the disk requirements before merging.

@jpfeil
Copy link
Contributor Author

jpfeil commented Aug 5, 2016

I can add a check to see if there are enough variants in the sample before running VQSR. Can you send me a description of your samples?


elif {ext1, ext2} & {'.bam'}:
Copy link
Contributor

@fnothaft fnothaft Aug 5, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add SAM/CRAM too?

@jpfeil jpfeil force-pushed the issues/280-refactor-germline-pipeline branch from d59dcd2 to 94f20c6 Compare September 22, 2016 19:12
@jpfeil jpfeil force-pushed the issues/280-refactor-germline-pipeline branch 2 times, most recently from bf725da to 3457393 Compare September 22, 2016 19:52
@jpfeil jpfeil force-pushed the issues/280-refactor-germline-pipeline branch from 3457393 to be3f011 Compare September 23, 2016 00:07
@jpfeil
Copy link
Contributor Author

jpfeil commented Sep 23, 2016

The test passed @jvivian

@@ -0,0 +1,32 @@
#!/usr/bin/env python2.7
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this in its own .py?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To prevent circular imports

:param work_dir: working directory
:param ids: shared file promises, dict
:param filenames: remaining arguments are filenames
class GermlineSample(namedtuple('GermlineSample', 'uuid url paired_url rg_line')):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand this — Why not just use a Namespace object for every sample?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is clearer because you can link to this class in docstrings and it documents the requirements for a sample in one location.

disk=PromisedRequirement(lambda x: x.size, annotated_vcf.rv()))


#####################################
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hannes does not like special exceptions for comments, either use one # or triple quotes if you want to make it stand out.

else:
bwa_config.r1 = input1.rv()

# Use first URL to deduce paired FASTQ URL, if url looks like a paired file.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think this is a good heuristic. I see a lot of unpaired samples that have R1.fq or 1.fq in them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay that's a good point

@@ -0,0 +1,146 @@
import textwrap
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe change to germline_config_manifest.py ?

@@ -0,0 +1,137 @@
#!/usr/bin/env python2.7
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this in its own .py file? It seems like this and the other file either belong in toil-lib or should be in the germline source.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's easier to read if the independent components of the pipeline are in separate modules.

@jvivian jvivian merged commit 2d7e5bc into master Sep 24, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor germline pipeline using common lib
4 participants