Genotype concordance workflows #414

Merged (7 commits) on Sep 27, 2022
Conversation

mwalker174 (Collaborator):
Adds two new WDLs:

  1. JoinRawCalls - merges raw variant calls across batches from ClusterBatch outputs.
  2. SVConcordance - evaluates genotype concordance of an "eval" VCF against a "truth" VCF.

The second workflow is intended to be run with a "cleaned" eval VCF and a joined raw-call truth VCF from the same cohort. The output VCF can then be used for filtering with the TrainGqRecalibrator and RecalibrateGq workflows as the annotations_vcf input, along with the appropriate annotations_to_transfer list (TBD).

Note that this depends on a pending PR in gatk, so I've added a temporary concordance gatk docker.

Also includes necessary updates to the VCF formatting script. Currently the concordance tool does not handle CPX/CTX, so those record types are filtered out. As a result, the concordance-annotated VCF should NOT be carried forward other than as an input to the GQ recalibrator workflows.

Tested and updated both new workflows with inputs generated from the corresponding templates.
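The CPX/CTX exclusion described above can be sketched as a simple predicate. This is a pysam-free illustration only: the real formatting script operates on pysam VariantRecord objects, and the names here (`UNSUPPORTED_SVTYPES`, `keep_record`) are hypothetical, not the script's API.

```python
# SV types the concordance tool cannot handle (per this PR): complex (CPX)
# and translocation (CTX) events are dropped before annotation.
UNSUPPORTED_SVTYPES = {"CPX", "CTX"}

def keep_record(info):
    """Return True if a record's INFO dict describes a supported SV type."""
    return info.get("SVTYPE") not in UNSUPPORTED_SVTYPES

records = [{"SVTYPE": "DEL"}, {"SVTYPE": "CPX"}, {"SVTYPE": "INS"}, {"SVTYPE": "CTX"}]
kept = [r for r in records if keep_record(r)]
print([r["SVTYPE"] for r in kept])  # ['DEL', 'INS']
```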

TedBrookings (Contributor) left a comment:

I have a few questions and I'm requesting a few minor, stylistic changes. Overall it looks pretty good.

I have a couple of comments about format_svtk_vcf_for_gatk.py that are either 1) not really a request for change or 2) just outside the changes you made, and therefore not critical for the PR.

  1. Minor style note: I noticed __parse_bnd_ends and __parse_ploidy_table start with double underscores. I've taken to using double underscores only for __parse_arguments, because double underscores enforce privacy: if a python script does "from other_script import *" and that other script happens to have __parse_arguments in it, nothing bad happens, but if they both had _parse_arguments, that would get clobbered and produce crazy crashes. I don't think it's necessary to use more than a single underscore for most other private functions, and it's arguably overkill even for __parse_arguments.

  2. It's better practice to use a context manager than to handle the closing of file-like objects directly, e.g.:

with pysam.VariantFile(arguments.vcf, 'r') as vcf_in:
    # vcf_in will be closed when the block ends, including on early return, exception, etc.
    do_stuff()

Technically you could do that even with filter_out, using nullcontext, although that may be a bit clunky.
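A runnable sketch of the nullcontext idea, using plain files in place of pysam (the function name, file roles, and behavior here are illustrative assumptions, not the script's actual API):

```python
from contextlib import nullcontext

def count_data_lines(in_path, rejected_path=None):
    # `rejected_path` plays the role of an optional output like filter_out:
    # when it is None, nullcontext(None) stands in so a single `with`
    # statement manages both cases, and files are closed on return or
    # exception without any manual close() calls.
    out_cm = open(rejected_path, "w") if rejected_path else nullcontext(None)
    count = 0
    with open(in_path) as fin, out_cm as fout:
        for line in fin:
            if line.startswith("#"):
                if fout is not None:
                    fout.write(line)  # divert header lines when requested
            else:
                count += 1
    return count
```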

(Resolved review threads: src/sv-pipeline/scripts/format_svtk_vcf_for_gatk.py ×2, wdl/SVConcordance.wdl ×4.)
@@ -2236,6 +2238,7 @@
"name": "ref_panel_1kg",
"outlier_cutoff_table": "gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v1/module03_outlier_cutoff_table.tsv",
"ped_file": "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/ref-panel/1KG/v1/ped/1kg_ref_panel_v1.ped",
"ploidy_table": "gs://gatk-sv-ref-panel-1kg/outputs/ClusterBatch/mw-concordance/ref_panel_1kg.ploidy.FEMALE_chrY_1.tsv",
A collaborator commented:

We should open a ticket to track the changes needed to repopulate this JSON with create_test_batch.py, including outputting this file from the Batch pipeline or keeping this value along with the new batch's metadata. I think some other recent changes might need to sync up with the create_test_batch.py script too (e.g. handling tarballs of std VCFs vs. the file lists used in the single-sample pipeline), so it would be good to keep track of those.

mwalker174 (Author):

Done.

"SVConcordance.sv_pipeline_docker": {{ dockers.sv_pipeline_docker | tojson }},
"SVConcordance.sv_utils_docker": {{ dockers.sv_utils_docker | tojson }},

"SVConcordance.eval_vcf" : {{ test_batch.merged_depth_vcf | tojson }},
A collaborator commented:

I thought the comparison would be the raw calls vs. the clean VCF, not the clustered depth VCF? (If that's the case, the clean VCF should be GATK-ready except for dropping CTX/CPX for your tool, so I'm not sure the call to SvUtilsFixVcf is necessary.)

mwalker174 (Author):

Good catch, fixed.

TedBrookings (Contributor) left a comment:

I'm changing my review to "approve". I made a couple of comments, including one with suggested changes. I'll leave it to your judgement whether to make those changes, though; they're not critical.

Comment on lines 159 to 163
new_record = vcf_out.new_record(id=record.id, contig=contig, start=record.start, stop=record.stop, alleles=alleles)
# copy INFO fields
for key in record.info:
    if key not in default_remove_infos and key not in remove_infos:
        new_record.info[key] = record.info[key]
TedBrookings (Contributor):

Two last suggestions for this section: 1) combine the default and passed remove_infos into one set object, so each key needs one membership check instead of two, and 2) construct the baseline info field as one object so as to avoid multiple API calls to set info. I'll leave it to your judgement whether to make either of these changes (they can be done independently).

Suggested change: replace

new_record = vcf_out.new_record(id=record.id, contig=contig, start=record.start, stop=record.stop, alleles=alleles)
# copy INFO fields
for key in record.info:
    if key not in default_remove_infos and key not in remove_infos:
        new_record.info[key] = record.info[key]

with:

remove_infos = remove_infos.union(default_remove_infos)
new_record = vcf_out.new_record(id=record.id, contig=contig, start=record.start, stop=record.stop, alleles=alleles,
                                info={key: value for key, value in record.info.items() if key not in remove_infos})
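The shape of this refactor can be checked without pysam. Below, record INFO fields are modeled as a plain dict, and the set union plus a single dict comprehension stand in for the suggested code; all names (and the default exclusions) are illustrative, not taken from the script.

```python
# Hypothetical defaults standing in for the script's default_remove_infos.
DEFAULT_REMOVE_INFOS = frozenset({"UNRESOLVED", "MULTIALLELIC"})

def build_info(record_info, remove_infos=frozenset()):
    # Union the caller's exclusions with the defaults once, so each key is
    # tested against a single set, and build the INFO mapping in one pass
    # instead of assigning fields one API call at a time.
    remove = set(remove_infos) | DEFAULT_REMOVE_INFOS
    return {key: value for key, value in record_info.items() if key not in remove}

info = build_info({"SVTYPE": "DEL", "SVLEN": 100, "UNRESOLVED": True}, {"SVLEN"})
print(info)  # {'SVTYPE': 'DEL'}
```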

mwalker174 (Author):

Made this change.

(One more resolved review thread on wdl/SVConcordance.wdl.)
@mwalker174 mwalker174 merged commit 1f1a3b3 into master Sep 27, 2022
@mwalker174 mwalker174 deleted the mw_concordance branch September 27, 2022 22:12
"gq_recalibrator_docker": "us.gcr.io/broad-dsde-methods/tbrookin/gatk:0a7e1d86f",
"str": "us.gcr.io/broad-dsde-methods/gatk-sv/str:2022-09-15-v0.25.1-beta-b53e58af"
A member commented:

I suppose this PR is not related to EH. If so, I wonder how build_docker.py ended up incorrectly updating this, while it worked correctly when invoked by the bot.

mwalker174 (Author):

It had to do with the way I invoked build_docker.py: my master branch was out of date on the machine where I did the build, but my working branch was rebased on the latest master, which contained your recent changes to some str code. I used "master" as the current commit, so it saw the str updates relative to the outdated master and rebuilt the docker.

A member commented:

That explains it, thanks!
