-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Be explicit about when VariantContexts are biallelic #8332
Conversation
Github actions tests reported job failures from actions build 5040871715
|
Github actions tests reported job failures from actions build 5042377589
|
Github actions tests reported job failures from actions build 5051256043
|
private static final long serialVersionUID = 1L; | ||
|
||
public BiallelicVariant(final VariantContext vc) { | ||
super(vc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good idea for a fix... I ALMOST want to suggest we disentangle this from all of the heavy VCF construction validation steps in HTSJDK that we usually undergo when we call the VCF constructor... those can actually show up as a non-insignificant amount of the runtime in profilers in some circumstances....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also tempted. It's doable to replace this with a class (maybe call it Event
to parallel EventMap
?) that contains just a ref, alt, and position. Think I should?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in a lot of ways its cleaner... I like Event <-> EventMap... it could also have an event.asVCFLine() argument to convert it more lazily when we actually need a vcf in the code... It would mean pulling a lot of functionality that the htsjdk vcf was previously handing into this class however
src/main/java/org/broadinstitute/hellbender/utils/haplotype/BiallelicVariant.java
Outdated
Show resolved
Hide resolved
.../broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerIntegrationTest.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/utils/haplotype/BiallelicVariant.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/utils/haplotype/EventMap.java
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/walkers/genotyper/GenotypingEngine.java
Show resolved
Hide resolved
...main/java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/AssemblyResultSet.java
Outdated
Show resolved
Hide resolved
// sort by score = # of kmers in haplotype that appear in any read -- the '-' sign sorts greatest to least | ||
// TODO this code might use some normalizing based on haplotype length in the future | ||
return onlyNewHaplotypes.stream() | ||
.sorted(Comparator.comparingLong(hap -> -kmerizeSequence(hap.getBases(), kmerSize).stream() | ||
.distinct() // TODO: The input haplotypes are not always unique. Is this a horrible code smell? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Terrifying... This haplotype kmer in the reads heuristic deserves a re-evaluation at some point since there a demons in there somewhere since it never really got a proper evaluation... For the sake of this refactor do that elsewehre though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
though with representation issues i'm not surprised if you could accidentally re-create an existing haplotype by injecting a new allele into a bunch of the assembly haplotypes...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's bad that this method is given a List
with non-unique elements, but it's totally natural that the process of generating haplotypes would be redundant. I'm going to use a Set
instead of a List
to deal with the problem upstream.
#carrot(HaplotypeCaller CARROT Regression Tests, VariantCallingCarrotOrchestrated.gatk_docker, ) |
🥕CARROT🥕 run startedTest: HaplotypeCaller CARROT Regression Tests | Status: buildingRun: HaplotypeCaller CARROT Regression Tests_run_2023-06-02 17:26:47.096893136 UTC Full details
{
"run_id": "1097e2ec-85ce-4fa8-b59e-fc82a8ea8bed",
"test_id": "c3de522b-7ce5-4a51-8b57-1cea628dd93a",
"run_group_ids": [],
"name": "HaplotypeCaller CARROT Regression Tests_run_2023-06-02 17:26:47.096893136 UTC",
"status": "building",
"test_wdl": "gs://dsp-methods-carrot-data/wdl-prod/8fce9006-acbf-48ed-984a-2ec988d82eea/test.wdl",
"test_wdl_hash": "272dc41890e3710cc96c32af03df25065cc4aa9dc389e3c2473bddba7be237db3e0698c15ef287c4619cff83e9b2e8e5e0a486eb4534658604e4bb312f308611",
"test_wdl_dependencies": null,
"test_wdl_dependencies_hash": null,
"eval_wdl": "gs://dsp-methods-carrot-data/wdl-prod/7e3704ce-f26c-4465-a6ab-f64faca619f4/eval.wdl",
"eval_wdl_hash": "8cecc1e6a3ade904ed3bfaae834df6aeac9b50fbc08966557f9e0c1628058b2c64d080f78d0cad222b4b02400db47d540d3a1bcdb8275c475b49a027bf330605",
"eval_wdl_dependencies": null,
"eval_wdl_dependencies_hash": null,
"test_input": {
"VariantCallingCarrotOrchestrated.CHM_base_file_name": "CHM113",
"VariantCallingCarrotOrchestrated.CHM_calling_interval_list": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/wgs_calling_regions.hg38.interval_list",
"VariantCallingCarrotOrchestrated.CHM_contamination": 0.0,
"VariantCallingCarrotOrchestrated.CHM_input_bam": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/chm1_chm13_hiseqx_sm_hf3mo.bam",
"VariantCallingCarrotOrchestrated.CHM_input_bam_index": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/NA24385_NA24385_O1D1_SM-G947H_v1_NS.bai",
"VariantCallingCarrotOrchestrated.NIST_base_file_name": "NA24385",
"VariantCallingCarrotOrchestrated.NIST_calling_interval_list": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/wgs_calling_regions.hg38.interval_list",
"VariantCallingCarrotOrchestrated.NIST_contamination": 0.0383312,
"VariantCallingCarrotOrchestrated.NIST_input_bam": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/NA24385_NA24385_O1D1_SM-G947H_v1_NS.bam",
"VariantCallingCarrotOrchestrated.NIST_input_bam_index": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/NA24385_NA24385_O1D1_SM-G947H_v1_NS.bai",
"VariantCallingCarrotOrchestrated.agg_preemptible_tries": 3,
"VariantCallingCarrotOrchestrated.break_bands_at_multiples_of": 100000,
"VariantCallingCarrotOrchestrated.contamination": 0.0,
"VariantCallingCarrotOrchestrated.exome1_base_file_name": "NA12878Exome1",
"VariantCallingCarrotOrchestrated.exome1_calling_interval_list": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/exome_calling_regions.v1.interval_list",
"VariantCallingCarrotOrchestrated.exome1_input_bam": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/NA12878_forPCRplus_1.cram",
"VariantCallingCarrotOrchestrated.exome1_input_bam_index": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/NA12878_forPCRplus_1.cram.crai",
"VariantCallingCarrotOrchestrated.gatk_control_docker": "broadinstitute/gatk-nightly:latest",
"VariantCallingCarrotOrchestrated.gatk_docker": "image_build:gatk|523daad449350ce16eae2ba600f3a37e54ed7674",
"VariantCallingCarrotOrchestrated.haplotype_scatter_count": 50,
"VariantCallingCarrotOrchestrated.monitoring_script": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/cromwell_monitoring_script.sh",
"VariantCallingCarrotOrchestrated.ref_dict": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/Homo_sapiens_assembly38.dict",
"VariantCallingCarrotOrchestrated.ref_fasta": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/Homo_sapiens_assembly38.fasta",
"VariantCallingCarrotOrchestrated.ref_fasta_index": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/Homo_sapiens_assembly38.fasta.fai",
"VariantCallingCarrotOrchestrated.use_gatk3_haplotype_caller": true
},
"test_options": {
"read_from_cache": false
},
"eval_input": {
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_confidenceInterval": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/chm.full.m38.interval_list",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_controlLabel": "CONTROLSNAPSHOT2018HG002",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_controlMonitoringExample": "test_output:VariantCallingCarrotOrchestrated.CHM_control_representative_benchmarking",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_controlRuntimeSummaries": "test_output:VariantCallingCarrotOrchestrated.CHM_control_output_runtimes",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_controlVcf": "test_output:VariantCallingCarrotOrchestrated.CHM_control_vcf",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_controlVcfIndex": "test_output:VariantCallingCarrotOrchestrated.CHM_control_vcf_index",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_evalLabel": "TESTSNAPSHOT2018HG002",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_evalMonitoringExample": "test_output:VariantCallingCarrotOrchestrated.CHM_representative_benchmarking",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_evalRuntimeSummaries": "test_output:VariantCallingCarrotOrchestrated.CHM_output_runtimes",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_evalVcf": "test_output:VariantCallingCarrotOrchestrated.CHM_output_vcf",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_evalVcfIndex": "test_output:VariantCallingCarrotOrchestrated.CHM_output_vcf_index",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_truthLabel": "CHM_GRCh38_SYNDIPv20180222",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_truthVcf": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/chm.full.m38.vcf.gz",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_truthVcfIndex": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/chm.full.m38.vcf.gz.tbi",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_confidenceInterval": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/GIAB_v3.3.2_NA12878_hg38.twist_exome.interval_list",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_controlLabel": "CONTROLSNAPSHOT2018HG002",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_controlMonitoringExample": "test_output:VariantCallingCarrotOrchestrated.EXOME1_control_representative_benchmarking",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_controlRuntimeSummaries": "test_output:VariantCallingCarrotOrchestrated.EXOME1_control_output_runtimes",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_controlVcf": "test_output:VariantCallingCarrotOrchestrated.EXOME1_control_vcf",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_controlVcfIndex": "test_output:VariantCallingCarrotOrchestrated.EXOME1_control_vcf_index",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_evalLabel": "TESTSNAPSHOT2018HG002",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_evalMonitoringExample": "test_output:VariantCallingCarrotOrchestrated.EXOME1_representative_benchmarking",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_evalRuntimeSummaries": "test_output:VariantCallingCarrotOrchestrated.EXOME1_output_runtimes",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_evalVcf": "test_output:VariantCallingCarrotOrchestrated.EXOME1_output_vcf",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_evalVcfIndex": "test_output:VariantCallingCarrotOrchestrated.EXOME1_output_vcf_index",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_truthLabel": "NA12878_GRCh38_TWISTExome",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_truthVcf": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/GIAB_v3.3.2_NA12878_hg38.vcf.gz",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_truthVcfIndex": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/GIAB_v3.3.2_NA12878_hg38.vcf.gz.tbi",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_confidenceInterval": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/HG002_GRCh38_GIAB_1_22_v4.2.1_benchmark_noinconsistent.bed",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_controlLabel": "CONTROLSNAPSHOT2018HG002",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_controlMonitoringExample": "test_output:VariantCallingCarrotOrchestrated.NIST_control_representative_benchmarking",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_controlRuntimeSummaries": "test_output:VariantCallingCarrotOrchestrated.NIST_control_output_runtimes",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_controlVcf": "test_output:VariantCallingCarrotOrchestrated.NIST_control_vcf",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_controlVcfIndex": "test_output:VariantCallingCarrotOrchestrated.NIST_control_vcf_index",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_evalLabel": "TESTSNAPSHOT2018HG002",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_evalMonitoringExample": "test_output:VariantCallingCarrotOrchestrated.NIST_representative_benchmarking",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_evalRuntimeSummaries": "test_output:VariantCallingCarrotOrchestrated.NIST_output_runtimes",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_evalVcf": "test_output:VariantCallingCarrotOrchestrated.NIST_output_vcf",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_evalVcfIndex": "test_output:VariantCallingCarrotOrchestrated.NIST_output_vcf_index",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_truthLabel": "HG002_GRCh38_GIAB",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_truthVcf": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/HG002_GRCh38_GIAB_1_22_v4.2.1_benchmark.broad-header.vcf.gz",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_truthVcfIndex": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/HG002_GRCh38_GIAB_1_22_v4.2.1_benchmark.broad-header.vcf.gz.tbi",
"BenchmarkVCFsHeadToHeadOrchestrated.hapMap": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/Homo_sapiens_assembly38.haplotype_database.txt",
"BenchmarkVCFsHeadToHeadOrchestrated.refDict": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/Homo_sapiens_assembly38.dict",
"BenchmarkVCFsHeadToHeadOrchestrated.refIndex": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/Homo_sapiens_assembly38.fasta.fai",
"BenchmarkVCFsHeadToHeadOrchestrated.reference": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/Homo_sapiens_assembly38.fasta",
"BenchmarkVCFsHeadToHeadOrchestrated.referenceVersion": "HG38",
"BenchmarkVCFsHeadToHeadOrchestrated.stratIntervals": [
"gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/HCR_hg38.bed",
"gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/LCR_Hg38.interval_list"
],
"BenchmarkVCFsHeadToHeadOrchestrated.stratLabels": [
"HCR",
"LCR"
]
},
"eval_options": {
"read_from_cache": false
},
"test_cromwell_job_id": null,
"eval_cromwell_job_id": null,
"created_at": "2023-06-02T17:26:47.097005",
"created_by": null,
"finished_at": null,
"results": null,
"errors": null
}
|
@jamesemery The commit history is now so messy that the overall diff is much more intelligible than the individual commit diffs. Few files have nontrivial changes. Here's an overview:
That's it for the trivially-changed classes. Now for the few substantive changes. . . |
@jamesemery and now the overview of the more complex changes:
After going through this exercise I realize that it's not actually so much. The diff's bark is worse than its bite. The overwhelming majority of changes are either replacing |
523daad
to
525258f
Compare
… events, and force-calling alleles
525258f
to
33ac71e
Compare
@jamesemery Back to you for further review. This is big but not as monstrous as I made it out to be yesterday. |
🥕CARROT🥕 run finishedTest: HaplotypeCaller CARROT Regression Tests | Status: succeededRun: HaplotypeCaller CARROT Regression Tests_run_2023-06-02 17:26:47.096893136 UTC Results
Full details
{
"run_id": "1097e2ec-85ce-4fa8-b59e-fc82a8ea8bed",
"test_id": "c3de522b-7ce5-4a51-8b57-1cea628dd93a",
"run_group_ids": [],
"name": "HaplotypeCaller CARROT Regression Tests_run_2023-06-02 17:26:47.096893136 UTC",
"status": "succeeded",
"test_wdl": "gs://dsp-methods-carrot-data/wdl-prod/8fce9006-acbf-48ed-984a-2ec988d82eea/test.wdl",
"test_wdl_hash": "272dc41890e3710cc96c32af03df25065cc4aa9dc389e3c2473bddba7be237db3e0698c15ef287c4619cff83e9b2e8e5e0a486eb4534658604e4bb312f308611",
"test_wdl_dependencies": null,
"test_wdl_dependencies_hash": null,
"eval_wdl": "gs://dsp-methods-carrot-data/wdl-prod/7e3704ce-f26c-4465-a6ab-f64faca619f4/eval.wdl",
"eval_wdl_hash": "8cecc1e6a3ade904ed3bfaae834df6aeac9b50fbc08966557f9e0c1628058b2c64d080f78d0cad222b4b02400db47d540d3a1bcdb8275c475b49a027bf330605",
"eval_wdl_dependencies": null,
"eval_wdl_dependencies_hash": null,
"test_input": {
"VariantCallingCarrotOrchestrated.CHM_base_file_name": "CHM113",
"VariantCallingCarrotOrchestrated.CHM_calling_interval_list": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/wgs_calling_regions.hg38.interval_list",
"VariantCallingCarrotOrchestrated.CHM_contamination": 0.0,
"VariantCallingCarrotOrchestrated.CHM_input_bam": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/chm1_chm13_hiseqx_sm_hf3mo.bam",
"VariantCallingCarrotOrchestrated.CHM_input_bam_index": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/NA24385_NA24385_O1D1_SM-G947H_v1_NS.bai",
"VariantCallingCarrotOrchestrated.NIST_base_file_name": "NA24385",
"VariantCallingCarrotOrchestrated.NIST_calling_interval_list": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/wgs_calling_regions.hg38.interval_list",
"VariantCallingCarrotOrchestrated.NIST_contamination": 0.0383312,
"VariantCallingCarrotOrchestrated.NIST_input_bam": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/NA24385_NA24385_O1D1_SM-G947H_v1_NS.bam",
"VariantCallingCarrotOrchestrated.NIST_input_bam_index": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/NA24385_NA24385_O1D1_SM-G947H_v1_NS.bai",
"VariantCallingCarrotOrchestrated.agg_preemptible_tries": 3,
"VariantCallingCarrotOrchestrated.break_bands_at_multiples_of": 100000,
"VariantCallingCarrotOrchestrated.contamination": 0.0,
"VariantCallingCarrotOrchestrated.exome1_base_file_name": "NA12878Exome1",
"VariantCallingCarrotOrchestrated.exome1_calling_interval_list": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/exome_calling_regions.v1.interval_list",
"VariantCallingCarrotOrchestrated.exome1_input_bam": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/NA12878_forPCRplus_1.cram",
"VariantCallingCarrotOrchestrated.exome1_input_bam_index": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/NA12878_forPCRplus_1.cram.crai",
"VariantCallingCarrotOrchestrated.gatk_control_docker": "broadinstitute/gatk-nightly:latest",
"VariantCallingCarrotOrchestrated.gatk_docker": "image_build:gatk|523daad449350ce16eae2ba600f3a37e54ed7674",
"VariantCallingCarrotOrchestrated.haplotype_scatter_count": 50,
"VariantCallingCarrotOrchestrated.monitoring_script": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/cromwell_monitoring_script.sh",
"VariantCallingCarrotOrchestrated.ref_dict": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/Homo_sapiens_assembly38.dict",
"VariantCallingCarrotOrchestrated.ref_fasta": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/Homo_sapiens_assembly38.fasta",
"VariantCallingCarrotOrchestrated.ref_fasta_index": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/Homo_sapiens_assembly38.fasta.fai",
"VariantCallingCarrotOrchestrated.use_gatk3_haplotype_caller": true
},
"test_options": {
"read_from_cache": false
},
"eval_input": {
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_confidenceInterval": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/chm.full.m38.interval_list",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_controlLabel": "CONTROLSNAPSHOT2018HG002",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_controlMonitoringExample": "test_output:VariantCallingCarrotOrchestrated.CHM_control_representative_benchmarking",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_controlRuntimeSummaries": "test_output:VariantCallingCarrotOrchestrated.CHM_control_output_runtimes",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_controlVcf": "test_output:VariantCallingCarrotOrchestrated.CHM_control_vcf",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_controlVcfIndex": "test_output:VariantCallingCarrotOrchestrated.CHM_control_vcf_index",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_evalLabel": "TESTSNAPSHOT2018HG002",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_evalMonitoringExample": "test_output:VariantCallingCarrotOrchestrated.CHM_representative_benchmarking",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_evalRuntimeSummaries": "test_output:VariantCallingCarrotOrchestrated.CHM_output_runtimes",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_evalVcf": "test_output:VariantCallingCarrotOrchestrated.CHM_output_vcf",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_evalVcfIndex": "test_output:VariantCallingCarrotOrchestrated.CHM_output_vcf_index",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_truthLabel": "CHM_GRCh38_SYNDIPv20180222",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_truthVcf": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/chm.full.m38.vcf.gz",
"BenchmarkVCFsHeadToHeadOrchestrated.CHM_truthVcfIndex": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/chm.full.m38.vcf.gz.tbi",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_confidenceInterval": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/GIAB_v3.3.2_NA12878_hg38.twist_exome.interval_list",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_controlLabel": "CONTROLSNAPSHOT2018HG002",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_controlMonitoringExample": "test_output:VariantCallingCarrotOrchestrated.EXOME1_control_representative_benchmarking",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_controlRuntimeSummaries": "test_output:VariantCallingCarrotOrchestrated.EXOME1_control_output_runtimes",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_controlVcf": "test_output:VariantCallingCarrotOrchestrated.EXOME1_control_vcf",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_controlVcfIndex": "test_output:VariantCallingCarrotOrchestrated.EXOME1_control_vcf_index",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_evalLabel": "TESTSNAPSHOT2018HG002",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_evalMonitoringExample": "test_output:VariantCallingCarrotOrchestrated.EXOME1_representative_benchmarking",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_evalRuntimeSummaries": "test_output:VariantCallingCarrotOrchestrated.EXOME1_output_runtimes",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_evalVcf": "test_output:VariantCallingCarrotOrchestrated.EXOME1_output_vcf",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_evalVcfIndex": "test_output:VariantCallingCarrotOrchestrated.EXOME1_output_vcf_index",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_truthLabel": "NA12878_GRCh38_TWISTExome",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_truthVcf": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/GIAB_v3.3.2_NA12878_hg38.vcf.gz",
"BenchmarkVCFsHeadToHeadOrchestrated.EXOME1_truthVcfIndex": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/GIAB_v3.3.2_NA12878_hg38.vcf.gz.tbi",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_confidenceInterval": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/HG002_GRCh38_GIAB_1_22_v4.2.1_benchmark_noinconsistent.bed",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_controlLabel": "CONTROLSNAPSHOT2018HG002",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_controlMonitoringExample": "test_output:VariantCallingCarrotOrchestrated.NIST_control_representative_benchmarking",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_controlRuntimeSummaries": "test_output:VariantCallingCarrotOrchestrated.NIST_control_output_runtimes",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_controlVcf": "test_output:VariantCallingCarrotOrchestrated.NIST_control_vcf",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_controlVcfIndex": "test_output:VariantCallingCarrotOrchestrated.NIST_control_vcf_index",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_evalLabel": "TESTSNAPSHOT2018HG002",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_evalMonitoringExample": "test_output:VariantCallingCarrotOrchestrated.NIST_representative_benchmarking",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_evalRuntimeSummaries": "test_output:VariantCallingCarrotOrchestrated.NIST_output_runtimes",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_evalVcf": "test_output:VariantCallingCarrotOrchestrated.NIST_output_vcf",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_evalVcfIndex": "test_output:VariantCallingCarrotOrchestrated.NIST_output_vcf_index",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_truthLabel": "HG002_GRCh38_GIAB",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_truthVcf": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/HG002_GRCh38_GIAB_1_22_v4.2.1_benchmark.broad-header.vcf.gz",
"BenchmarkVCFsHeadToHeadOrchestrated.NIST_truthVcfIndex": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/HG002_GRCh38_GIAB_1_22_v4.2.1_benchmark.broad-header.vcf.gz.tbi",
"BenchmarkVCFsHeadToHeadOrchestrated.hapMap": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/Homo_sapiens_assembly38.haplotype_database.txt",
"BenchmarkVCFsHeadToHeadOrchestrated.refDict": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/Homo_sapiens_assembly38.dict",
"BenchmarkVCFsHeadToHeadOrchestrated.refIndex": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/Homo_sapiens_assembly38.fasta.fai",
"BenchmarkVCFsHeadToHeadOrchestrated.reference": "gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/Homo_sapiens_assembly38.fasta",
"BenchmarkVCFsHeadToHeadOrchestrated.referenceVersion": "HG38",
"BenchmarkVCFsHeadToHeadOrchestrated.stratIntervals": [
"gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/HCR_hg38.bed",
"gs://dsp-methods-carrot-data/test_data/haplotypecaller_tests/LCR_Hg38.interval_list"
],
"BenchmarkVCFsHeadToHeadOrchestrated.stratLabels": [
"HCR",
"LCR"
]
},
"eval_options": {
"read_from_cache": false
},
"test_cromwell_job_id": "167cbd3e-0835-47b5-8325-a853bd98ec9a",
"eval_cromwell_job_id": "43bcefb2-f38b-413d-9b65-06b489e64af1",
"created_at": "2023-06-02T17:26:47.097005",
"created_by": null,
"finished_at": "2023-06-03T03:48:32.144",
"results": {
"CHM controlHCprocesshours": "86.02092777777776",
"CHM controlHCsystemhours": "0.19513888888888892",
"CHM controlHCwallclockhours": "62.28637777777777",
"CHM controlHCwallclockmax": "3.304836111111111",
"CHM controlMonitoringLogs": "gs://dsde-methods-carrot-prod-cromwell/BenchmarkVCFsHeadToHeadOrchestrated/43bcefb2-f38b-413d-9b65-06b489e64af1/call-CHMSampleHeadToHead/BenchmarkComparison/258eacc8-3768-44a8-86dc-1b2b0516a553/call-CONTROLRuntimeTask/monitoring.pdf",
"CHM controlindelF1Score": "0.8724",
"CHM controlindelPrecision": "0.8814",
"CHM controlsnpF1Score": "0.9784",
"CHM controlsnpPrecision": "0.9706",
"CHM controlsnpRecall": "0.9863",
"CHM controlsummary": "gs://dsde-methods-carrot-prod-cromwell/BenchmarkVCFsHeadToHeadOrchestrated/43bcefb2-f38b-413d-9b65-06b489e64af1/call-CHMSampleHeadToHead/BenchmarkComparison/258eacc8-3768-44a8-86dc-1b2b0516a553/call-BenchmarkVCFControlSample/Benchmark/b89e3e0d-4f93-4b2d-9008-041545f2764c/call-CombineSummaries/summary.csv",
"CHM evalHCprocesshours": "87.0306027777778",
"CHM evalHCsystemhours": "0.19828888888888896",
"CHM evalHCwallclockhours": "62.522422222222225",
"CHM evalHCwallclockmax": "3.293238888888889",
"CHM evalMonitoringLogs": "gs://dsde-methods-carrot-prod-cromwell/BenchmarkVCFsHeadToHeadOrchestrated/43bcefb2-f38b-413d-9b65-06b489e64af1/call-CHMSampleHeadToHead/BenchmarkComparison/258eacc8-3768-44a8-86dc-1b2b0516a553/call-EVALRuntimeTask/monitoring.pdf",
"CHM evalindelF1Score": "0.8724",
"CHM evalindelPrecision": "0.8814",
"CHM evalsnpF1Score": "0.9784",
"CHM evalsnpPrecision": "0.9706",
"CHM evalsnpRecall": "0.9863",
"CHM evalsummary": "gs://dsde-methods-carrot-prod-cromwell/BenchmarkVCFsHeadToHeadOrchestrated/43bcefb2-f38b-413d-9b65-06b489e64af1/call-CHMSampleHeadToHead/BenchmarkComparison/258eacc8-3768-44a8-86dc-1b2b0516a553/call-BenchmarkVCFTestSample/Benchmark/6b8eb5cf-ee16-48e8-a24f-de149e2eded2/call-CombineSummaries/summary.csv",
"EXOME1 controlindelF1Score": "0.727",
"EXOME1 controlindelPrecision": "0.632",
"EXOME1 controlsnpF1Score": "0.9878",
"EXOME1 controlsnpPrecision": "0.9815",
"EXOME1 controlsnpRecall": "0.9941",
"EXOME1 controlsummary": "gs://dsde-methods-carrot-prod-cromwell/BenchmarkVCFsHeadToHeadOrchestrated/43bcefb2-f38b-413d-9b65-06b489e64af1/call-EXOME1SampleHeadToHead/BenchmarkComparison/fa676046-ddfe-4ce8-9193-87025fd9a49b/call-BenchmarkVCFControlSample/Benchmark/8c5c120e-b932-47b0-a592-a719021e6bf9/call-CombineSummaries/summary.csv",
"EXOME1 evalindelF1Score": "0.727",
"EXOME1 evalindelPrecision": "0.632",
"EXOME1 evalsnpF1Score": "0.9878",
"EXOME1 evalsnpPrecision": "0.9815",
"EXOME1 evalsnpRecall": "0.9941",
"EXOME1 evalsummary": "gs://dsde-methods-carrot-prod-cromwell/BenchmarkVCFsHeadToHeadOrchestrated/43bcefb2-f38b-413d-9b65-06b489e64af1/call-EXOME1SampleHeadToHead/BenchmarkComparison/fa676046-ddfe-4ce8-9193-87025fd9a49b/call-BenchmarkVCFTestSample/Benchmark/b683956e-cbfb-4550-978b-cd6a28bf12a4/call-CombineSummaries/summary.csv",
"NIST controlHCprocesshours": "103.80716111111109",
"NIST controlHCsystemhours": "0.20777777777777773",
"NIST controlHCwallclockhours": "76.1228972222222",
"NIST controlHCwallclockmax": "4.163775",
"NIST controlMonitoringLogs": "gs://dsde-methods-carrot-prod-cromwell/BenchmarkVCFsHeadToHeadOrchestrated/43bcefb2-f38b-413d-9b65-06b489e64af1/call-NISTSampleHeadToHead/BenchmarkComparison/24ad1003-6862-4e29-9d4d-ea8e85bcc78b/call-CONTROLRuntimeTask/monitoring.pdf",
"NIST controlindelF1Score": "0.9902",
"NIST controlindelPrecision": "0.9903",
"NIST controlsnpF1Score": "0.9899",
"NIST controlsnpPrecision": "0.9887",
"NIST controlsnpRecall": "0.9911",
"NIST controlsummary": "gs://dsde-methods-carrot-prod-cromwell/BenchmarkVCFsHeadToHeadOrchestrated/43bcefb2-f38b-413d-9b65-06b489e64af1/call-NISTSampleHeadToHead/BenchmarkComparison/24ad1003-6862-4e29-9d4d-ea8e85bcc78b/call-BenchmarkVCFControlSample/Benchmark/7d69a7b4-2884-4b7e-9bce-fc2eab77b125/call-CombineSummaries/summary.csv",
"NIST evalHCprocesshours": "103.71990555555556",
"NIST evalHCsystemhours": "0.20632500000000004",
"NIST evalHCwallclockhours": "76.41897222222222",
"NIST evalHCwallclockmax": "4.163391666666667",
"NIST evalMonitoringLogs": "gs://dsde-methods-carrot-prod-cromwell/BenchmarkVCFsHeadToHeadOrchestrated/43bcefb2-f38b-413d-9b65-06b489e64af1/call-NISTSampleHeadToHead/BenchmarkComparison/24ad1003-6862-4e29-9d4d-ea8e85bcc78b/call-EVALRuntimeTask/monitoring.pdf",
"NIST evalindelF1Score": "0.9902",
"NIST evalindelPrecision": "0.9903",
"NIST evalsnpF1Score": "0.9899",
"NIST evalsnpPrecision": "0.9887",
"NIST evalsnpRecall": "0.9911",
"NIST evalsummary": "gs://dsde-methods-carrot-prod-cromwell/BenchmarkVCFsHeadToHeadOrchestrated/43bcefb2-f38b-413d-9b65-06b489e64af1/call-NISTSampleHeadToHead/BenchmarkComparison/24ad1003-6862-4e29-9d4d-ea8e85bcc78b/call-BenchmarkVCFTestSample/Benchmark/aba51ebf-90d5-44fa-8caa-0beb3cf1643b/call-CombineSummaries/summary.csv",
"ROC_Plots_Reported": "gs://dsde-methods-carrot-prod-cromwell/BenchmarkVCFsHeadToHeadOrchestrated/43bcefb2-f38b-413d-9b65-06b489e64af1/call-CreateHTMLReport/report.html"
},
"errors": null
}
|
🥕CARROT🥕 report map stub finishedfor test HaplotypeCaller CARROT Regression Tests (run: 1097e2ec-85ce-4fa8-b59e-fc82a8ea8bed)
|
I've started to take a look at this. Thank you for running the Carrot Tests! I guess you might be right about the score jitter not mattering but man there are enough differences in the HC integration test that i'm a little worried.... |
20 10074806 . G A 2832.06 . AC=2;AF=1.00;AN=2;DP=75;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=59.38;QD=29.09;SOR=0.840 GT:AD:DP:GQ:PL 1/1:0,71:71:99:2846,213,0 | ||
20 10075043 . T C 2420.06 . AC=2;AF=1.00;AN=2;DP=64;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=59.64;QD=26.08;SOR=1.107 GT:AD:DP:GQ:PL 1/1:0,61:61:99:2434,184,0 | ||
20 10075168 . C T 3627.06 . AC=2;AF=1.00;AN=2;DP=91;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=59.37;QD=33.85;SOR=0.997 GT:AD:DP:GQ:PL 1/1:0,88:88:99:3641,264,0 | ||
20 10074716 . G A 2415.06 . AC=2;AF=1.00;AN=2;DP=70;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=59.67;QD=29.09;SOR=1.214 GT:AD:DP:GQ:PL 1/1:0,64:64:99:2429,192,0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm seeing some mess in the qual scores and here evidently all the QD scores specifically are broken... are you certain there isn't some ugly gotcha hiding in here? Do you have an explanation for exactly where the divergence from this branch is coming from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remember the scary place in AssemblyBasedCallerUtils
where the List of haplotypes to be sorted by their kmer support score had duplicates? When I fixed the duplicate issue I broke all the tests because there are ties all over the place, so the exact (arbitrary) way in which you might modify the input order while filtering duplicates and sorting makes a difference.
The reason there are so many ties is that the score is extremely non-discriminating: as long as a kmer appears in even one read, it counts equally well, so the scoring often comes out like this: a few haplotypes with insertions leading, since longer haplotypes have more kmers with which to run up the score (I saw your comment about maybe normalzing by haplotype length -- your concern was well-placed), followed by a big tie among a lot of haplotypes with insertions.
I don't take broken tests lightly (well, to be honest I take it more lightly than, say, David Roazen but I do basically care about the integrity of the GATK) so I set some conditional breakpoints, and it turns out that the discrepancies only occur when the cutoff haplotype count occurred in the middle of a tie. That is, the differences in QUAL and QD only appear when a tie must be broken arbitrarily.
After seeing this I decided to referee ties more permissively and keep everything with a score at least as good as the cutoff. Thus, in a later commit of this PR, I decided that if the args are set to keep 5 haplotypes but the 4th through 8th have the same score we keep 8, for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay thank you for this in-depth explanation. When I had written this comment out I had not yet internalized that these mismatches ONLY occur on the pileupcaller test files which is considerably lower impact than I had imagined when I saw the number of test files mismatching. That code in particular is less well tested/supported so our threshold for alarm is lower. The primary impact of this part of the codebase is relevant to DRAGEN-GATK which you are about to take ownership of anyway, so I fully expect anything that might be broken here to have a high chance of getting run to ground later in your process. I accept your explanation.... sorry to raise the alarm bells
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will caution however. I see these issues in the DRAGEN results as well. Are those changed because the test file is triggering the GGA-LIKE fallback here? I wouldn't expect the standard dragen model to care about ordering in that code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double-checked this morning by reverting just the haplotype sorting and every DRAGEN test matches master. It's less elegant because you have to introduce a List
, then stream it and call distinct()
, instead of using a Set
from the beginning.
...tute/hellbender/tools/haplotypecaller/expected.testForceCallingNotProducingNoCalls.gatk4.vcf
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall i think Event is much cleaner than the previous code and I quite like the changes you've made... I found one place where some unit tests changed because you changed the nature of a method test but mostly its suylistic/question comments and nothing too big. You are right this looked a lot worse than it was in reality...
} | ||
|
||
// only use this for debugging, file output etc -- creating a full VariantContext is slow and defeats the purpose of this slim class | ||
public VariantContext asVariantContext() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add "@VisibleForTesting" and possibly make it package-protected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also rename this to something scarier to make it easier to be srue this isn't misused "asVariatnContextForDebugging()" or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I eliminated the sourceless one and renamed the other convertToVariantContext
in order to make clear that this is not cheap.
|
||
public static Event of(final VariantContext vc) { | ||
Utils.validateArg(vc.isBiallelic(), "variant must be biallelic"); | ||
return new Event(vc.getContig(), vc.getStart(), vc.getReference(), vc.getAlternateAllele(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this pull in the attributes from the VC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
possibly make it configurable>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any need in my code, but I renamed it to ofWithoutAttributes
so people will know.
@@ -48,6 +49,12 @@ public Map<String, Object> annotate(final ReferenceContext ref, | |||
return Collections.unmodifiableMap(map); | |||
} | |||
|
|||
public static Pair<List<Integer>, byte[]> getNumTandemRepeatUnits(final ReferenceContext ref, final Event event) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gross....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unavoidable i guess
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I could refactor the tandem repeat code but it would be hard to justify that as a contribution to DRAGEN GATK.
allelePair.getLeft().getRefAllele(), | ||
allelePair.getLeft().getAllele(), | ||
allelePair.getLeft().getLoc()), | ||
allelePair.getLeft().refAllele(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not your responsiblity here but we should maybe take another stab at refactoring this ultima code...
pos12.getEventMap().put(2, vc2); | ||
pos12.getEventMap().put(4, vc5); | ||
pos12.getEventMap().put(5, vc5); | ||
pos12.setEventMap(EventMap.of(vcToEvent(vc1), vcToEvent(vc2), vcToEvent(vc4), vcToEvent(vc5))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so much nicer
/** | ||
* Remove haplotypes with alleles we wish to filter | ||
* // TODO this is a bad algorithm for bad people -- it might eliminate good alleles on the same haplotypes | ||
* // TODO: this should be a method of AssemblyResultSet, not an external static method |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
outdated comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took out the second TODO. But as I understand your comment that it is a "bad algorithm for bad people" remains true.
} else { | ||
Utils.validateArg(!vc2.isSNP(), () -> "vc1 is " + vc1 + " but vc2 is a SNP, which implies there's been some terrible bug in the cigar " + vc2); | ||
} | ||
protected static Event combineEvents(final Event e1, final Event e2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A term i've seen for these are "compoundEvents" which i think is useful for parsing what this method does since its a little confusing by the name (since bialeleic event containers don't seem like they SHOULD be combinable?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed it as makeCompoundEvents
and expanded javadoc.
return new EventMap(getEvents(haplotype, ref, haplotype.getLocation(), maxMnpDistance)); | ||
} | ||
|
||
public static EventMap of(final Event ... events) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
visibleForTesting (for now) however you should think on how the PartiallyDeterminedHaplotype() uses these event maps when asked... you might want to use this constructor/something like it given that we keep generating artificial haplotypes as part of that code and it might not be... necessary... a good idea... FE... to have to re-align them given that they are artificial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be easier to have it explicitly make the event map on construction from the constituent events
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VisibleForTesting for now. I will ponder further as I get deeper into the PartiallyDeterminedHaplotype code.
} | ||
}); | ||
return results; | ||
public static List<VariantContext> getEventsFromActiveHaplotypes(final int loc, final List<Haplotype> haplotypes, final boolean includeSpanningEvents) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems like a misnomer given that this is the exit point for returning VCs from the events...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
.collect(Collectors.toMap( | ||
Map.Entry::getKey, Map.Entry::getValue, (e1, e2) -> e1, LinkedHashMap::new)); | ||
// sort by score = # of kmers in haplotype that appear in any read | ||
// TODO this code might use some normalizing based on haplotype length in the future |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly we should just make a ticket in github to revisit this entire codepath at some later date... i don't like this heuristic and i suspect we could do much better... thank you for not cutting too deeply here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also i will warn you this code WAS a performance hotspot so you might want to consider if any of these streams are going to be heavy duty.... Particularly anything looping over kmers...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be a problem because it's only one stream per haplotype. If it were streaming over every kmer (or of course if the CARROT tets showed a performance regression) I would be more worried.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall i think Event is much cleaner than the previous code and I quite like the changes you've made... I found one place where some unit tests changed because you changed the nature of a method test but mostly its suylistic/question comments and nothing too big. You are right this looked a lot worse than it was in reality...
@jamesemery back to you. |
src/main/java/org/broadinstitute/hellbender/utils/haplotype/Event.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The refactor looks good to me. I would consider (mostly for your own sake) running this through the FE workspace one more time since there were non-zero changes to the dragen pipeline just to re-set your baseline. Otherwise this seems alright to merge (one comment)
@jamesemery This is a warmup PR for DRAGEN stuff, a bit of housekeeping of code at the margins of partially determined haplotype logic.