Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix GT header in PostprocessGermlineCNVCalls's --output-genotyped-intervals output #8621

Merged
merged 2 commits into from
Dec 14, 2023

Conversation

jmarshall
Copy link
Contributor

PostprocessGermlineCNVCalls produces a GT header like this in one of its output files:

##FORMAT=<ID=GT,Number=1,Type=Integer,Description="Genotype">

which should have been Type=String. Incorrectly writing this as Type=Integer produces invalid VCF files and in particular causes bcftools to misparse records' genotype fields.

I searched for other similar problems and noted the inconsistencies in ReblockGVCFUnitTest.java. However this probably isn't causing any actual problems, so you may wish to include only the first commit — which very much is causing production problems.

Incorrectly writing this as Type=Integer causes bcftools to misparse
the genotype field.
@droazen droazen requested a review from ldgauthier December 13, 2023 05:17
@droazen droazen self-assigned this Dec 13, 2023
@droazen droazen requested a review from mwalker174 December 13, 2023 19:30
Copy link
Contributor

@mwalker174 mwalker174 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jmarshall the gCNV change LGTM. The first of many steps towards spec compliance in the GT field actually.

Would it be possible for us to merge after today's release to give users a chance to reconcile any downstream issues this introduces? I would imagine this could break some analysis scripts e.g. using pysam.

@jmarshall
Copy link
Contributor Author

When to merge is your decision, but note that:

  • Current production pipelines using bcftools are already broken by this
  • Pysam currently misreads GT when given this header, so pysam-using analysis scripts are probably also already broken and/or contain workarounds. e.g. a basic for v in pysam.VariantFile(fname): print(v) loop produces the following (note the -65 instead of . for GT):
1	100	.	C	T	.	PASS	AN=4	GT:GQ	-65:245

@droazen droazen assigned lbergelson and mwalker174 and unassigned droazen and lbergelson Dec 14, 2023
@droazen
Copy link
Contributor

droazen commented Dec 14, 2023

@mwalker174 Since GATK 4.5 is now out, you can hit merge on this whenever you're comfortable.

@lbergelson lbergelson merged commit fd873e9 into broadinstitute:master Dec 14, 2023
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants