parse strange contig names correctly in the commandline -L argument #1438
Comments
This will be tricky, as GATK4, unlike GATK3, does not require that a sequence dictionary be present (e.g., tools can take just a VCF). It is a bit crazy that these problematic characters are forbidden in the VCF spec but not the SAM spec...
Re-assigning to @cmnbroad for the 4.0 milestone.
#4093 detects ambiguities, but throws when it finds them. Reopening this to keep the history, since we should probably still invent some kind of quoting mechanism to allow the user to resolve ambiguities.
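To make the "detect, then throw" behaviour concrete, here is a minimal Python sketch. It is not the code from #4093: the names `interpretations`, `resolve`, and `AmbiguousIntervalError` are hypothetical, the dictionary is modelled as a plain mapping from contig name to length, and only the `contig`, `contig:pos`, and `contig:start-end` forms are handled.

```python
import re


class AmbiguousIntervalError(ValueError):
    """Raised when a query string has more than one valid reading."""


def interpretations(query, contig_lengths):
    """Return every (contig, start, end) reading whose contig is in the dictionary."""
    found = []
    # Reading 1: the whole string is a contig name.
    if query in contig_lengths:
        found.append((query, 1, contig_lengths[query]))
    # Reading 2: "<contig>:<pos>" or "<contig>:<start>-<end>", split at the last colon.
    prefix, sep, suffix = query.rpartition(":")
    match = re.fullmatch(r"(\d+)(?:-(\d+))?", suffix)
    if sep and match and prefix in contig_lengths:
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        found.append((prefix, start, end))
    return found


def resolve(query, contig_lengths):
    """Return the single valid reading, or raise if there is none or more than one."""
    found = interpretations(query, contig_lengths)
    if not found:
        raise ValueError(f"no contig in the dictionary matches {query!r}")
    if len(found) > 1:
        raise AmbiguousIntervalError(f"{query!r} is ambiguous: {found}")
    return found[0]
```

With a dictionary that contains both `HLA:01:01` and `HLA:01:01:01`, `resolve("HLA:01:01:01", ...)` raises, because the string reads equally well as a whole contig or as position 1 on `HLA:01:01`. A quoting mechanism would give the user a way to pick one reading in that case.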
This is done well enough for hg38 purposes -- we can open a separate ticket if we want to go further with a quoting mechanism.
Charlotte recently stepped on a bug in GATK: it interprets an -L argument that ends in the regex ':[0-9]*' as indicating a single site on the contig that precedes it, and then barfs if it cannot find that contig in the dictionary. In hg38 we have contig names like 'HLA:01:01:01', and when one is used on the command line (as in CreateRealignerTargets) GATK barfs, as in the following workflow: https://picard.broadinstitute.org/pipeline/workflows/viewWorkflow/8536444
Given that the SAM spec allows any printable character from ! through ~ in the trailing characters of contig names (yikes!!), see samtools/hts-specs#124, it seems that some more "smarts" need to be put into the parsing of this argument.
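The failure can be reproduced with a small Python sketch. Both functions are illustrative assumptions, not GATK code: `naive_parse` mimics the rule described above (a trailing colon plus digits is always a position), and `dictionary_aware_parse` shows one possible kind of "smarts", checking the whole string against the known contig names first.

```python
import re


def naive_parse(query):
    """Treat a trailing ":<digits>" as a single-site position, as in the report."""
    match = re.fullmatch(r"(.+):(\d+)", query)
    if match:
        return match.group(1), int(match.group(2))
    return query, None


def dictionary_aware_parse(query, contig_names):
    """Prefer the whole string when it is a known contig (hypothetical policy)."""
    if query in contig_names:
        return query, None
    contig, position = naive_parse(query)
    if contig not in contig_names:
        raise ValueError(f"contig {contig!r} not found in the dictionary")
    return contig, position


hg38_like = {"chr1", "HLA:01:01:01"}

# The naive rule splits the contig name itself and looks up the wrong contig.
print(naive_parse("HLA:01:01:01"))
# The dictionary-aware variant keeps the name intact.
print(dictionary_aware_parse("HLA:01:01:01", hg38_like))
print(dictionary_aware_parse("chr1:100", hg38_like))
```

This approach depends on a sequence dictionary being available, and it silently picks the whole-contig reading when a shorter contig name would also match.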