Sequence name regex issue in SortSam #1574

Closed
rwinand opened this issue Sep 3, 2020 · 1 comment

rwinand commented Sep 3, 2020

Bug Report

Affected tool(s)

  • SortSam

Affected version(s)

  • 2.23.3

Description

When I try to run SortSam on my file, I get the following error:

Exception in thread "main" htsjdk.samtools.SAMException: Sequence name 'ACol-X0348-2020-VARS(H06)-S2' doesn't match regex: '[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*' 
	at htsjdk.samtools.SAMSequenceRecord.validateSequenceName(SAMSequenceRecord.java:210)
	at htsjdk.samtools.SAMSequenceRecord.<init>(SAMSequenceRecord.java:93)
	at htsjdk.samtools.SAMTextHeaderCodec.parseSQLine(SAMTextHeaderCodec.java:214)
	at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:113)
	at htsjdk.samtools.SAMTextReader.readHeader(SAMTextReader.java:216)
	at htsjdk.samtools.SAMTextReader.<init>(SAMTextReader.java:63)
	at htsjdk.samtools.SAMTextReader.<init>(SAMTextReader.java:73)
	at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:434)
	at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:208)
	at picard.sam.SortSam.doWork(SortSam.java:152)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:301)
	at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
	at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
Exit status: 1

This error is caused by the '(' and ')' characters in the sequence name. Some tools accept a READ_NAME_REGEX parameter, but it is not a valid parameter for SortSam.
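To see which characters trip the check, the pattern from the exception message can be tested outside of Picard. The following is a minimal sketch in Python, not htsjdk's actual code: the helper names are hypothetical, and it assumes the name must match the pattern as a whole string.

```python
import re

# Pattern copied verbatim from the exception message above.
RNAME_PATTERN = re.compile(
    r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"
)

# Character class allowed after the first character (second half of the pattern).
ALLOWED_CHAR = re.compile(r"[0-9A-Za-z!#$%&*+./:;=?@^_|~-]")


def is_legal_sequence_name(name):
    """Hypothetical helper: True if the whole name matches the pattern."""
    return RNAME_PATTERN.fullmatch(name) is not None


def offending_characters(name):
    """Hypothetical helper: characters that are not allowed anywhere in a name.

    Note: '*' and '=' are allowed here but not as the first character,
    so this does not flag a name that only fails on its first character.
    """
    return sorted({c for c in name if not ALLOWED_CHAR.fullmatch(c)})


if __name__ == "__main__":
    name = "ACol-X0348-2020-VARS(H06)-S2"
    print(is_legal_sequence_name(name))
    print(offending_characters(name))
```

For the name in the error message, the only characters reported are the two parentheses; every other character, including the hyphens, is accepted by the pattern.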

For the moment I went back to version 2.8.3 because there it still works.

Is it possible to change the regex used for this check on the command line? If not, is there a reason the parentheses were not included in the default regex? If there is none, I could try to make a pull request that adds them.

Steps to reproduce

Run the following command on any file where the sequence name contains parentheses:

picard SortSam I=readmap.sam O=sorted.bam SORT_ORDER=coordinate CREATE_INDEX=true

Expected behavior

Output a sorted BAM file with index

Actual behavior

Exception due to parentheses in the sequence name

rwinand commented Sep 3, 2020

After looking into this some more, I found that the issue is a direct result of the SAM specification (e.g. htsjdk issue #1295) and that it is not an issue that can be solved in Picard. Therefore, I will close the issue here and find a workaround or raise an issue at hts-specs.
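One possible workaround is to rename the offending sequences in the SAM text file before running SortSam. The sketch below is an assumption about how that could be done, not something Picard or htsjdk provides: the function names and the `_` replacement are hypothetical, it only rewrites the SN field of @SQ header lines and the RNAME and RNEXT columns of alignment lines, and it treats every line starting with `@` as a header line. Optional tags that embed reference names (such as SA) are left untouched.

```python
import re

# Any character outside the set allowed by the pattern in the exception message.
ILLEGAL_CHAR = re.compile(r"[^0-9A-Za-z!#$%&*+./:;=?@^_|~-]")


def sanitize_name(name, replacement="_"):
    """Hypothetical helper: replace characters the pattern rejects."""
    cleaned = ILLEGAL_CHAR.sub(replacement, name)
    # '*' and '=' are allowed inside a name but not as its first character.
    if cleaned[:1] in ("*", "="):
        cleaned = replacement + cleaned[1:]
    return cleaned


def sanitize_sam_line(line):
    """Hypothetical helper: rewrite sequence names in one SAM text line."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] == "@SQ":
        # Header: only the SN:<name> field carries the sequence name.
        fields = [
            "SN:" + sanitize_name(f[3:]) if f.startswith("SN:") else f
            for f in fields
        ]
    elif not line.startswith("@") and len(fields) >= 11:
        # Alignment: column 3 is RNAME, column 7 is RNEXT ('*' and '=' are special).
        for col in (2, 6):
            if fields[col] not in ("*", "="):
                fields[col] = sanitize_name(fields[col])
    return "\t".join(fields) + "\n"


if __name__ == "__main__":
    import sys

    # Usage (assumed): python sanitize_sam.py < readmap.sam > readmap.clean.sam
    for sam_line in sys.stdin:
        sys.stdout.write(sanitize_sam_line(sam_line))
```

Renaming can make two distinct sequences collide and makes the names differ from those in the reference FASTA, so the result should be checked before it is used downstream.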

@rwinand rwinand closed this as completed Sep 3, 2020