Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

issue with split_on_adapter output #47

Open
itslittman opened this issue Apr 27, 2023 · 17 comments
Open

issue with split_on_adapter output #47

itslittman opened this issue Apr 27, 2023 · 17 comments

Comments

@itslittman
Copy link

itslittman commented Apr 27, 2023

I successfully ran split_on_adapter with the --allow-multiple-splits option and it split a bunch of reads for me. When converting the new mapped SAM file to BAM format, it wrote about 570mb successfully and then failed with this error:

[E::aux_parse] B aux field type not followed by ','
[W::sam_read1_sam] Parse error at line 413764
samtools view: error reading file "sample_name.mapped.sam"

My workflow:

  1. basecall with Dorado (400bps DNA HAC w/5mC and 5hmC)
  2. convert unmapped SAM to FASTQ with samtools fastq -T '*'
  3. split_on_adapter with --allow-multiple-splits option
  4. map with minimap2
  5. convert output to BAM

This is the offending line:

789cdf07-b995-461e-84fa-bf4ef4c62f1e_1 16 chr1 151237868 60 157M1D3M1D73M35S * 0 0 ATTCCCAAAATTGCTGAGTAGTGGCAATTTTAGATTCTCTTTGGTGGAATCAGAGTGGAAGAGGTAGGCAAGAAGATTTGGAGAAAACTAGATTATAATACATACTGTAGAGAGTTCCTGGGGTTAGAGGAAGGATCTCATTTTCTCCTGTTTTTTTATGATTTTTTTCTCTTTTTGTTTTCTTGATCACTTATTATCTGACCTTCTGGTTTATGGAGGATGAGGCAGTTATGAGCAATATGATGGAACCAGGTACTAACATAAACAG @@>;>9:::A?=-,**///44564455E7200,,.02056;<>A==><=>==;;88@;?=<;;<;;<=BB@<<<=CB@=844..--,--55;<=D<<;<>=?>>>>@acs>>?>;9:=GHEBB42<==@?<;;<?CBCEHKJ@=:::;;<=>?<70.--.<;<DGJ8HB=ACB>;97732(&&&(<;<:9;<>9568899<@=D>===?CBA76<;;;<@>><<99:90//)()+.-'&&%%&&+'''&%%'%##$$'&%%&$##" NM:i:2 ms:i:454 AS:i:454 nn:i:0 tp:A:P cm:i:40 s1:i:216 s2:i:0 de:f:0.0085 rl:i:15 NM:i:4 ms:i:1108 AS:i:1108 nn:i:0 tp:A:P cm:i:84 s1:i:487 s2:i:0 de:f:0.007 rl:i:114 qs:i:15 du:f:1.43725 ns:i:5749 ts:i:10 mx:i:2 ch:i:349 st:Z:2023-04-14T03:12:29.59+00:00 rn:i:25470 f5:Z:FAW60540_pass_9382fa45_a155afd1_502.pod5 sm:f:91.767 sd:f:26.724 sv:Z:quantile MM:Z:C+h?;C+m?; ML:B:C 0->268

Any ideas?

Noah

@itslittman
Copy link
Author

Okay, so there was a smaller truncated pod5 written by the device (a software crash occurred during this run) that I included in the above sample data. I thought maybe that was the issue, so I omitted it and made a new FASTQ. I repeated everything I did above but got an extremely similar error:

[E::aux_parse] Incomplete aux field
[W::sam_read1_sam] Parse error at line 197
samtools view: error reading file "new_sample.sam"

The offending line is:

125659e2-2885-4568-b285-f3304db0097e 16 chr4 68049456 60 16S119M1I32M1I93M1I52M3I130M1I70M1D5M3D70M41S * 0 0 CACCATTGTGATTTTACTTCTGTTTCCTTAGGTATAGTTGAATATGGTTTTAAGTAGGTTGTTGTCTTAATCATAAAACTGGTTTGGTCCAACAATATACTTTGCAGAACACCATGTAAGTTTAGCTTACTTCACTTTTTTTTTTTTTTTTGGCATCCAATTGATGTGAATATAAGATCTTGCAATTCTTATTATTCAATTACAGAAATTAAGCATAATTTACTTACTTAGGGCTTTGAGGATTCTTGGTCTGATTTAACCTGAAGTTTCTAGTTTAATATCTAAAATATTGGTGAGATTGGCAGGAATAGGGCTGAAGAGGTAATAATGAATGGTGGTAGGAGGAGGATCTGTTTTATGTTTTAAAATTACAATTTTAATGTTATTAGTTGTTGGGAACATCATTAGTGATGTGGAATAAATTAGATGTATATAAAAGTACAGGTTGTAGAAGAGCTATGATAGCAATGAATGTAAGGATAATGAGGCTATTATTTTTTGTTATTTCTTGAATCATAGTCATTGTAAAAATCCTGTTAGCAGGGGCAGTTCTCCTAAGGATAGTAGAATAACAAGGATTATAGATGTTAGTGAACAATGTTTGAGTTGTTCAAGGTATAGGCATAACACCAGAG %&&'(%%&(('''),0667>CE:9<:;2176798&&&'(.1299<>ED@?@A>;;;<;;;;==??CA==<=@@b;;:2385::953332;>AA@D:::?@,,..005558763333559>:6:::<;;:;322;=?CFDEJGDCBB::64777989@@<644(((/(75;@==?<;<;<4++**+,---...0D::::?@bef@90.-,.97:72.--/.16779;89:D=:78::=?@@??577345569:@=6+((,>?@fc@=>?@FDDBA>>?@JNIFEEC<;::::;=C=:====AA@=<<;<=5*('(3:=>>CDCC=<<<<<<;===@?=9889898<<=BD>?888;CIBBBIFHF????JKOFD6=?;77-,,,.//0756612888==<;++++89<:;8=6688@C<8833336ISJCCGE--.=::;;<))(+,+)&(++-/<==>=>?AB>334*)+*-/117677897767;>>>DEEL988;?>?A@<8877('''3)'&('&#$'+++,--/4>>>=@=8777:==>=>=;92108==>*)))3++,++++-;<?@acc??@@CBA512/'''((##$%'%$$&$%&%$%%))'%$%$%&$&'&)&$###$"#" NM:i:17 ms:i:1060 AS:i:1056 nn:i:0 tp:A:P cm:i:80 s1:i:477 s2:i:0 de:f:0.0225 rl:i:15 None

I would note that the unsplit FASTQ produced no errors whatsoever after and looked perfect in IGV (other than all the duplex reads I'm trying to separate).

@ollenordesjo
Copy link
Contributor

Hi @itslittman, thanks for the question.

To me it looks like the trailing None is quite suspicious, it would be great if it's possible to see if that's part of the original (unsplit) file. I'm noting that 125659e2-2885-4568-b285-f3304db0097e doesn't have a trailing _1 or _2 in it, so it shouldn't have been split.

Cheers

@itslittman
Copy link
Author

itslittman commented Apr 29, 2023

@ollenordesjo The first offending line doesn't have a None at the end and does have a '_1' in the readname - does that one have anything else that looks fishy or could the modified base tags be causing an issue? Should I run split with the 'debug_output' option?

Noah

@itslittman
Copy link
Author

@ollenordesjo I pulled the line of the unsplit read from the mapped SAM file (the one I generated before realizing there were a lot of concatenated reads):

125659e2-2885-4568-b285-f3304db0097e 16 chr4 68049456 60 16S119M1I32M1I93M1I52M3I130M1I70M1D5M3D70M41S * 0 0 CACCATTGTGATTTTACTTCTGTTTCCTTAGGTATAGTTGAATATGGTTTTAAGTAGGTTGTTGTCTTAATCATAAAACTGGTTTGGTCCAACAATATACTTTGCAGAACACCATGTAAGTTTAGCTTACTTCACTTTTTTTTTTTTTTTTGGCATCCAATTGATGTGAATATAAGATCTTGCAATTCTTATTATTCAATTACAGAAATTAAGCATAATTTACTTACTTAGGGCTTTGAGGATTCTTGGTCTGATTTAACCTGAAGTTTCTAGTTTAATATCTAAAATATTGGTGAGATTGGCAGGAATAGGGCTGAAGAGGTAATAATGAATGGTGGTAGGAGGAGGATCTGTTTTATGTTTTAAAATTACAATTTTAATGTTATTAGTTGTTGGGAACATCATTAGTGATGTGGAATAAATTAGATGTATATAAAAGTACAGGTTGTAGAAGAGCTATGATAGCAATGAATGTAAGGATAATGAGGCTATTATTTTTTGTTATTTCTTGAATCATAGTCATTGTAAAAATCCTGTTAGCAGGGGCAGTTCTCCTAAGGATAGTAGAATAACAAGGATTATAGATGTTAGTGAACAATGTTTGAGTTGTTCAAGGTATAGGCATAACACCAGAG %&&'(%%&(('''),0667>CE:9<:;2176798&&&'(.1299<>ED@?@A>;;;<;;;;==??CA==<=@@b;;:2385::953332;>AA@D:::?@,,..005558763333559>:6:::<;;:;322;=?CFDEJGDCBB::64777989@@<644(((/(75;@==?<;<;<4++**+,---...0D::::?@bef@90.-,.97:72.--/.16779;89:D=:78::=?@@??577345569:@=6+((,>?@fc@=>?@FDDBA>>?@JNIFEEC<;::::;=C=:====AA@=<<;<=5*('(3:=>>CDCC=<<<<<<;===@?=9889898<<=BD>?888;CIBBBIFHF????JKOFD6=?;77-,,,.//0756612888==<;++++89<:;8=6688@C<8833336ISJCCGE--.=::;;<))(+,+)&(++-/<==>=>?AB>334*)+*-/117677897767;>>>DEEL988;?>?A@<8877('''3)'&('&#$'+++,--/4>>>=@=8777:==>=>=;92108==>*)))3++,++++-;<?@acc??@@CBA512/'''((##$%'%$$&$%&%$%%))'%$%$%&$&'&)&$###$"#" NM:i:17 ms:i:1060 AS:i:1056 nn:i:0 tp:A:P cm:i:80 s1:i:477 s2:i:0 de:f:0.0225 rl:i:15 qs:i:12 du:f:1.24775 ns:i:4991 ts:i:10 mx:i:2 ch:i:8 st:Z:2023-04-12T23:10:51.660+00:00 rn:i:12959 f5:Z:FAW60540_pass_a6ad0ce3_11d41050_402.pod5 sm:f:92.951 sd:f:27.937 sv:Z:quantile MM:Z:C+h?;C+m?; ML:B:C

@ollenordesjo
Copy link
Contributor

Hi @itslittman,

Yes, it does indeed look like some tags have been removed:
qs:i:12 du:f:1.24775 ns:i:4991 ts:i:10 mx:i:2 ch:i:8 st:Z:2023-04-12T23:10:51.660+00:00 rn:i:12959 f5:Z:FAW60540_pass_a6ad0ce3_11d41050_402.pod5 sm:f:92.951 sd:f:27.937 sv:Z:quantile MM:Z:C+h?;C+m?; ML:B:C

Having a closer look, it looks like adding the span of bases from which the read was extracted may have caused a problem:

https://github.com/nanoporetech/duplex-tools/blob/master/duplex_tools/split_on_adapter.py#L177

We add the {start}->{end} information after the comments, so it could be the case that attempting to parse this as a tag doesn't work. I think this is the issue.

If you either remove this data from the files, and then proceed with your steps 4 and 5 from your original comment, I think you might get it working. In that case, I'll put a TODO to either move or remove this additional metadata in the fastq comments.

Cheers!

@itslittman
Copy link
Author

@ollenordesjo how do I remove this data?

@itslittman
Copy link
Author

@ollenordesjo also how are other people getting good results without doing this if the start/end comment is automatically written?

Noah

@ollenordesjo
Copy link
Contributor

Hi @itslittman, sorry for late reply.

The easiest way to remove it from files you already have is probably this (which will remove -> from the sam file)

sed 's/ [0-9]\+->[0-9]\+$//' yoursam.sam

I think it's likely that people have not hit this issue if they haven't passed the tags from the usam through to mapping.

Would it be possible to share any flags you used for minimap for the workflow below that you shared? I'm guessing you're using -y, but would be great to know the exact process so we're not missing any details when writing the tests

basecall with Dorado (400bps DNA HAC w/5mC and 5hmC)
convert unmapped SAM to FASTQ with samtools fastq -T '*'
split_on_adapter with --allow-multiple-splits option
map with minimap2
convert output to BAM

@itslittman
Copy link
Author

itslittman commented May 11, 2023

@ollenordesjo okay thanks i'll try that. Also the only other flag I'm using with minimap2 is -Y, so that everything is soft clipped (no hard clipping). Makes it easier for me to BLAT stuff (I'm dealing with fusion genes in leukemia samples).

Noah

@itslittman
Copy link
Author

itslittman commented May 11, 2023

@ollenordesjo the sed command did not work though - same error.

Edit: I ran split with the non-modified-basecalling files output during the original run and everything converted to BAM fine. So it seems like the -y flag is to blame. Yet removing the comments the way you suggested didn't work - could it be MM and ML tags themselves that are causing the problem somehow? Even though they don't cause a problem when reads are not split?

Noah

@itslittman
Copy link
Author

@ollenordesjo could my Samtools version be an issue?

@ollenordesjo
Copy link
Contributor

Hi @itslittman, sorry for responding earlier. Not sure if the samtools version could be the issue. Which version have you tried? If you're able to upload the sam that is showing the issue I'm happy to see if I can spot the problem.

@itslittman
Copy link
Author

@ollenordesjo It's all good!

I'm using Samtools version 1.16.1 and htslib 1.16, which are from 2022 so pretty recent. How would I give you access to the SAM? It's like 30GB so even if I wanted to post it right on here I don't think I could. Would a dropbox link work?

@ollenordesjo
Copy link
Contributor

Yep, samtools version shouldn't be the problem then. If you send an email to me on ollenordesjo@nanoporetech.com I can send you a link where you can upload it

@itslittman
Copy link
Author

@ollenordesjo it's saying undeliverable to that address

@ollenordesjo
Copy link
Contributor

Ah, so sorry! It's olle [dot] nordesjo@nanoporetech.com

@Rasinj
Copy link

Rasinj commented Jul 31, 2023

Hi @itslittman, received the sam and can confirm that it's possible to fix the offending lines with the following:

sed -E 's/ [0-9]+->[0-9]+$//g' split.head.sam > split.fixed.sam

Can you let me know if this fixes the problem?

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants