Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BWA is not supported [as BWA does not allow Ns in the index] #19

Closed
RoseString opened this issue Mar 21, 2018 · 5 comments
Closed

BWA is not supported [as BWA does not allow Ns in the index] #19

RoseString opened this issue Mar 21, 2018 · 5 comments

Comments

@RoseString
Copy link

Hi @FelixKrueger ,

Thank you for the great tool! It is a great help to my research.

I have a question. Is it possible to add BWA support? Our lab is interested in doing population genetics with split reads. Basically, we will follow the GATK pipeline which recommends using BWA for read mapping. I have tried running bwa (with soft-clipping option off) followed by SNPsplit, but it failed. So I guess BWA output is not yet supported.

Also I'm wondering in a future release if we'll be able to run the tool with soft-clipping on? That would increase the mapping efficiency.

Thank you very much!

@FelixKrueger
Copy link
Owner

Hi @RoseString,

Thanks for your comments. I have personally not used BWA before (we seem to be one of the few places in the UK using Bowtie2...), but I imagine that BWA output should work with SNPsplit as well as long as it contains an MD:Z: string. Similarly, STAR requires a few special options to output this string, as is mentioned here:

SNPsplit requires the MD:Z: field of the BAM alignment to work out mismatches involving
masked N positions. Since STAR doesn’t report the MD:Z: field by default it needs to be
instructed to do so, e.g.:    
    --outSAMattributes NH HI NM MD

If you are familiar with BWA and could look into whether there are any options to output the MD:Z: string, else I can probably also take a look at some point.

I am afraid supporting soft-clipping would be a major addition to the code, and I can't see this happening any time soon.

Cheers, Felix

@RoseString
Copy link
Author

Hi @FelixKrueger ,

Sorry for the delayed reply. I got really busy last few days.. Thanks for your willingness to help! Below is some info for a test run.

The bwa mem command I used was :
bwa mem -L 1000,1000 ref/ref_N-masked.fa reads1.fq reads2.fq > out.sam
(-L 1000,1000 was to turn off the soft-clipping, as recommended in the SNPsplit documentation)

Here are some example lines from this run. It seems MD:Z: field is automatically turned on, but other attributes like NH and HI are missing? I don't know if that's causing the issues. Thanks again!

HWI-ST330:350:H0WYEADXX:2:2211:4093:68465 147 NW_005082037.1 3820 60 148M = 3558 -410 AATCACAATGTTATAAATAAATAAAGTAATTAATAAATTAAGGGATTAAAAATGAAATAATAAAATAAATCATACCCATATGGAGGAAATAAATAAAACTGAATAAATTAATAAATAAAAATAAAGTAAGGGAAAATAAAAAAGAAAA DCCCC:A>4:>CEDEDEEEDC>EDCA>DCCDDFDEEDCDDCC@>C>=CAEEC@EFFFEFFCFHEEAHHE=CE@E@FAF@GCGBHEIJJIHJIGBJIIGBBCIGBJIGBGIIHJJJGJJJJJJJJJJJJIJJJJJJJIHHHHHFFFFFC NM:i:8 MD:Z:5T28T3A4A3T10C2G3G82 AS:i:108 XS:i:62
HWI-ST330:350:H0WYEADXX:2:1209:5246:70463 99 NW_005085055.1 1580 41 1M4I143M = 1957 524 TCCAAAAAATCACCAGCAATGTGTTCTATAATTAGCTGACTAAATACATAACTATAGCAAAGCAAGCAAATCAATTGTCTCTCCTTCTCTTCTCTCTTAGTTACTTTCCACCATGAGTTAAAGCAGAGTGTGCAAACAATACATATCT BFFFFFHHHHHJJJIJJJJJJJJIHGJJIJJJJJJJJIJIIJJJJJIJIJJJIJIIJJJJJJIJJJJJJHHFHEEFFFBEDCCEEEEDDDDCDDDDDDDDDDDDDDEDDEDDDDDCDDEDDCCDCDCDDCDDDDCCBDDDDDCEEDDD NM:i:8 MD:Z:9T38A8A85C0 AS:i:127 XS:i:125 XA:Z:NW_005082193.1,+94222,2M1D146M,6;NW_005081657.1,+235965,2M1D146M,9;NW_005082263.1,+32990,83M3I62M,10;
HWI-ST330:350:H0WYEADXX:2:1209:5246:70463 147 NW_005085055.1 1957 40 57M1I90M = 1580 -524 CAAGGGAATTGCGGTCCGCAGCCACTCCAGGATGCTTAGCTCCTTCTTAATAAATCAGGGGGTAATATAAATCATGTTCCAACAATATTTTAGCTACTGTTTGAACTGTAGCCCAAGCTGCTGGAAATGCTTTAACCCACTGAGTTAA DDDDDCAB>0>BB><<.BDBB@ADDDDCC?DDDDCDDBBB?BDDDCB@>CDECDCDDDCCC>FFFFEFD@A=HEEGIHGGIHIJIJIJJIIGIGD?HIIHIED@GJJIGGIIGJIHGHGIIGJJGJIIFJGJIGHJHBDHHGFFFFF@ NM:i:4 MD:Z:18T20T20T86 AS:i:125 XS:i:125 XA:Z:NW_005084917.1,+2079,86M1I61M,4;NW_005082193.1,-94601,57M1I75M1I14M,7;
HWI-ST330:350:H0WYEADXX:2:2105:2980:28194 99 NW_005081615.1 2320121 60 148M = 2320555 510 TGGATAGTTTTCACCAATTTTCCCCCCATTATAGAACTCTATATGCTCATGAAAAAAAACAAACATTACTTGAAAGAGAGATTTCCCTATTTTGCCCAGCTGAGAGCTCAGAAATCTTGCATGGGAGTGCAAAACACTTCTCGTGGAT CFFFFFHHHHHJJJJJJJJJJIJJJJJGHIIIJJJJJJJJGIJJJJJJJJIJIJJJJHFFDFEDCEEDDDDDDDDDDDDD?CCDDEDDDDDEECCDDDBACCCAA@ACDDDDDD+:>CCCDDDDDDD?BDDDDDDBDDDDDCBDDDBC NM:i:5 MD:Z:56T57G14T0G5T11 AS:i:123 XS:i:20
HWI-ST330:350:H0WYEADXX:2:2105:2980:28194 147 NW_005081615.1 2320555 60 76M = 2320121 -510 GGTAAGAGCAATGAGTCAAACACAAAAAAAGTGGGCAAACACAAAAACAGTGTGGTTTTTTTTCTCTTAACTTTGT >+(DDDDDDDDDDDDDDDCCDDDDFHJJJJIIHJHFIIHHFIJJJIGJJJJJHGJJJJJJJJJJHHHHHHFFFFFC NM:i:2 MD:Z:2C2C70 AS:i:70 XS:i:0

@FelixKrueger
Copy link
Owner

Thanks for that. I'll be going skiing soon so I'm not exactly sure when I will get to this. But I won't forget about it! Cheers, Felix

@FelixKrueger
Copy link
Owner

Hi @RoseString ,

I just had a look at this issue again. First, I indexed a CAST/B6 genome with bwa index, and then I aligned some CAST/B6 data to that genome with a command line very similar to yours:

bwa mem -t 8 -L 1000,1000 /bi/scratch/Genomes/Mouse/Castaneus_based_on_GRCm38.v5/CAST_EiJ_N-masked/BWA_B6_CAST.fasta file_L001_R1_val_1.fq.gz file_L001_R3_val_2.fq.gz. I was puzzled by two facts:

  1. After filtering out non-primary and secondary alignments, I was very surprised to see that the first 2 alignments I was soft-clipped to more than 70% of the read length, despite supplying already adapter/quality trimmed reads and setting the penalty for clipping to -L 1000,1000:
NB501547:136:HLT3GBGX5:1:11101:15723:1049       163     4       110215658       40      19M47S  =       110215707       123     ATATCACCTCACTCAGCAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTGNGNANACATCTTG      AAAAAAEEEEEEEEEEEEE################################EE#E#E#AEEEEAEE   NM:i:0  MD:Z:19 AS:i:19 XS:i:0
NB501547:136:HLT3GBGX5:1:11101:17935:1050       99      2       137454998       60      75M     =       137455141       163     CCTCANGTCCATCATCTAGATCATAAGGGTTTCCTCAGCCATTGAAACTCTCCAGTACTNTTTNTNNTNGAAATT     AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE#EEE#E##A#AEEEEE  NM:i:7  MD:Z:5T5G47C3C1A0A1T6   AS:i:58 XS:i:0
NB501547:136:HLT3GBGX5:1:11101:17935:1050       147     2       137455141       46      55S20M  =       137454998       -163    ACTNNNGANATCATGCCNCAGNTTNANNNNNNNNNNNNNNNNNNNNNNNNNNNNNTGTCTGGCTTCCTGTGGTCA     EEA###AE#EEEEEEEE#EEE#EE#E#############################EEEEEEEEEEEEEEEAAAAA  NM:i:0  MD:Z:20 AS:i:20 XS:i:0
  1. When proceeding with SNPsplit on some 1M mapped reads (command was SNPsplit --snp all_SNPs_CAST_EiJ_GRCm38.txt.gz --no_sort --paired), I found that it ran through just fine (not sure what you meant when you said it failed). It produced the following SNP coverage report:
===================
SNP annotation file:    all_SNPs_CAST_EiJ_GRCm38.txt.gz
SNPs stored in total:   20668547
N-containing reads:     0
non-N:                  995273
total:                  999687

Which I found very odd, as it seems to suggest that no reads covered any of the N-masked positions at all. I then tried the same thing with Bowtie2, and this is the result:

SNP coverage report
===================
SNP annotation file:    all_SNPs_CAST_EiJ_GRCm38.txt.gz
SNPs stored in total:   20668547
N-containing reads:     331645
non-N:                  645851
total:                  999976

...

Allele-specific paired-end sorting report
=========================================
Read pairs/singletons processed in total:               493533
        thereof were read pairs:                        484171
        thereof were singletons:                        9362
Reads were unassignable (not overlapping SNPs):         253607 (51.39%)
        thereof were read pairs:        247347
        thereof were singletons:        6260
Reads were specific for genome 1:                       124313 (25.19%)
        thereof were read pairs:        123590
        thereof were singletons:        723
Reads were specific for genome 2:                       114384 (23.18%)
        thereof were read pairs:        112067
        thereof were singletons:        2317
Reads contained conflicting SNP information:            1229 (0.25%)
        thereof were read pairs:        1167
        thereof were singletons:        62

So with Bowtie 2 I get ~50% allele-specific reads, but with BWA I get 0% for the exact same data! I then did some quick Googling how BWA handles Ns during the indexing procedure, and I found evidence that it randomly replaces N with C, A, T or G, please don't ask me why this would be a good idea...

So I am afraid the answer to your original question is probably:
No, BWA does not support alignments to genome with N-masked bases, and hence BWA alignments are inherently not compatible with allele-specific sorting in SNPsplit.

@FelixKrueger FelixKrueger changed the title BWA support BWA is not supported [as BWA does not allow Ns in the index] Apr 13, 2018
@RoseString
Copy link
Author

Thank you so much @FelixKrueger for looking into this in such details! It also sounds odd to me that BWA replaces Ns with random bases during indexing.

I really appreciate the time you put into this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants