Skip to content

update_bam

dytk2134 edited this page Sep 12, 2018 · 1 revision

Introduction

Convert all the coordinates, IDs and some attributes in a BAM file. Requires output from fasta_diff.

Information in the output from fasta_diff

  1. old_id - the id of the sequence in old_fasta
  2. old_start - the start position of the sequence in old_fasta
  3. old_end - the end position of the sequence in old_fasta
  4. new_id - the id of the sequence in new_fasta
  5. new_start - the start position of the sequence in new_fasta
  6. new_end - the end position of the sequence in new_fasta

Example

output from fasta_diff

Scaffold139     0       46566   NW_012161945.1  0       46566
Scaffold111     0       904887  NW_012161915.1  0       904887
Scaffold90      0       211530  NW_012162413.1  0       211530

The header section

Tag Description state
VN Format version. keep
SO Sorting order of alignments. keep
SN Reference sequence name. According to information in the output file from fasta_diff, converted the reference sequence name from old_id to new_id.
LN Reference sequence length. According to information in the output file from fasta_diff, converted the Reference sequence length to new one.
ID Format version. keep
PN Program name. keep

Example

output from fasta_diff.

Scaffold4140    0        3969    KK249218.1      0       3969
Scaffold4139    2368     8532    JHOM01041610.1  0       6164
Scaffold1688    0        390     KK246853.1      0       390
Scaffold1688    2775     4110    KK246853.1      2775    4110
Scaffold1688    4670     5814    KK246853.1      4670    5814
Scaffold1688    8337     8871    KK246853.1      8337    8871
Scaffold1688    10333    11477   KK246853.1      10333   11477

old header

@HD     VN:1.0  SO:coordinate
@SQ     SN:Scaffold4140    LN:3969
@SQ     SN:Scaffold4139    LN:8532
@SQ     SN:Scaffold1688    LN:11477
@PG     ID:TopHat       VN:2.0.9        CL:/usr/local2/tophat-2.0.9.Linux_x86_64/tophat --mate-inner-dist 50 --library-type fr-unstranded --min-anchor-length 8 --splice-mismatches 0 --min-intron-length 70 --max-intron-length 50000 --min-isoform-fraction 0.15 --max-multihits 20 --min-segment-intron 50 --max-segment-intron 500000 --segment-mismatches 2 --segment-length 20 --num-threads 6 --b2-sensitive Aros01112013-genome Aros_ICD_SFI5MH.WT-66-1_2pA_D191TACXX-6-ID06_D191TACXX-6-ID06_1_sequence.txt Aros_ICD_SFI5MH.WT-66-1_2pA_D191TACXX-6-ID06_D191TACXX-6-ID06_2_sequence.txt

new header

@HD     VN:1.0  SO:coordinate
@SQ     SN:KK249218.1    LN:3969
@SQ     SN:JHOM01041610.1    LN:6164
@SQ     SN:KK246853.1    LN:11477
@PG     ID:TopHat       VN:2.0.9        CL:/usr/local2/tophat-2.0.9.Linux_x86_64/tophat --mate-inner-dist 50 --library-type fr-unstranded --min-anchor-length 8 --splice-mismatches 0 --min-intron-length 70 --max-intron-length 50000 --min-isoform-fraction 0.15 --max-multihits 20 --min-segment-intron 50 --max-segment-intron 500000 --segment-mismatches 2 --segment-length 20 --num-threads 6 --b2-sensitive Aros01112013-genome Aros_ICD_SFI5MH.WT-66-1_2pA_D191TACXX-6-ID06_D191TACXX-6-ID06_1_sequence.txt Aros_ICD_SFI5MH.WT-66-1_2pA_D191TACXX-6-ID06_D191TACXX-6-ID06_2_sequence.txt

The new length of the sequence can be calculated by the following formula.

  • new length = new_end - new_start (The sequence only has one data in the output file from fasta_diff. Ex. Scaffold4140 and Scaffold4139)
  • new length = maximum of new_end - minimum of new_start (The sequence has multiple data in the output file from fasta_diff. Ex. Scaffold1688)

The alignment section

Col Field Type Brief description state
1 QNAME String Query template NAME keep
2 FLAG Int bitwise FLAG keep
3 RNAME String Reference sequence NAME According to information in the output file from fasta_diff, converted the RNAME from old_id to new_id
4 POS Int 1-based leftmost mapping POSition According to information in the output file from fasta_diff, converted the POS to the new one.
5 MAPQ Int MAPping Quality keep
6 CIGAR String CIGAR string keep
7 RNEXT String Ref. name of the mate/next read According to information in the output file from fasta_diff, converted the RNEXT from old_id to new_id
8 PNEXT Int Position of the mate/next read According to information in the output file from fasta_diff, converted the PNEXT to the new one.
9 TLEN Int observed Template LENgth keep
10 SEQ String segment SEQuence keep
11 QUAL String ASCII of Phred-scaled base QUALity+33 keep
12 TAGs String optional fields on a SAM/BAM Alignment keep

Tags(col12) won't be updated in version 2.0!

The coordinates(POS and PNEXT) are converted by the following algorithm

  • BAM_new_start = BAM_old_start - old_start + new_start
  • BAM_old_end = BAM_old_end - old_start + new_start

In the following situation, the alignment will be removed.

  • mate/next read name(col7) or reference sequence name(col3) not not found in the output file from fasta_diff
  • reference start(col4) and reference end coordinates not contained within old_start to old_end
  • Position of the mate/next read(col8) outside the range of match_old_start to match_old_end

If a read not in the situation mentioned above, this read will be written into the bam updated file.(No matter its mate read will be removed or not.)

Example

CASE1: New sequence is 100% match with old sequence Information in the output file from fasta_diff

old_id old_start old_end new_id new_start new_end
Scaffold1 0 2968087 NW_012161901.1 0 2968087

original BAM file

HWI-ST580:259:D191TACXX:6:2212:12148:97597      147     Scaffold1       47512   3       101M    =       47314   -299    TTTCTCCCAATGTATCCGGTTATTTGACCTCCGAGGGAGTGACCTATGACGTGGAGGGTCGTTGAATCCAGGGAAATTTCTACCATTCGGTTTAAAAGACC   #####A>33DCA>999DDCC@;55A95?B=ADFFFHHHGHHDHIIHGF@GFGGGCIGHGFJJIJFIIIHHIJJJJIHIHEFAJJJJJIHHHHHFFFFFBCB   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101        YT:Z:UU NH:i:2  CC:Z:=  CP:i:47512      HI:i:0

updated bam file

HWI-ST580:259:D191TACXX:6:2212:12148:97597      147     NW_012161901.1  47512   3       101M    =       47314   -299    TTTCTCCCAATGTATCCGGTTATTTGACCTCCGAGGGAGTGACCTATGACGTGGAGGGTCGTTGAATCCAGGGAAATTTCTACCATTCGGTTTAAAAGACC   #####A>33DCA>999DDCC@;55A95?B=ADFFFHHHGHHDHIIHGF@GFGGGCIGHGFJJIJFIIIHHIJJJJIHIHEFAJJJJJIHHHHHFFFFFBCB   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101        YT:Z:UU NH:i:2  CC:A:=  CP:i:47512      HI:i:0

CASE2:

new sequence is a substring of the old sequence with 100% match Information in the output file from fasta_diff

old_id old_start old_end new_id new_start new_end
Scaffold4139 2368 8532 JHOM01041610.1 0 6164

original bam file

CCRI0219:155:C2LNBACXX:1:2210:11347:56234       137     Scaffold4139    3375    50      49M     *       0       0       GGTGATCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAG       IJ2CFHGIIIJJDIIIIIIDHIIEIJIGIIJGGIHHE?;BEECCBCD;@       MD:Z:1A5A41     RG:Z:RG1        XG:i:0  NH:i:1  NM:i:2  XM:i:2  XN:i:0  XO:i:0  AS:i:-12        YT:Z:UU

updated bam file

CCRI0219:155:C2LNBACXX:1:2210:11347:56234       137     JHOM01041610.1  1007    50      49M     *       0       0       GGTGATCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAG       IJ2CFHGIIIJJDIIIIIIDHIIEIJIGIIJGGIHHE?;BEECCBCD;@       MD:Z:1A5A41     RG:Z:RG1        XG:i:0  NH:i:1  NM:i:2  XM:i:2  XN:i:0  XO:i:0  AS:i:-12        YT:Z:UU

CASE3:

part of the original sequence was converted into Ns Information in the output file from fasta_diff

old_id old_start old_end new_id new_start new_end
Scaffold1688 0 390 KK246853.1 0 390
Scaffold1688 2775 4110 KK246853.1 2775 4110
Scaffold1688 4670 5814 KK246853.1 4670 5814
Scaffold1688 8337 8871 KK246853.1 8337 8871
Scaffold1688 10333 11477 KK246853.1 10333 11477

original bam file

CCRI0219:155:C2LNBACXX:2:1302:21100:79098       163     Scaffold1688    4       50      45M     =       88      151     CTGCAGCAAATCGTACCGATATCCGCATCAGGTCTCCAAGGTGAA   EFGF+<CGIJJGE<ABFBEGEHFF==D<FCFFFCFFGCGI;.@;A   MD:Z:45 RG:Z:RG1        XG:i:0  NH:i:1  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YT:Z:UU

updated bam file

CCRI0219:155:C2LNBACXX:2:1302:21100:79098       163     KK246853.1      4       50      45M     =       88      151     CTGCAGCAAATCGTACCGATATCCGCATCAGGTCTCCAAGGTGAA   EFGF+<CGIJJGE<ABFBEGEHFF==D<FCFFFCFFGCGI;.@;A   MD:Z:45 RG:Z:RG1        XG:i:0  NH:i:1  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YT:Z:UU

Quick start

Requirement:

  • samtools
  • Install Python 2.7
  • Install Python pysam module (pip install pysam)

Testing enviroment:

  • Python 2.7
  • Python pysam module (0.9.1.4) installation
  • samtools (1.3.1)
  • htslib (1.3.1)

Input preparation:

  • match.tsv: the output from fasta_diff wiki
  • Bam file: the Bam file you want to update

Before processing the BAM file, you should build index first! build index

samtools index example.bam

Usage:

update_bam -a match.tsv example_file/example.bam

Example output INFO

INFO       Reading alignment data from: match.tsv...
INFO       Alignments: 522
INFO       Processing bam file: Aros_ID06.bam...
INFO       Update tag SN and LN....
INFO       Program record identifier TopHat
INFO       Standard meaning tag: CC:Z, CP:i will be updated...
INFO       Updated Alignments: 13483587
INFO       Removed Alignments: 0

Output files:

  1. [user file name]_updated.bam
  2. [user file name]_removed.bam

Running the program with –h prints the following help:

update_bam -h

usage: update_bam [-h] [-a ALIGNMENT_FILE] [-u UPDATED_POSTFIX]
                  [-r REMOVED_POSTFIX] [-v]
                  bam_FILE [bam_FILE ...]

Update the reference sequence name of the alignment and coordinates of a bam file using an alignment file generated by the fasta_diff program.
Updated features are written to a new file with '_updated'(default) appended to the original bam file name.
Feature that can not be updated, due to the id being removed completely or the feature contains regions that
are removed or replaced with Ns, are written to a new file with '_removed'(default) appended to the original bam file name.

Example:
    fasta_diff example_file/old.fa example_file/new.fa | update_bam example_file/example.bam

positional arguments:
  bam_FILE              List one or more bam files to be updated

optional arguments:
  -h, --help            show this help message and exit
  -a ALIGNMENT_FILE, --alignment_file ALIGNMENT_FILE
                        The alignment file generated by fasta_diff, a TSV file
                        with 6 columns: old_id, old_start, old_end, new_id,
                        new_start, new_end (default: STDIN)
  -u UPDATED_POSTFIX, --updated_postfix UPDATED_POSTFIX
                        The filename postfix for updated features (default:
                        "_updated")
  -r REMOVED_POSTFIX, --removed_postfix REMOVED_POSTFIX
                        The filename postfix for removed features (default:
                        "_removed")
  -v, --version         show program's version number and exit