update_bam

Introduction

Convert all the coordinates, IDs and some attributes in a BAM file. Requires output from fasta_diff.

Information in the output from fasta_diff

old_id - the id of the sequence in old_fasta
old_start - the start position of the sequence in old_fasta
old_end - the end position of the sequence in old_fasta
new_id - the id of the sequence in new_fasta
new_start - the start position of the sequence in new_fasta
new_end - the end position of the sequence in new_fasta

Example

output from fasta_diff

Scaffold139     0       46566   NW_012161945.1  0       46566
Scaffold111     0       904887  NW_012161915.1  0       904887
Scaffold90      0       211530  NW_012162413.1  0       211530

The header section

Tag	Description	state
VN	Format version.	keep
SO	Sorting order of alignments.	keep
SN	Reference sequence name.	According to information in the output file from fasta_diff, converted the reference sequence name from old_id to new_id.
LN	Reference sequence length.	According to information in the output file from fasta_diff, converted the Reference sequence length to new one.
ID	Format version.	keep
PN	Program name.	keep

Example

output from fasta_diff.

Scaffold4140    0        3969    KK249218.1      0       3969
Scaffold4139    2368     8532    JHOM01041610.1  0       6164
Scaffold1688    0        390     KK246853.1      0       390
Scaffold1688    2775     4110    KK246853.1      2775    4110
Scaffold1688    4670     5814    KK246853.1      4670    5814
Scaffold1688    8337     8871    KK246853.1      8337    8871
Scaffold1688    10333    11477   KK246853.1      10333   11477

old header

@HD     VN:1.0  SO:coordinate
@SQ     SN:Scaffold4140    LN:3969
@SQ     SN:Scaffold4139    LN:8532
@SQ     SN:Scaffold1688    LN:11477
@PG     ID:TopHat       VN:2.0.9        CL:/usr/local2/tophat-2.0.9.Linux_x86_64/tophat --mate-inner-dist 50 --library-type fr-unstranded --min-anchor-length 8 --splice-mismatches 0 --min-intron-length 70 --max-intron-length 50000 --min-isoform-fraction 0.15 --max-multihits 20 --min-segment-intron 50 --max-segment-intron 500000 --segment-mismatches 2 --segment-length 20 --num-threads 6 --b2-sensitive Aros01112013-genome Aros_ICD_SFI5MH.WT-66-1_2pA_D191TACXX-6-ID06_D191TACXX-6-ID06_1_sequence.txt Aros_ICD_SFI5MH.WT-66-1_2pA_D191TACXX-6-ID06_D191TACXX-6-ID06_2_sequence.txt

new header

@HD     VN:1.0  SO:coordinate
@SQ     SN:KK249218.1    LN:3969
@SQ     SN:JHOM01041610.1    LN:6164
@SQ     SN:KK246853.1    LN:11477
@PG     ID:TopHat       VN:2.0.9        CL:/usr/local2/tophat-2.0.9.Linux_x86_64/tophat --mate-inner-dist 50 --library-type fr-unstranded --min-anchor-length 8 --splice-mismatches 0 --min-intron-length 70 --max-intron-length 50000 --min-isoform-fraction 0.15 --max-multihits 20 --min-segment-intron 50 --max-segment-intron 500000 --segment-mismatches 2 --segment-length 20 --num-threads 6 --b2-sensitive Aros01112013-genome Aros_ICD_SFI5MH.WT-66-1_2pA_D191TACXX-6-ID06_D191TACXX-6-ID06_1_sequence.txt Aros_ICD_SFI5MH.WT-66-1_2pA_D191TACXX-6-ID06_D191TACXX-6-ID06_2_sequence.txt

The new length of the sequence can be calculated by the following formula.

new length = new_end - new_start (The sequence only has one data in the output file from fasta_diff. Ex. Scaffold4140 and Scaffold4139)
new length = maximum of new_end - minimum of new_start (The sequence has multiple data in the output file from fasta_diff. Ex. Scaffold1688)

The alignment section

Col	Field	Type	Brief description	state
1	QNAME	String	Query template NAME	keep
2	FLAG	Int	bitwise FLAG	keep
3	RNAME	String	Reference sequence NAME	According to information in the output file from fasta_diff, converted the RNAME from old_id to new_id
4	POS	Int	1-based leftmost mapping POSition	According to information in the output file from fasta_diff, converted the POS to the new one.
5	MAPQ	Int	MAPping Quality	keep
6	CIGAR	String	CIGAR string	keep
7	RNEXT	String	Ref. name of the mate/next read	According to information in the output file from fasta_diff, converted the RNEXT from old_id to new_id
8	PNEXT	Int	Position of the mate/next read	According to information in the output file from fasta_diff, converted the PNEXT to the new one.
9	TLEN	Int	observed Template LENgth	keep
10	SEQ	String	segment SEQuence	keep
11	QUAL	String	ASCII of Phred-scaled base QUALity+33	keep
12	TAGs	String	optional fields on a SAM/BAM Alignment	keep

Tags(col12) won't be updated in version 2.0!

The coordinates(POS and PNEXT) are converted by the following algorithm

BAM_new_start = BAM_old_start - old_start + new_start
BAM_old_end = BAM_old_end - old_start + new_start

In the following situation, the alignment will be removed.

mate/next read name(col7) or reference sequence name(col3) not not found in the output file from fasta_diff
reference start(col4) and reference end coordinates not contained within old_start to old_end
Position of the mate/next read(col8) outside the range of match_old_start to match_old_end

If a read not in the situation mentioned above, this read will be written into the bam updated file.(No matter its mate read will be removed or not.)

Example

CASE1: New sequence is 100% match with old sequence Information in the output file from fasta_diff

old_id	old_start	old_end	new_id	new_start	new_end
Scaffold1	0	2968087	NW_012161901.1	0	2968087

original BAM file

HWI-ST580:259:D191TACXX:6:2212:12148:97597      147     Scaffold1       47512   3       101M    =       47314   -299    TTTCTCCCAATGTATCCGGTTATTTGACCTCCGAGGGAGTGACCTATGACGTGGAGGGTCGTTGAATCCAGGGAAATTTCTACCATTCGGTTTAAAAGACC   #####A>33DCA>999DDCC@;55A95?B=ADFFFHHHGHHDHIIHGF@GFGGGCIGHGFJJIJFIIIHHIJJJJIHIHEFAJJJJJIHHHHHFFFFFBCB   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101        YT:Z:UU NH:i:2  CC:Z:=  CP:i:47512      HI:i:0

updated bam file

HWI-ST580:259:D191TACXX:6:2212:12148:97597      147     NW_012161901.1  47512   3       101M    =       47314   -299    TTTCTCCCAATGTATCCGGTTATTTGACCTCCGAGGGAGTGACCTATGACGTGGAGGGTCGTTGAATCCAGGGAAATTTCTACCATTCGGTTTAAAAGACC   #####A>33DCA>999DDCC@;55A95?B=ADFFFHHHGHHDHIIHGF@GFGGGCIGHGFJJIJFIIIHHIJJJJIHIHEFAJJJJJIHHHHHFFFFFBCB   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101        YT:Z:UU NH:i:2  CC:A:=  CP:i:47512      HI:i:0

CASE2:

new sequence is a substring of the old sequence with 100% match Information in the output file from fasta_diff

old_id	old_start	old_end	new_id	new_start	new_end
Scaffold4139	2368	8532	JHOM01041610.1	0	6164

original bam file

CCRI0219:155:C2LNBACXX:1:2210:11347:56234       137     Scaffold4139    3375    50      49M     *       0       0       GGTGATCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAG       IJ2CFHGIIIJJDIIIIIIDHIIEIJIGIIJGGIHHE?;BEECCBCD;@       MD:Z:1A5A41     RG:Z:RG1        XG:i:0  NH:i:1  NM:i:2  XM:i:2  XN:i:0  XO:i:0  AS:i:-12        YT:Z:UU

updated bam file

CCRI0219:155:C2LNBACXX:1:2210:11347:56234       137     JHOM01041610.1  1007    50      49M     *       0       0       GGTGATCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAG       IJ2CFHGIIIJJDIIIIIIDHIIEIJIGIIJGGIHHE?;BEECCBCD;@       MD:Z:1A5A41     RG:Z:RG1        XG:i:0  NH:i:1  NM:i:2  XM:i:2  XN:i:0  XO:i:0  AS:i:-12        YT:Z:UU

CASE3:

part of the original sequence was converted into Ns Information in the output file from fasta_diff

old_id	old_start	old_end	new_id	new_start	new_end
Scaffold1688	0	390	KK246853.1	0	390
Scaffold1688	2775	4110	KK246853.1	2775	4110
Scaffold1688	4670	5814	KK246853.1	4670	5814
Scaffold1688	8337	8871	KK246853.1	8337	8871
Scaffold1688	10333	11477	KK246853.1	10333	11477

original bam file

CCRI0219:155:C2LNBACXX:2:1302:21100:79098       163     Scaffold1688    4       50      45M     =       88      151     CTGCAGCAAATCGTACCGATATCCGCATCAGGTCTCCAAGGTGAA   EFGF+<CGIJJGE<ABFBEGEHFF==D<FCFFFCFFGCGI;.@;A   MD:Z:45 RG:Z:RG1        XG:i:0  NH:i:1  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YT:Z:UU

updated bam file

CCRI0219:155:C2LNBACXX:2:1302:21100:79098       163     KK246853.1      4       50      45M     =       88      151     CTGCAGCAAATCGTACCGATATCCGCATCAGGTCTCCAAGGTGAA   EFGF+<CGIJJGE<ABFBEGEHFF==D<FCFFFCFFGCGI;.@;A   MD:Z:45 RG:Z:RG1        XG:i:0  NH:i:1  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YT:Z:UU

Quick start

Requirement:

samtools
Install Python 2.7
Install Python pysam module (pip install pysam)

Testing enviroment:

Python 2.7
Python pysam module (0.9.1.4) installation
samtools (1.3.1)
htslib (1.3.1)

Input preparation:

match.tsv: the output from fasta_diff wiki
Bam file: the Bam file you want to update

Before processing the BAM file, you should build index first! build index

samtools index example.bam

Usage:

update_bam -a match.tsv example_file/example.bam

Example output INFO

INFO       Reading alignment data from: match.tsv...
INFO       Alignments: 522
INFO       Processing bam file: Aros_ID06.bam...
INFO       Update tag SN and LN....
INFO       Program record identifier TopHat
INFO       Standard meaning tag: CC:Z, CP:i will be updated...
INFO       Updated Alignments: 13483587
INFO       Removed Alignments: 0

Output files:

[user file name]_updated.bam
[user file name]_removed.bam

Running the program with –h prints the following help:

update_bam -h

usage: update_bam [-h] [-a ALIGNMENT_FILE] [-u UPDATED_POSTFIX]
                  [-r REMOVED_POSTFIX] [-v]
                  bam_FILE [bam_FILE ...]

Update the reference sequence name of the alignment and coordinates of a bam file using an alignment file generated by the fasta_diff program.
Updated features are written to a new file with '_updated'(default) appended to the original bam file name.
Feature that can not be updated, due to the id being removed completely or the feature contains regions that
are removed or replaced with Ns, are written to a new file with '_removed'(default) appended to the original bam file name.

Example:
    fasta_diff example_file/old.fa example_file/new.fa | update_bam example_file/example.bam

positional arguments:
  bam_FILE              List one or more bam files to be updated

optional arguments:
  -h, --help            show this help message and exit
  -a ALIGNMENT_FILE, --alignment_file ALIGNMENT_FILE
                        The alignment file generated by fasta_diff, a TSV file
                        with 6 columns: old_id, old_start, old_end, new_id,
                        new_start, new_end (default: STDIN)
  -u UPDATED_POSTFIX, --updated_postfix UPDATED_POSTFIX
                        The filename postfix for updated features (default:
                        "_updated")
  -r REMOVED_POSTFIX, --removed_postfix REMOVED_POSTFIX
                        The filename postfix for removed features (default:
                        "_removed")
  -v, --version         show program's version number and exit

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

update_bam

Introduction

Information in the output from fasta_diff

Example

The header section

Example

The alignment section

Example

Quick start

Requirement:

Testing enviroment:

Input preparation:

Usage:

Example output INFO

Output files:

Running the program with –h prints the following help:

Clone this wiki locally