Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Malformed file from to-mr #50

Open
nickp60 opened this issue Nov 12, 2020 · 6 comments
Open

Malformed file from to-mr #50

nickp60 opened this issue Nov 12, 2020 · 6 comments

Comments

@nickp60
Copy link

nickp60 commented Nov 12, 2020

Hi, I am trying to troubleshoot an issue with using the to-mr executable to covert a bam file generated with bwa mem. I am using preseq in a docker container build from the following dockerfile:

FROM ubuntu:20.04
ARG VER="3.1.1"
RUN apt-get update && apt-get install wget autoconf automake make gcc perl zlib1g-dev libbz2-dev liblzma-dev libcurl4-gnutls-dev libssl-dev g++ -y
RUN wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar .bz2 && \
    tar -vxjf htslib-1.9.tar.bz2 && \  
    cd htslib-1.9 && \
    make && \
    make install
ENV LD_LIBRARY_PATH /htslib-1.9/
RUN wget https://github.com/smithlabcode/preseq/releases/download/v${VER}/pres\
eq-${VER}.tar.gz &&\
    tar xzf preseq-${VER}.tar.gz &&\
    cd preseq-${VER} &&\
    mkdir build && cd build &&\
    ../configure --enable-hts  &&\
    make &&\
    make install

I ran this on a file containing about 5M reads, and got the error below. for reproducing the error, I took a subset of the BAM, and it is attached.

I ran gc_extrap as follows and got the following error:

$  to-mr -o /data/sample.mr /data/sample.bam -L 10000 -v
[input file: /data/sample.bam]
[output file: /data/sample.mr]

$ preseq gc_extrap -o /data/sample.curve /data/sample.mr -v 
LOADING READS
MAPPED READ FORMAT
ERROR:	bad line in MappedRead file: (NULL)	0	0	X	0	+

Looking through the MR file, it seems that the following reads are causing the problem:

A00758:111:HL2FCDSXY:4:2369:32416:19554	131	chr1	16788	0	75M	=	17055	268	ACTAGGGAAGAAGGTGTG
TGACCAGGGAGGTCCCCGGCCCAGCTCCCATCCCAGAACCCAGCTCACCTACCTTGA	FFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF
FFFFFFFFFFFFFFFFF	NM:i:1	MD:Z:0T74	MC:Z:75M	AS:i:74	XS:i:74	XA:Z:chr1,+187310,75M,1;chr2,-113596574,74
M1S,1;chr16,+16472,1S74M,1;chr15,-101974098,75M,2;chr9,+16900,1S74M,1;
A00758:111:HL2FCDSXY:4:2369:32416:19554	67	chr1	17055	0	75M	=	16788	-268	CCTGCAAGATTAGGCAGG
GACATGTGAGAGGTGACAGGGACCTGCAGGGGCAGCCAACAAGACCTTGTGTGCACC	FF:FFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:75	MC:Z:75M	AS:i:75	XS:i:75	XA:Z:chrX,-156022938,75M,0;chr16,+16738,75M,0;chr2
,-113596307,75M,0;chr12,+17170,75M,0;chr9,+17166,75M,0;chr15,-101973831,75M,0;chr1,+187577,75M,0;chr12_GL877875v1_alt,+717
0,75M,0;

Running with bam2mr warns of several segment length issues, but results in an output file containing the passing reads:

~/bin/preseq_v2.0/bam2mr -o sample.old.mr sample.bam -v -seg_len 100000 2> old.log

Those files are in the attached zip as well.

Is there something wrong with my BAM file? I have been using preseq 2.0.3 for a while (and bam2mr), and am trying to update to 3.1.1. Converting the BAM to BED with bedtools's bedToBam and running preseq results in no errors.

Thanks in advance!

tmp.zip

@terencewtli
Copy link
Contributor

terencewtli commented Nov 17, 2020

Hi,

Thanks for posting this issue! I'm able to reproduce your results, and it looks like there's an issue with our to-mr program not properly accounting for paired-end mates that are too far apart. I think your file is fine - as you mentioned, bam2mr and bedToBam don't result in issue with the NULL line, and we've actually experienced this issue with to-mr in the past. We recently fixed this issue in another one of repositories using another utility - I will let you know when we update the program. Sorry in advance!

@nickp60
Copy link
Author

nickp60 commented Nov 19, 2020

Thanks for looking into it @terencewtli ! While we are at it, it would be great to get a rundown of what to-mr or bam2mr are actually doing. In our hands we have noticed that preseq's gc_extrap predictions are influenced by the number of input reads, but it is unclear how reads with different SAM flags are handled. One dataset for instance:

5805263 + 0 in total (QC-passed reads + QC-failed reads)
337398 + 0 secondary
11553 + 0 supplementary
0 + 0 duplicates
5792861 + 0 mapped (99.79% : N/A)
5456312 + 0 paired in sequencing
2728156 + 0 read1
2728156 + 0 read2
4863978 + 0 properly paired (89.14% : N/A)
5436892 + 0 with itself and mate mapped
7018 + 0 singletons (0.13% : N/A)
142394 + 0 with mate mapped to a different chr

# primary reads
$ samtools view -c  -F 256  5003-JW-3.md.n5m.000.bam 
5467865

Looking at the line count, I see around 3M, which makes ballpark sense as its paired end sequencing and the MR file should be a mix of mostly pairs and some singletons:

 $bam2mr -o /data/5003-JW-3.md.n5m.000.bam.mr /data/5003-JW-3.md.n5m.000.bam
 $ wc -l ./5003-JW-3.md.n5m.000.bam.mr
3023474 ./5003-JW-3.md.n5m.000.bam.mr

However, its unclear which reads are included in the MR file. This sequencing run has 151 bp reads, so I looked at the MR fragment length in relation to that, to approximate the number of singletons and pairs:

cat 5003-JW-3.md.n5m.000.bam.mr | awk '{$4=($3-$2)>160} {print $4}' | grep 1 | wc -l 
2247473

$cat 5003-JW-3.md.n5m.000.bam.mr | awk '{$4=($3-$2)>160} {print $4}' | grep 0 | wc -l 
776001

So (2247473*2) + 776001 is 5270947, which differs from 5467865. I know there are a lot of complexities with how the reads are represented in both the BAM/SAM formats and the MR format, but any insight you could give here would be great.

@vmkalbskopf
Copy link

I'm getting a very similar error with to-mr generating invalid mr files.

ERROR: bad line in MappedRead file: (NULL) 0 0 M03562:43:000000000-BLRTF:1:2116:8312:7359 0 -

I deleted that line (the first line) and still got an error:

ERROR: bad line in MappedRead file: PRELSG_01_v1 113 259 M03562:43:000000000-BLRTF:1:2116:8312:7359 0 + CGACGTCGCTTTTTGATCCTTCGATGTCGGCTCTTTCTTTCATTTTGAAGCAGAATTCACCAAGCGTTGGTTTGTTCACCCACTAATAGGGAACGTGTGCTGGGTTTAGACCGTCGTGAGACAGGTTAGTTTTACCCTACTGATGA

The mr file has other issues too. Here is one of the following lines:

PRELSG_01_v1 146 259 FRAG:M03562:43:000000000-BLRTF:1:2106:6581:1365 0 - TCATCAGTAGGGTAAAACTAACCTGTCTCACGACGGTCTAAACCCAGCTCACGATCCCTATTAGTGGGTGAACAATCCAACGCTTGGTGAATTCTGCTTCACAATGATAGGAA ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^A^A^A^A<8F><FF><FF><FF>^@^@^@^@^@^@^@^@TCATCAGTAGGGTAAAACTAACCTGTCTCACGACGGTCTAAACCCAGCTCACGATCCCTATTAGTGGGTGAACA

@vmkalbskopf
Copy link

vmkalbskopf commented Feb 11, 2022

I'd like to try using bam2mr, but I can't build an old version of preseq. I get a make error:

`fatal error:
'string' file not found

#include
^~~~~~~~`

Is there a way I can download a binary for bam2mr?

@andrewdavidsmith
Copy link
Contributor

@vmkalbskopf I'll do my best to respond, but would it be possible to separate these issues? I would like to help you get things working asap, but I'd still like a record of the individual errors. Also, I can't promise a binary, but if you email me directly we might find a work-around.

@vmkalbskopf
Copy link

@andrewdavidsmith so kind of you to respond so quickly.

I will create a separate issue for the garbled mr file.
What's your email address?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants