
too many open files #118

Closed
RichardCorbett opened this issue Feb 6, 2015 · 8 comments

@RichardCorbett

Hi.
I've been trying to merge and duplicate-mark a large data set. Once merged, the coverage will be about 120X.

Each time I try, I have been getting the following error (or something similar):
"sambamba-markdup: sambamba_testing/sambamba-pid23155-nwfz/sorted.509.bam.vo: Too many open files"

I noticed in the help that I could reduce the number of open files by specifying a larger value for "--overflow-list-size". However, I still get the same error.

Here is the command I've been using - can you point out anything I can change to get past the error?
"sambamba_v0.5.1 merge /dev/stdout P*bam | sambamba_v0.5.1 markdup --overflow-list-size 1000000 --tmpdir sambamba_testing /dev/stdin sambamba_marked.bam

finding positions of the duplicate reads in the file...
sambamba-markdup: sambamba_testing/sambamba-pid23155-nwfz/sorted.509.bam.vo: Too many open files
"

thanks,
Richard
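
For context, this error comes from the operating system's per-process limit on open file descriptors rather than from sambamba itself, so alongside the sambamba options it can be worth checking that limit. A minimal check in bash (the limit value below is illustrative, not from this thread):

ulimit -n          # show the current soft limit on open file descriptors
ulimit -n 4096     # raise it for the current shell session, up to the administrator-set hard limit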

@lomereiter
Contributor

Hi Richard,

The large number (509) indicates that many paired reads are extremely far apart in the file (or that there are no mates at all, e.g. all reads have the same direction or are not named consistently).
Please check that the paired reads don't have suffixes like '/1'.
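
A quick way to eyeball the read names, assuming one of the input BAMs is called P1.bam (the file name is illustrative):

sambamba view P1.bam | cut -f1 | head    # first SAM column is the read name; '/1' or '/2' suffixes would show up here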

@RichardCorbett
Author

Hi,
Thanks for the quick reply. The reads don't have any weird suffixes (no /1, etc.) and we can merge/dupmark with Picard without any problems (just slowly).

Since this is a cancer sample, there are lots of regions with very high coverage. The average coverage is just over 100, but there are lots of regions higher than 1000X. In total there are over 3 billion reads. The alignment rate is above 95% and over 98% of the aligned reads are mapped in proper pairs with a mean insert size of 415bp. By our standards, these are pretty good stats for a human genome library.

Any ideas?

@lomereiter
Contributor

Try increasing --hash-table-size as well, up to 1,000,000 or even 10 million if there's enough RAM (assuming a record occupies 1 kB, that's 1 to 10 GB in total). That should help with the high-coverage regions.
(By the way, do you happen to have a rough figure for Picard's memory consumption on this dataset?)
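
Spelling out that estimate (the ~1 kB-per-record figure is the assumption above, so these are rough upper bounds):

 1,000,000 records × ~1 kB ≈ 1 GB
10,000,000 records × ~1 kB ≈ 10 GB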

@RichardCorbett
Author

When we ran Picard we capped the RAM usage at 25 GB. It used all of that and likely would have used a lot more had we allowed it.

Using the sambamba pipe like I showed above, I've seen each process use 50 GB. If we move this to our production pipeline, we'll need to keep merging and duplicate marking under 60 GB total. Do you think that is possible?

@lomereiter
Contributor

Even 30 GB total should be possible. I've fixed a few leaks recently (#116), so peak consumption of the latest binary build should be significantly lower.

@RichardCorbett
Author

thanks, I'll give it a whirl and let you know what I find.

@RichardCorbett
Author

Hi again,

Looks like I got farther this time, but ended up with a different error:

"time ./sambamba_02_02_2015 merge /dev/stdout P*bam | ./sambamba_02_02_2015 markdup --overflow-list-size 1000000 --hash-table-size 1000000 --tmpdir sambamba_testing /dev/stdin sambamba_marked.bam
finding positions of the duplicate reads in the file...
sorting 1377861776 end pairs... done in 301371 ms
sorting 23102470 single ends (among them 8026 unmatched pairs)... done in 2543 ms
collecting virtual offsets of duplicate reads... done in 141731 ms
found 82954291 duplicates, sorting the list... done in 4386 ms
collected list of positions in 863 min 26 sec
sambamba-markdup: Error reading BGZF block starting from offset 0: wrong BGZF magic
"

@lomereiter
Contributor

Ouch. Streaming input is not supported by this tool: it makes a list of file offsets and then reads the file again. Sorry for wasting 15 hours of computational time. I'm closing this issue and opening another one regarding the documentation.
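
In other words, markdup needs a seekable BAM file rather than a pipe, because it seeks back to the recorded offsets on a second pass. A minimal sketch of the two-step run (file names are illustrative; the options are the ones used earlier in the thread, and the plain sambamba binary name stands in for the versioned builds above):

sambamba merge merged.bam P*bam
sambamba markdup --overflow-list-size 1000000 --hash-table-size 1000000 --tmpdir sambamba_testing merged.bam sambamba_marked.bam
# markdup can now re-read merged.bam at the collected virtual offsets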
