MarkDup on Merge error #169

Closed
ACEnglish opened this issue Sep 28, 2015 · 10 comments
@ACEnglish

I have a pipeline that uses your combined merge/markdup feature (awesome feature, btw) from v0.5.8. I then pipe that output into calmd to repopulate the MD/NM tags.

The command I use is:

 sambamba markdup -t 4 -p --hash-table-size=65536 --tmpdir=/space1/tmp/$PBS_JOBID `cat MergeBams.txt` /dev/stdout | samtools calmd -b - hg19.fa > mergeDup.bam

However, I'm getting an error

 sambamba-markdup: Unable to write to stream

Is this a problem with piping in a way that sambamba doesn't agree with? I was able to validate the mergeDup.bam, and it appears to have all the reads in it, so I think everything executed fine; it seems it just couldn't clean up at the end.

@lomereiter
Contributor

  1. Please use -l 0 when the output goes into a pipe, to avoid unnecessary compression (I'm contemplating whether there's a simple way to do that automagically, because users forget this all the time); see the sketch after this list.
  2. Since you are running it with the -p flag, what does it output to stderr prior to exiting?
  3. Does it work without the piping?
  4. What was your reasoning behind making --hash-table-size less than the default? On a large dataset you would want to do the opposite, and also increase --overflow-list-size.
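
For illustration, here is a sketch of the pipeline with those suggestions applied: -l 0 so the piped output is uncompressed, the default hash table size, and a larger overflow list. The --overflow-list-size value is only an example, and the paths and file names are carried over from the command above.

 sambamba markdup -l 0 -t 4 -p --overflow-list-size=600000 --tmpdir=/space1/tmp/$PBS_JOBID `cat MergeBams.txt` /dev/stdout | samtools calmd -b - hg19.fa > mergeDup.bam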

@ACEnglish
Author

  1. Thanks for the tip!
  2. The stderr is muddled by the samtools calmd output, but grepping that out gives us:
   finding positions of the duplicate reads in the file
   [============================================================]
   sorting 421221672 end pairs...   done in 61004 ms
   sorting 6699488 single ends (among them 0 unmatched pairs)... done in 380 ms
   collecting indices of duplicate reads...   done in 13898 ms
   found 90859212 duplicates, sorting the list...   done in 2382 ms
   collected list of positions in 139 min 28 sec
   marking duplicates...
   sambamba-markdup: Unable to write to stream
  3. I'm re-running now, but I believe it works fine, since the bam output from even the failed job validates via picard. I'll let you know when I get those results back.
  4. I'm not sure. I'll have to ask around to see who made that decision and why.

@cviner

cviner commented Oct 10, 2015

I also recently encountered this error. It occurred without my piping anything into or out of markdup.

In my case, it appears to be caused by a large --io-buffer-size (though one well below my available RAM). In particular, both --io-buffer-size 2048 and 4096 resulted in this error. The error did not occur when the argument was omitted, nor for explicitly provided smaller values: --io-buffer-size 256, 512, and 1024 all completed without any reported errors.
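
For illustration, the pattern described above, with hypothetical input/output file names:

 sambamba markdup --io-buffer-size=2048 in.bam out.bam   # reportedly fails with "Unable to write to stream"
 sambamba markdup --io-buffer-size=512 in.bam out.bam    # reportedly completes without errors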

@ACEnglish
Author

I've tried running the commands separately and changed the parameters as suggested, but am still getting an error.

 finding positions of the duplicate reads in the file
 [===================================================================]
 sorting 530312331 end pairs...   done in 60231 ms
 sorting 18641379 single ends (among them 0 unmatched pairs)... done in 723 ms
 collecting indices of duplicate reads...   done in 15464 ms
 found 189989161 duplicates, sorting the list...   done in 4116 ms 
 collected list of positions in 642 min 3 sec
 marking duplicates...
 [                                                                              ]
 sambamba-markdup: Unable to write to stream

Here's the command

 sambamba markdup -l 0 -t 4 -p --tmpdir=/space1/tmp/ `cat MergeBams.txt` /space1/tmp//merge.bam

All of the input BAMs passed bamUtil validation, so I'm not sure what else to try. Is there any possibility that there is a problem with the merge/markdup command?

@lomereiter
Contributor

Hi, could you please tell me how many temporary files were created in /space1/tmp?

@ACEnglish
Author

I've only been able to run a single test. I saw a peak of ~600 temporary files being created by sambamba.

To watch the files, I reassigned the tmpdir to somewhere on cluster storage (TMPDIR is local node storage). What's weird is that this time it ran just fine. It seems that sambamba isn't consistently failing, so the problem is more likely on my end.

Since you were asking about temporary files: is there some sort of limit on the number of temporary files in my TMPDIR that I should be looking out for?
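
For reference, a quick way to watch the temp-file count while a job runs (the path is the --tmpdir from the command above; the 60-second interval is arbitrary):

 watch -n 60 'ls /space1/tmp/ | wc -l'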

@lomereiter
Contributor

Could it be that your local node storage simply runs out of space? With that many duplicates it's quite possible.

I was asking because the number of simultaneously open file descriptors is usually limited, see e.g. #118, but that would lead to a different error message, so it doesn't seem to be the case here.
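
For what it's worth, two quick checks that distinguish these two failure modes (the path is an assumption based on the --tmpdir used earlier in the thread):

 df -h /space1/tmp    # free space on the temp filesystem
 ulimit -n            # per-process limit on simultaneously open file descriptors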

@ACEnglish
Author

It looks like temp space may be what's holding it up. Thank you for your feedback!

@lomereiter
Contributor

@FrankFeng thanks, it turned out that the number was multiplied by the number of input files.
