markdup segmentation fault #393
Probably a similar issue here. I'm also on conda v0.6.8. For test purposes I have a very small BAM, and it seems related to the number of threads used.
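One way to pin down the thread dependence described above is to sweep markdup over a range of thread counts and record the return codes; a segfault shows up as a negative return code (SIGSEGV is -11 on Linux). This is only a sketch: it assumes `sambamba` is on PATH and uses placeholder file names (`small.bam`, `marked.bam`); `-t` is sambamba's thread-count option.

```python
import shutil
import subprocess

def markdup_cmd(threads, in_bam="small.bam", out_bam="marked.bam"):
    """Build a sambamba markdup invocation for a given thread count.
    File names are placeholders for your own test BAM."""
    return ["sambamba", "markdup", "-t", str(threads), in_bam, out_bam]

def sweep(max_threads=16):
    """Run markdup at 1..max_threads and report which counts crash.
    A segfault appears as a negative return code (e.g. -11 = SIGSEGV)."""
    results = {}
    for t in range(1, max_threads + 1):
        proc = subprocess.run(markdup_cmd(t), capture_output=True)
        results[t] = proc.returncode
    return results

# Only attempt the sweep if the binary is actually installed.
if shutil.which("sambamba"):
    for t, rc in sweep().items():
        print(f"{t:2d} threads -> return code {rc}")
```

Running the sweep a few times per thread count helps, since the crash is reported as intermittent.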
Ah, maybe this is due to a high number of duplicates.
So actually the segfaults seem more random; I changed parameters and sometimes it crashes, sometimes it runs through. So far it is unpredictable what causes this. A bioconda version/build issue?
Here is a dump.
Thanks for the dump. I am happy to chase the problem. You can see it is happening in the garbage collector. Can you try the latest binary release and see if it does the same? https://github.com/biod/sambamba/releases Also, what hardware are you on? Some Xeons have threading problems.
Different cloud servers, but the last one was an Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz. Dump from cloned master: I compiled with make debug, but it looks less informative than with the conda version.
I think that is one of the unreliable Xeons for hyperthreading. Sambamba is one of the rare tools which brings it out by utilizing all cores. See also #335. Check the list by Intel.
Mhmm, that's bad. So using fewer cores could be more stable? I also get a segfault with sambamba depth...
One option is to turn off hyperthreading. Sadly that is hard in most HPC environments. Alternatively, don't use those machines, or rerun until it gets through. You can blame Intel.
Yeah, thanks! Btw, I downgraded to v0.6.6 via conda and this seems more stable; so far I don't get any segfaults with the same data. Does this version use a different underlying library or something like that?
It is possible. What version of LLVM and ldc is it using? Underlying tools are evolving fast. Sambamba did not essentially change markdup between those versions.
It looks like the 0.6.6 conda version uses your pre-built packages from GitHub, whereas 0.6.8 is built from scratch on the conda server, as given here:
Can you try running the pre-built version for 0.6.9?
Unfortunately this doesn't help; the pre-built 0.6.9 also segfaults, and so does a locally built 0.7.0-pre1. What I observed: I rarely (never?) got segfaults with any version at 1-8 threads, and strangely also not with the local build at 9 threads; higher thread counts really often (always?) segfault.
Yeah, with your manually downloaded (not via conda) pre-built 0.6.6 it also (so far) never crashes, with whatever number of threads.
Hi folks, I ran 40 different samples and only 10 got this error, but I tried this command on a few machines and they all segfaulted. Do you have any suggestions? Thanks.
@RichardCorbett what CPU are you on? Does using an older binary of sambamba work?
CPU tested: I just had some filesystem problems overnight, so retesting is going to be a challenge. In case it may be relevant, I noticed that I was testing on a BWA aln 0.5.7 BAM file, which includes non-zero mapping qualities for unaligned reads.
Yeah, this may be another Xeon with problems. It will be interesting to see if older versions of sambamba+llvm+ldc show the same problem. @steffenheyne can you confirm you are still good with the 0.6.6 binary? And both of you, is there any way I could access one of these machines so I can debug the problem? If 0.6.6 works I could use that same build chain to compile a new binary and see if that keeps going.
sambamba 0.6.6 was built with
Another thing we could try is building sambamba with the GNU D compiler. That would fine-tune the diagnostics, showing whether the problem is with LLVM or the D runtime.
A second test run overnight duplicated the error on a machine with this CPU:
Great. I am starting to suspect the D runtime. I should try hitting one of our machines badly.
I'll try rebuilding 0.6.6 in bioconda with more recent pinnings to see if (A) it still works without segfaulting and (B) it can still be combined with other recent packages in an environment. That will allow us (@steffenheyne and me) to get around the current issue. I agree that this is likely a D runtime issue.
Thanks. I am too busy to sort it now but should have time by Easter.
Just adding my observation of this issue as well. I am using conda sambamba (0.6.8 and now 0.6.9) in a snakemake workflow and getting segfaults at the markdup stage. I set the markdup rule to call 0.6.6 from conda and things seem to be processing well.
Thanks. This is very annoying; good thing we have 0.6.6.
points at
which triggers a GC sweep and segfaults. The stack trace is informative and it looks like we have invalid pointers on the stack.
The debug version of sambamba renders
where the last line refers to another allocation
This may be what I am looking for
Just now, for the first time, sambamba ran without crashing :). The problem is out-of-order execution.
I think the problem is with scopedTask, where a task is added to the thread pool using the stack rather than the heap. In particular this section https://github.com/biod/BioD/blob/master/bio/core/bgzf/inputstream.d#L391 where a task gets created and pushed onto a roundbuf. When the garbage collector kicks in after reading the BAM file, it wants to destroy this object, but the object is in an inconsistent state (maybe the thread already got cleaned up, or it tries to clean up twice). I managed to prevent segfaulting by disabling the garbage collector, but obviously that won't do. The roundbuf is probably used to keep a task connected with the bgzf unpacking buffer; I am not sure why this is necessary. Also, I am not convinced a thread pool is that much of a benefit for bgzf unpacking, as the single-threaded routine I wrote last year is blazingly fast. Need to figure out what the best approach is...
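The D specifics (a stack-allocated scopedTask later finalized by the GC) don't map directly onto other languages, but the double-cleanup hazard can be sketched with a toy Python model. `DecompressedBlock` and its ring list here are illustrative stand-ins, not sambamba's actual types:

```python
class DecompressedBlock:
    """Toy stand-in for a decompression task/buffer pair held in a ring buffer."""
    def __init__(self, offset):
        self.offset = offset
        self.closed = False

    def close(self):
        # Guard against double cleanup. The hypothesis above is that the D
        # code has no such guard, so a second destruction walks state that
        # was already torn down.
        if self.closed:
            raise RuntimeError(f"block {self.offset} destroyed twice")
        self.closed = True

ring = [DecompressedBlock(i * 65536) for i in range(3)]
for block in ring:
    block.close()          # normal teardown by the consumer
# Later, a GC-like sweep tries to destroy the same objects again.
errors = []
for block in ring:
    try:
        block.close()
    except RuntimeError as e:
        errors.append(str(e))
print(len(errors))  # every block had already been destroyed
```

In a GC-managed language the second "destroy" happens implicitly during a sweep, which is why the crash is timing-dependent rather than deterministic.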
Adding this code to the struct whose destructor segfaults:

```d
struct DecompressedBgzfBlock {
  ~this() {
    stderr.writeln("destroy DecompressedBgzfBlock ", start_offset, ":", end_offset, " ", decompressed_data.sizeof);
  }
  ulong start_offset;
  ulong end_offset;
  ubyte[] decompressed_data;
}
```

Running a decompress block typically reads
but before a segfault we get
After making sure the start_offset is set to 0, it looks like these blocks have become invalid and the garbage collector still tries to clean them up.
Artem wrote:
My conclusion is that the 'magic' no longer works. Creating a thread on the stack and moving it to the round buffer on the heap confuses the garbage collector, which is kinda unsurprising. With markdup I can't disable the GC cleanup, so it needs some surgery.
We are having the same problem. Is the only solution at present to downgrade to 0.6.6?
It is one of those things that take a few days of work to fix. It is on my list :/
It might save some users a bit of time if you make a clearly marked 0.6.6-stable binary release with a warning and drop it at the top of the releases page. I know how much of a pain it can be to track down stuff like this. We just had a big problem due to the Spectre/Meltdown patches changing the way multithreaded interleaved system calls work. I wonder if this problem is related.
Actually the latest binary release, 0.7.0, works: https://github.com/biod/sambamba/releases/tag/v0.7.0 The problem is with later versions of the D compiler.
I got the same error, "Segmentation fault (core dumped)", using sambamba 0.7.0 installed with conda and running on an LSF cluster... :/
Conda builds with a recent ldc. That is the problem at this point.
A heads up: at this point, build sambamba with an older ldc, like the binary released on GitHub. We decided to replace the original BAM reader with a new one I wrote almost 2 years ago and have been testing. Rather than fix the GC-related issue, we are going to use the new bamreader, which is simpler and therefore (hopefully) easier to maintain. @NickRoz1, who worked as a Google Summer of Code student for me on column-based BAMs, is doing that work. One important difference is that the worker threads are no longer part of the reader itself. My theory is that performance should be similar ;)
Same segmentation fault issue here; I had installed on an HPC system with bioconda. Is there a way to know when the conda build has been updated? Alternatively, if I build from the 0.7.0 source code, how do I actually execute a tool like markdup? Thanks!
I am working on fixing this bug. Track progress here: https://thebird.nl/blog/work/rotate.html
@sjackman we have a new release of sambamba that fixes a long-standing bug with the D runtime and should now compile with all versions of ldc. See also https://travis-ci.org/biod/sambamba
Thanks for the heads up, Pjotr. Once you've tagged a new release, would you like to open a PR to bump the version of Sambamba in Brewsci/bio? You need to change only two lines, and you can do it from the GitHub web interface if you like. See https://github.com/brewsci/homebrew-bio/blob/c5b38cfea1eff4b18ae19d9dded00f990769de84/Formula/sambamba.rb#L6-L7
@sjackman it feels dirty to edit through the web interface. But hopefully it works :)
Hehe. Thanks! |
I hate to revive a closed issue, but I'm running into the segmentation fault issue with markdup in 0.8.0 as installed from conda. My build:
System info:
Does this happen reproducibly? Can you share your BAM file in that case? It is hard to fix stuff if we don't get reproducible errors.
An interesting twist on this. The file linked below reproducibly generates segmentation faults on my system when running single threaded (or with whatever the default number of threads is): https://drive.google.com/file/d/1owSZqrrWkrfbsEdf81iLrlByyOmfU4hs/view?usp=sharing
@pmagwene, thanks for the test file. This is a new issue and not related to the earlier segfaults. I can not reproduce it on my AMD Ryzen 7 3700X 8-Core Processor or on an Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz. You may want to try the static binary I'll release in a bit at https://github.com/biod/sambamba/releases. If that fails, I suggest opening a new issue for the i9-9900k.
Hello, I solved my segfault with --hash-table-size=4194304! But I don't quite understand the meaning of this parameter; can you explain it to me? Thank you so much.
Hi. This parameter sets the size of a hashmap.
https://en.m.wikipedia.org/wiki/Hash_table
https://github.com/biod/sambamba/blob/8a4102f9b2f95799dd0f432e6db15c57df75f537/sambamba/markdup.d#L318
If you have 10 key-value pairs but the size of the hashmap is only 9, then some key will be mapped to the same cell as another key (a collision). The same will happen if you have 9 non-distinct keys. In the case of sambamba, I think it moves duplicate values to a separate list, and then uses temp files to store these values if the list fills up. Maybe this functionality wasn't tested well, or it didn't account for a dataset of your size or with your number of duplicates (https://github.com/biod/sambamba/blob/8a4102f9b2f95799dd0f432e6db15c57df75f537/sambamba/markdup.d#L467).
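The collision behaviour described above can be illustrated with a small Python sketch. The table sizes and key counts below are illustrative only; they do not reflect sambamba's internal hash function or defaults:

```python
from collections import defaultdict

def bucket_collisions(keys, table_size):
    """Count how many keys land in an already-occupied bucket of a
    fixed-size hash table (with chaining). This mimics the effect of
    --hash-table-size: a bigger table means fewer collisions and so
    less overflow-list and temp-file handling."""
    buckets = defaultdict(int)
    collisions = 0
    for key in keys:
        slot = hash(key) % table_size
        if buckets[slot]:
            collisions += 1
        buckets[slot] += 1
    return collisions

reads = [f"read_{i}" for i in range(100_000)]
small = bucket_collisions(reads, 262_144)     # smaller table (illustrative size)
large = bucket_collisions(reads, 4_194_304)   # --hash-table-size=4194304
print(small > large)  # the larger table produces fewer collisions
```

Note that identical keys (true duplicates) collide no matter how large the table is; growing the table only removes the accidental collisions between distinct keys.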
Thanks so much for your quick and professional reply, I get it now! Have a nice day!
sambamba 0.6.8 from bioconda
I ran markdup on a cluster and it generated the segmentation fault error and left a core.* temporary file. This error is consistent among all the computers with different CPUs (either Xeon or AMD Opteron). Any thoughts?