Skip to content
This repository was archived by the owner on Oct 12, 2022. It is now read-only.
/ druntime Public archive

Conversation

@9il
Copy link
Member

@9il 9il commented May 7, 2016

No description provided.

@dlang-bot
Copy link
Contributor

Fix Bugzilla Description
15939 GC.collect causes deadlock in multi-threaded environment

@DmitryOlshansky
Copy link
Member

LGTM

@Geod24
Copy link
Member

Geod24 commented May 7, 2016

Release notes ?

@MartinNowak
Copy link
Member

MartinNowak commented May 8, 2016

Wait, we don't merge this until we understand the actual cause, and why real-time signals fix the issue.

@9il
Copy link
Member Author

9il commented May 8, 2016

Whats the difference between RT signals and standard signals?

  • More than one RT signal can be queued for the process if it has the signal blocked while someone sends it. In standard signals only one of a given type is queued, the rest is ignored.
  • Order of delivery of RT signal is guaranteed to be the same as the sending order.

Why this bug was not found before?

  • This issue occurs on a distributed server with 2 CPUs (8/16 core/threads each). Probably 1 CPUs server would work well because USR signal would have the same behavior like RT signals.
  • Multiple services with multiple threads heavy works on multiple machines during multiple days. The probability to find this bug is high on the platform where it was found.

Is it safe to change USR signals to RT signals?

Yes, glibc uses first 2-3 RT signals for thread management.

@WalterWaldron
Copy link
Contributor

WalterWaldron commented May 9, 2016

More than one RT signal can be queued for the process if it has the signal blocked while someone sends it. In standard signals only one of a given type is queued, the rest is ignored.

There should only be one signal pending at a time due to the signal masks & synchronization. If there are multiple then there's a bug in the code. Working around the issue might mean that instead of a deadlock the issue shows up as memory corruption and other difficult to track down symptoms.

Order of delivery of RT signal is guaranteed to be the same as the sending order.

"sending order guarantee" applies to multiple of the same signal (you can add data to RT signals, so the attached data is received in order.) Otherwise they have fixed priority (lowest numbered have highest priority,) and are often delivered at lower priority than standard signals (to help avoiding coalesced standard signals.)

As I've previously stated, signal masks & synchronization should make ordering completely irrelevant for this code.

Is it safe to change USR signals to RT signals?

It breaks all applications that expect having SIGRTMIN / SIGRTMIN+1 available for use.

I'm not against a change if it's necessary, I just want a solid justification instead of scant details and hand waving.

Why this bug was not found before?

  • This issue occurs on a distributed server with 2 CPUs (8/16 core/threads each).

So this is a NUMA machine? It would have been nice if this detail was included in the bug report.
I wonder if it is a synchronization bug. Have the mutex / spinlock / semaphore implementations been sufficiently tested on this machine to rule this out?

@9il
Copy link
Member Author

9il commented May 9, 2016

So this is a NUMA machine? It would have been nice if this detail was included in the bug report.
I wonder if it is a synchronization bug. Have the mutex / spinlock / semaphore implementations been sufficiently tested on this machine to rule this out?

Just multiple distributed servers with 2 CPUs each. Yes, mutex and semaphore works fine with RT signals (I don't know if signal type is relevant for them)

@leandro-lucarella-sociomantic
Copy link
Contributor

Whats the difference between RT signals and standard signals?

Could you add all that information in the commit message?

@joakim-noah
Copy link
Contributor

joakim-noah commented Jul 2, 2016

I just saw this PR, tried it on Android and it seems to work around a long-standing issue where SIGUSR1 caused Android apps to hang, likely because the Android runtime also uses it to force a GC collection.

Could you add the definitions of SIGRTMIN and SIGRTMAX for bionic as part of this pull? Here's what you'd have to add after the FreeBSD section in core.sys.posix.signal: Never mind, added it myself.

@joakim-noah
Copy link
Contributor

What's the holdup on this?

@9il
Copy link
Member Author

9il commented Aug 28, 2016

We need consensus about this issue first. At least from @MartinNowak

@dlang-bot dlang-bot added Needs Rebase needs a `git rebase` performed stalled and removed Needs Rebase needs a `git rebase` performed labels Jan 1, 2018
@dlang-bot dlang-bot added the Needs Rebase needs a `git rebase` performed label May 27, 2021
@rv32ima
Copy link
Contributor

rv32ima commented Oct 13, 2021

Ping @MartinNowak. Is there any reason why this hasn't been merged, other then a pending rebase? SIGUSR1/SIGUSR2 seems rather problematic to use for something as low-level as the runtime, and I could easily imagine how this could shoot a developer in the foot if they're not aware. In fact, I think I've actually been burnt by this exact problem, as a C library that I wrapped was using SIGUSR1/SIGUSR2, and it was causing a variety of seemingly random issues (and as a library developer, I don't believe that I should override global GC constants).

In place of SIGUSR1/SIGUSR2, it seems safer to use RT signals, as they're less frequently used (and typically, if you're using RT signals, you're probably aware of what the runtime uses) and won't lead to confusion among new developers.

As well, SIGRTMIN thru SIGRTMIN + 8 are guaranteed to exist from Linux kernel 2.2+ (see POSIX.1-2001, signal(7) man pages, etc), so this should not break backwards compatibility in a significant manner. LGTM.

@rv32ima
Copy link
Contributor

rv32ima commented Oct 13, 2021

It breaks all applications that expect having SIGRTMIN / SIGRTMIN+1 available for use.

This would be a silent breakage, indeed (and one that would go unnoticed if the GC is used sparingly enough to avoid collection -- not good!). I think a changelog entry would be in order, but I believe that this is a pretty big pain point that should be resolved.

@dlang-bot dlang-bot removed the stalled label Oct 13, 2021
@RazvanN7
Copy link
Contributor

RazvanN7 commented Nov 2, 2021

@9il Please rebase this. I cannot do it for you since you do not allow for external pushes to your fork.

@RazvanN7
Copy link
Contributor

RazvanN7 commented Nov 9, 2021

Closing due to inactivity. This issue is reachable through the link on bugzilla.

@RazvanN7 RazvanN7 closed this Nov 9, 2021
@dlang-bot dlang-bot added the Bug Fix Include reference to corresponding bugzilla issue label Nov 9, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Bug Fix Include reference to corresponding bugzilla issue Needs Rebase needs a `git rebase` performed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants