Peer panic from fabric/gossip/util #5119

Open
TaoGunner opened this issue Jan 24, 2025 · 3 comments
@TaoGunner

TaoGunner commented Jan 24, 2025

Description

Once or twice a month, my peers crash at random with the message invalid memory address or nil pointer dereference.
The system runs more than 50 peers and 50 channels, so gossip traffic is very heavy.

Peers, Orderers v2.5.5

The section of code where the bug appears clearly requires adding a check for the existence of a map element.

Panic example 1

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x151112d]

goroutine 32595521 [running]:
github.com/hyperledger/fabric/gossip/util.(*Set).Remove(0x0, {0x191b6e0, 0xc0051e4020})
        /gossip/util/misc.go:118 +0x2d
github.com/hyperledger/fabric/gossip/util.(*PubSub).unSubscribe(0xc002b89840, 0xc0051e4020)
        /gossip/util/pubsub.go:115 +0x8d
github.com/hyperledger/fabric/gossip/util.(*PubSub).Subscribe.func1()
        /gossip/util/pubsub.go:107 +0x1b
created by time.goFunc
        /usr/local/go/src/time/sleep.go:176 +0x2d

Panic example 2

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x151112d]

goroutine 1909191 [running]:
github.com/hyperledger/fabric/gossip/util.(*Set).Remove(0x0, {0x191b6e0, 0xc0064e7fa0})
        /gossip/util/misc.go:118 +0x2d
github.com/hyperledger/fabric/gossip/util.(*PubSub).unSubscribe(0xc005aaf980, 0xc0064e7fa0)
        /gossip/util/pubsub.go:115 +0x8d
github.com/hyperledger/fabric/gossip/util.(*PubSub).Subscribe.func1()
        /gossip/util/pubsub.go:107 +0x1b
created by time.goFunc
        /usr/local/go/src/time/sleep.go:176 +0x2d
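
Both traces show Remove being called with a nil receiver (the 0x0 first argument). A minimal sketch of how that can happen: indexing a map of set pointers with a missing key yields nil, and calling a method that touches a field then faults. The Set type, its field layout, and the removeFromTopic helper below are simplified stand-ins assumed for illustration, not the Fabric source.

```go
package main

import (
	"fmt"
	"sync"
)

// Set is a simplified stand-in for the gossip util set (assumed layout).
type Set struct {
	items map[interface{}]struct{}
	lock  *sync.RWMutex
}

// NewSet returns an empty, usable set.
func NewSet() *Set {
	return &Set{items: make(map[interface{}]struct{}), lock: &sync.RWMutex{}}
}

// Remove deletes item from the set. It reads s.lock, so a nil s faults here.
func (s *Set) Remove(item interface{}) {
	s.lock.Lock()
	defer s.lock.Unlock()
	delete(s.items, item)
}

// removeFromTopic is a hypothetical helper that mimics the unguarded lookup
// and reports whether the call panicked.
func removeFromTopic(subs map[string]*Set, topic string, item interface{}) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	// A missing key yields a nil *Set; Remove is then called on nil.
	subs[topic].Remove(item)
	return false
}

func main() {
	subs := map[string]*Set{"present": NewSet()}
	fmt.Println(removeFromTopic(subs, "present", 1)) // false
	fmt.Println(removeFromTopic(subs, "missing", 1)) // true
}
```

The recover is only there to make the sketch observable; in the peer the panic comes from a timer goroutine and takes the whole process down.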

Steps to reproduce

  1. Create more than 50 peers.
  2. Create more than 50 channels.
  3. I don't know what else (it happens unpredictably).

TaoGunner added the bug label Jan 24, 2025
@yacovm
Contributor

yacovm commented Jan 26, 2025

This looks like a race that was caused by a topic collision.

The below code should fix it. Do you want to give it a try?

diff --git a/gossip/util/pubsub.go b/gossip/util/pubsub.go
index 34e679962..96082bb2f 100644
--- a/gossip/util/pubsub.go
+++ b/gossip/util/pubsub.go
@@ -112,8 +112,14 @@ func (ps *PubSub) Subscribe(topic string, ttl time.Duration) Subscription {
 func (ps *PubSub) unSubscribe(sub *subscription) {
        ps.Lock()
        defer ps.Unlock()
-       ps.subscriptions[sub.top].Remove(sub)
-       if ps.subscriptions[sub.top].Size() != 0 {
+       subscription, exists := ps.subscriptions[sub.top]
+       if !exists {
+               return
+       }
+
+       subscription.Remove(sub)
+
+       if subscription.Size() != 0 {
                return
        }
        // Else, this is the last subscription for the topic.
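
To see what the guard in the diff above changes, here is a self-contained sketch that replays one possible collision deterministically. It assumes that a subscriber looks up the topic's set under the lock but adds itself after releasing it; that ordering is an assumption about the surrounding code, which is not shown in this thread. The subSet type and the lookupOrCreate and collide helpers are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

type subscription struct{ top string }

// subSet is a tiny stand-in for the gossip set type.
type subSet map[*subscription]struct{}

// PubSub is a reduced model: a lock and a topic-to-set map.
type PubSub struct {
	sync.Mutex
	subscriptions map[string]subSet
}

// lookupOrCreate returns the set for topic, creating it if it is missing.
func (ps *PubSub) lookupOrCreate(topic string) subSet {
	ps.Lock()
	defer ps.Unlock()
	s, ok := ps.subscriptions[topic]
	if !ok {
		s = subSet{}
		ps.subscriptions[topic] = s
	}
	return s
}

// unSubscribe follows the patched logic: return early if the topic is gone.
func (ps *PubSub) unSubscribe(sub *subscription) {
	ps.Lock()
	defer ps.Unlock()
	s, exists := ps.subscriptions[sub.top]
	if !exists {
		return
	}
	delete(s, sub)
	if len(s) != 0 {
		return
	}
	// Last subscription for the topic: drop the topic entry.
	delete(ps.subscriptions, sub.top)
}

// collide replays the assumed interleaving for two subscribers on one topic
// and returns the number of topics left and the size of the orphaned set.
func collide() (topics, orphaned int) {
	ps := &PubSub{subscriptions: map[string]subSet{}}
	a := &subscription{top: "t"}
	b := &subscription{top: "t"}

	sa := ps.lookupOrCreate("t")
	sa[a] = struct{}{}
	sb := ps.lookupOrCreate("t") // b obtains the same set as a
	ps.unSubscribe(a)            // a expires first; the topic entry is removed
	sb[b] = struct{}{}           // b is added to a set the map no longer holds
	ps.unSubscribe(b)            // with the guard this returns instead of panicking
	return len(ps.subscriptions), len(sb)
}

func main() {
	fmt.Println(collide()) // 0 1
}
```

In this model the second subscriber ends up in a set that is no longer reachable from the map, so the guard prevents the crash but a publish on that topic would not reach it.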

@TaoGunner
Author

I'm sure this fix will correct the error.

There is no way to verify this quickly, because I have no reliable way to reproduce the crash. But if the error happens again, I'll let you know.

@pfi79
Contributor

pfi79 commented Jan 26, 2025

I've been looking at the code and trying to figure out how this happens.

  1. The subscription is created once and starts a timer that deletes it. It is deleted in one place and only once.
  2. The crash happens on a mutex call: the mutex is not found. That is, a method is called on a nil structure.

It seems to me that this can happen for one of the following reasons:

  1. The structure has not been created yet, and the deletion is already being called. (unlikely)
  2. The goroutine that deletes the structure has been moved to another core, but the structure has not been moved (which would be a bug in Go itself). @yacovm If that's the case, your solution of not deleting an object that we didn't find may leave a lot of objects undeleted.

It seems to me that the first thing to do is to repeat the same experiment on version 2.5.10 and/or 3.0.0. They are built with a different version of Go. Maybe the problem has already been fixed?
