Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix:typha:avoid concurrent map iteration and map write #7743

Closed
wants to merge 2 commits into from

Conversation

detailyang
Copy link
Contributor

@detailyang detailyang commented Jun 6, 2023

Description

We discovered a panic in typha (v3.21.4) due to concurrent iteration and writing on the map.

concurrent map iteration and map write

2023-06-06T05:27:41.055018511+08:00 stderr F
2023-06-06T05:27:41.055032404+08:00 stderr F goroutine 8664205 [running]:
2023-06-06T05:27:41.055036828+08:00 stderr F runtime.throw({0x1baac46, 0x110})
2023-06-06T05:27:41.055040534+08:00 stderr F    /usr/local/go-cgo/src/runtime/panic.go:1198 +0x71 fp=0xc01ba469d0 sp=0xc01ba469a0 pc=0x5a5dd1
2023-06-06T05:27:41.055043663+08:00 stderr F runtime.mapiternext(0x1960920)
2023-06-06T05:27:41.055046804+08:00 stderr F    /usr/local/go-cgo/src/runtime/map.go:858 +0x4eb fp=0xc01ba46a40 sp=0xc01ba469d0 pc=0x580b8b
2023-06-06T05:27:41.055058091+08:00 stderr F github.com/sirupsen/logrus.(*Entry).WithFields(0xc011125c20, 0xc01ba46b78)
2023-06-06T05:27:41.055061711+08:00 stderr F    /go/pkg/mod/github.com/projectcalico/logrus@v1.0.4-calico/entry.go:80 +0xb1 fp=0xc01ba46af8 sp=0xc01ba46a40 pc=0x68c3d1
2023-06-06T05:27:41.055064791+08:00 stderr F github.com/sirupsen/logrus.(*Entry).WithField(...)
2023-06-06T05:27:41.055067663+08:00 stderr F    /go/pkg/mod/github.com/projectcalico/logrus@v1.0.4-calico/entry.go:74
2023-06-06T05:27:41.055070845+08:00 stderr F github.com/sirupsen/logrus.(*Entry).WithError(...)
2023-06-06T05:27:41.055073651+08:00 stderr F    /go/pkg/mod/github.com/projectcalico/logrus@v1.0.4-calico/entry.go:69
2023-06-06T05:27:41.055076437+08:00 stderr F github.com/projectcalico/calico/typha/pkg/syncserver.(*connection).sendMsg(0xc0dcf4e300, {0x198d120, 0xc09d633c20})
2023-06-06T05:27:41.055079943+08:00 stderr F    /go/src/github.com/projectcalico/calico/typha/pkg/syncserver/sync_server.go:725 +0x1e9 fp=0xc01ba46cc8 sp=0xc01ba46af8 pc=0x1776dc9
2023-06-06T05:27:41.055083164+08:00 stderr F github.com/projectcalico/calico/typha/pkg/syncserver.(*connection).streamSnapshotToClient.func1()
2023-06-06T05:27:41.055086186+08:00 stderr F    /go/src/github.com/projectcalico/calico/typha/pkg/syncserver/sync_server.go:905 +0x1dd fp=0xc01ba46ff0 sp=0xc01ba46cc8 pc=0x177941d
2023-06-06T05:27:41.055098475+08:00 stderr F github.com/projectcalico/calico/typha/pkg/syncserver.(*connection).streamSnapshotToClient(0xc0dcf4e300, 0xc0d1274b40, 0xc14d882f60)
2023-06-06T05:27:41.055111535+08:00 stderr F    /go/src/github.com/projectcalico/calico/typha/pkg/syncserver/sync_server.go:928 +0x503 fp=0xc01ba47570 sp=0xc01ba46ff0 pc=0x1778e03
2023-06-06T05:27:41.055115864+08:00 stderr F github.com/projectcalico/calico/typha/pkg/syncserver.(*connection).sendSnapshotAndUpdatesToClient(0xc0dcf4e300, 0xc0d1274b40)
2023-06-06T05:27:41.055119025+08:00 stderr F    /go/src/github.com/projectcalico/calico/typha/pkg/syncserver/sync_server.go:744 +0xc7 fp=0xc01ba47fc0 sp=0xc01ba47570 pc=0x1776fc7
2023-06-06T05:27:41.055122217+08:00 stderr F github.com/projectcalico/calico/typha/pkg/syncserver.(*connection).handle·dwrap·6()
2023-06-06T05:27:41.055125205+08:00 stderr F    /go/src/github.com/projectcalico/calico/typha/pkg/syncserver/sync_server.go:572 +0x2a fp=0xc01ba47fe0 sp=0xc01ba47fc0 pc=0x177584a
2023-06-06T05:27:41.055128272+08:00 stderr F runtime.goexit()
2023-06-06T05:27:41.055131361+08:00 stderr F    /usr/local/go-cgo/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc01ba47fe8 sp=0xc01ba47fe0 pc=0x5d6581
2023-06-06T05:27:41.055134385+08:00 stderr F created by github.com/projectcalico/calico/typha/pkg/syncserver.(*connection).handle
2023-06-06T05:27:41.05513736+08:00 stderr F     /go/src/github.com/projectcalico/calico/typha/pkg/syncserver/sync_server.go:572 +0x3cf
2023-06-06T05:27:41.055140264+08:00 stderr F
2023-06-06T05:27:41.055149813+08:00 stderr F goroutine 1 [select, 30538 minutes]:

the method connection sendMsg as the following

// sendMsg sends a message to the client.  It may be called from multiple goroutines.
func (h *connection) sendMsg(msg interface{}) error {
	if h.cxt.Err() != nil {
		// Optimisation, don't bother to send if we're being torn down.
		return h.cxt.Err()
	}
	h.logCxt.WithField("msg", msg).Trace("Sending message to client")
	envelope := syncproto.Envelope{
		Message: msg,
	}
	startTime := time.Now()

	// The gob Encoder has its own mutex, but we need to synchronise around the flush operation as well.
	h.writeLock.Lock()
	defer h.writeLock.Unlock()

	// Make sure we have a timeout on the connection so that we can't block forever in the synchronous
	// part of the protocol.  After we send the snapshot we rely more on the layer 7 ping/pong.
	if err := h.maybeResetWriteTimeout(); err != nil {
		h.logCxt.WithError(err).Info("Failed to set write timeout when sending to client.")
		return err
	}

	if err := h.encoder.Encode(&envelope); err != nil {
		h.logCxt.WithError(err).Info("Failed to write to client")
                return err
	}
	if err := h.flushWriter(); err != nil {
		h.logCxt.WithError(err).Info("Failed to flush write to client")
		return err
	}
	h.summaryWriteLatency.Observe(time.Since(startTime).Seconds())
	return nil
}

The reason why typha panic is h.logCxt.Data is not thread safe see the issue sirupsen/logrus#1046 and sirupsen/logrus#1272. We cannot change the Data in different goroutine (log hooks).

Here is the reproduced code: try with go run -race *.go

package main

import (
	"errors"
	"os"
	"path"
	"runtime"
	"strings"
	"time"

	"github.com/sirupsen/logrus"
	log "github.com/sirupsen/logrus"
)

func main() {
	e := logrus.NewEntry(logrus.StandardLogger())
	ConfigureEarlyLogging()
	for i := 0; i < 8; i++ {
		go func() {
			e.Error("started processing A")
			e.WithError(errors.New("foo")).Error("started processing A")
		}()
	}
	time.Sleep(5 * time.Second)
}

func ConfigureEarlyLogging() {
	logrus.SetOutput(os.Stdout)
	logrus.AddHook(&ContextHook{})
}

const (
	fieldFileName   = "__file__"
	fieldLineNumber = "__line__"
)

type ContextHook struct {
}

func (hook ContextHook) Levels() []log.Level {
	return log.AllLevels
}

func (hook ContextHook) Fire(entry *log.Entry) error {
	pcs := make([]uintptr, 10)
	if numEntries := runtime.Callers(1, pcs); numEntries > 0 {
		pcs = pcs[:numEntries]
		frames := runtime.CallersFrames(pcs)
		for {
			frame, more := frames.Next()
			if !shouldSkipFrame(frame) {
				entry.Data[fieldFileName] = path.Base(frame.File)
				entry.Data[fieldLineNumber] = frame.Line
				break
			}
			if !more {
				break
			}
		}
	}
	return nil
}

func shouldSkipFrame(frame runtime.Frame) bool {
	if strings.Contains(frame.File, "runtime/extern.go") {
		return true
	}
	if strings.HasSuffix(frame.File, "/hooks.go") ||
		strings.HasSuffix(frame.File, "/entry.go") ||
		strings.HasSuffix(frame.File, "/logger.go") ||
		strings.HasSuffix(frame.File, "/exported.go") {
		if strings.Contains(frame.File, "/logrus") {
			return true
		}
	}
	if strings.HasSuffix(frame.File, "/lib/logutils/logutils.go") {
		if strings.Contains(frame.File, "/libcalico-go") {
			return true
		}
	}
	if strings.HasSuffix(frame.File, "/lib/logutils/ratelimitedlogger.go") {
		if strings.Contains(frame.File, "/libcalico-go") {
			return true
		}
	}
	return false
}

This PR is a quick fix that avoids typha panic by not using h.logCxt in sendMsg (sendMsg callers already print the logs when its' error).

The optimal approach would be to refactor the logutils in order to obtain the FILENAME and LINE NUMBER, as suggested in sirupsen/logrus#63.

Related issues/PRs

Todos

  • [ x ] Tests
  • [ x ] Documentation
  • [ x ] Release note

Release Note

TBD

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

Signed-off-by: detailyang <detailyang@gmail.com>
Signed-off-by: detailyang <detailyang@gmail.com>
@detailyang detailyang requested a review from a team as a code owner June 6, 2023 07:47
@marvin-tigera marvin-tigera added this to the Calico v3.27.0 milestone Jun 6, 2023
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Jun 6, 2023
@mazdakn
Copy link
Member

mazdakn commented Mar 19, 2024

@detailyang thanks for reporting this, and sorry for late reply. Have you been able to re-produce the issue in the newer Calico versions, maybe v3.26 or v3.27?

@radTuti radTuti modified the milestones: Calico v3.28.0, Calico v3.28.1 May 3, 2024
@fasaxc
Copy link
Member

fasaxc commented Jul 22, 2024

Looks like the bug is still there but I'd prefer to fix this on the write side instead of removing logs. The bug crops up when we do

h.logCtx.Info(...)

in one goroutine (which writes to the shared Data map). And:

h.logCtx.WithFields()

in another goroutine (which reads the map in order to make a copy).

General fix is to make a copy of the context for each goroutine:

logCtx := h.logCtx.WithField("thread", "<name of thread>")

(We could really do with swapping out logrus entirely, or fixing this in the common code, though)

@fasaxc
Copy link
Member

fasaxc commented Jul 22, 2024

Took a swing at fixing the underlying problem here: #9044

@fasaxc
Copy link
Member

fasaxc commented Jul 22, 2024

Actually, when trying to repro the original problem to write a test for my "fix", I found that it had been fixed upstream in logrus by adding event.Dup() to the log() method. Closing this since the bug is fixed.

@fasaxc fasaxc closed this Jul 22, 2024
@marvin-tigera marvin-tigera removed release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Jul 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants