Performance issues/raft thread saturation on mesh updates #16551

Closed
lstoll opened this issue Mar 7, 2023 · 1 comment · Fixed by #16552
lstoll commented Mar 7, 2023

Overview of the Issue

We were seeing performance issues during deployments in one of our mesh clusters, which has about 5k participants. Registration time would spike to 10+ seconds, causing errors and delays downstream. This corresponded to the raft/FSM thread being saturated, potentially for minutes at a time. Our overall update rate seemed pretty low, so this was surprising.

Screenshot 2023-03-07 at 11 44 06

Screenshot 2023-03-07 at 11 27 44

Screenshot 2023-03-07 at 11 27 49

Screenshot 2023-03-07 at 11 27 10

I grabbed a pprof to look into it a bit more, but unfortunately a single routine on a busy server didn't give the greatest information. Still, reflection stood out here and it gave us somewhere to look.

Screenshot 2023-03-07 at 11 34 10
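For anyone wanting to capture a similar profile in isolation, here is a minimal sketch of recording a CPU profile around a reflection-heavy loop using the standard `runtime/pprof` package. The `node` type and the `workload` and `profileWorkload` functions are hypothetical stand-ins, not Consul code, and the use of `reflect.DeepEqual` is only an assumption about what a reflection-bound hot path might look like.

```go
package main

import (
	"bytes"
	"fmt"
	"reflect"
	"runtime/pprof"
)

// node is a hypothetical stand-in for a registration record (not a Consul type).
type node struct {
	Name string
	Meta map[string]string
}

// workload performs n reflection-based comparisons so the profile has
// something to attribute CPU time to. It returns how many compared equal.
func workload(n int) int {
	a := node{Name: "web", Meta: map[string]string{"dc": "dc1"}}
	b := node{Name: "web", Meta: map[string]string{"dc": "dc1"}}
	equal := 0
	for i := 0; i < n; i++ {
		if reflect.DeepEqual(a, b) {
			equal++
		}
	}
	return equal
}

// profileWorkload runs the workload under the CPU profiler and returns the
// raw profile bytes, which could be written to a file for `go tool pprof`.
func profileWorkload(n int) ([]byte, int, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, 0, err
	}
	equal := workload(n)
	pprof.StopCPUProfile() // flushes the profile into buf
	return buf.Bytes(), equal, nil
}

func main() {
	prof, equal, err := profileWorkload(200000)
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	fmt.Printf("%d equal comparisons, %d bytes of profile data\n", equal, len(prof))
}
```

Writing the returned bytes to a file and opening it with `go tool pprof` should show the time attributed to the `reflect` package, which is roughly the kind of signal described above.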

Reproduction Steps

To confirm what we were seeing here, I wrote a quick test case to run a snapshot of our data through this code path.

func BenchmarkRealData(b *testing.B) {
	sfpath := os.Getenv("STATE_BIN_PATH")
	if sfpath == "" {
		b.Skipf("need to set STATE_BIN_PATH env var")
	}

	snapfile, err := os.Open(sfpath)
	if err != nil {
		b.Fatalf("failed opening snap file %s: %v", sfpath, err)
	}
	// close the snapshot file once the benchmark finishes
	defer snapfile.Close()

	store := state.NewStateStore(nil)
	var idx uint64 = 1

	handler := func(header *fsm.SnapshotHeader, msg structs.MessageType, dec *codec.Decoder) error {
		switch {
		case msg == structs.RegisterRequestType:
			var req structs.RegisterRequest
			if err := dec.Decode(&req); err != nil {
				return err
			}
			if err := store.EnsureRegistration(idx, &req); err != nil {
				return fmt.Errorf("ensuring registration: %v", err)
			}
		default:
			// need to read the message, to advance the decoder past it
			var val interface{}
			if err := dec.Decode(&val); err != nil {
				return fmt.Errorf("decoding message: %w", err)
			}
		}
		return nil
	}
	if err := fsm.ReadSnapshot(snapfile, handler); err != nil {
		b.Fatalf("Error reading snapshot %s: %v", sfpath, err)
	}
}

This confirmed that most of the registration time is spent in reflection.

image
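To give a feel for why reflection can dominate a hot path like this, here is a rough sketch comparing a reflection-based equality check with a hand-written one. This is an illustration under assumptions: the issue does not say which reflection call is responsible, the `service` type is a hypothetical simplification rather than a Consul struct, and `equalReflect`, `equalDirect`, and `sample` are names made up for the example.

```go
package main

import (
	"fmt"
	"reflect"
	"testing"
)

// service is a hypothetical, simplified record (not a Consul type).
type service struct {
	ID   string
	Port int
	Tags []string
	Meta map[string]string
}

// equalReflect compares two records using the generic reflection walker.
func equalReflect(a, b *service) bool { return reflect.DeepEqual(a, b) }

// equalDirect compares field by field without reflection. Unlike
// reflect.DeepEqual, it treats nil and empty slices/maps as equal.
func equalDirect(a, b *service) bool {
	if a.ID != b.ID || a.Port != b.Port ||
		len(a.Tags) != len(b.Tags) || len(a.Meta) != len(b.Meta) {
		return false
	}
	for i := range a.Tags {
		if a.Tags[i] != b.Tags[i] {
			return false
		}
	}
	for k, v := range a.Meta {
		if w, ok := b.Meta[k]; !ok || w != v {
			return false
		}
	}
	return true
}

// sample builds a small record to compare.
func sample() *service {
	return &service{
		ID:   "web-1",
		Port: 8080,
		Tags: []string{"v1", "primary"},
		Meta: map[string]string{"team": "infra"},
	}
}

// sink keeps the compiler from discarding the comparison results.
var sink bool

func main() {
	a, b := sample(), sample()
	viaReflect := testing.Benchmark(func(tb *testing.B) {
		for i := 0; i < tb.N; i++ {
			sink = equalReflect(a, b)
		}
	})
	viaDirect := testing.Benchmark(func(tb *testing.B) {
		for i := 0; i < tb.N; i++ {
			sink = equalDirect(a, b)
		}
	})
	fmt.Printf("reflect.DeepEqual: %d ns/op\n", viaReflect.NsPerOp())
	fmt.Printf("hand-written:      %d ns/op\n", viaDirect.NsPerOp())
}
```

The absolute numbers depend on the machine and the shape of the data, so treat the output as a relative comparison only. The real registration structs are considerably larger than this toy record.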

Consul info for both Client and Server

Consul 1.14

Operating system and Environment details

Linux, amd64.
