SSE implementation that sheds stuck clients #14413

kasey · 2024-09-03T13:39:35Z

What type of PR is this?

Bug fix

What does this PR do? Why is it needed?

This PR splits event stream API processing across two staged queues. The first is a channel where events from subscriptions are buffered, and the second is an "outbox" with a capped size. The send to the outbox is guarded by a select statement, so if the outbox can't be written to, the event is dropped on the floor and a cleanup sequence is triggered, ending the event stream. Validation and filtering happen before the write to the outbox, but serialization is deferred via a closure until the event is ready to be written to the client.
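As a rough illustration of the guarded send described above, here is a minimal sketch. The `lazyReader` type, the `errOutboxFull` error, and the signature of `safeWrite` are assumptions made for this example and are not copied from the PR; the real code may select on other cases (such as a context) instead of `default`.

```go
package main

import (
	"errors"
	"fmt"
)

// lazyReader is a hypothetical stand-in for the closure that defers
// serialization until the event is about to be written to the client.
type lazyReader func() []byte

// errOutboxFull is a hypothetical sentinel error for this sketch.
var errOutboxFull = errors.New("outbox full: shedding slow client")

// safeWrite attempts a non-blocking send to the capped outbox. If the
// outbox has no free slot, the event is dropped and an error is returned
// so the caller can start its cleanup sequence.
func safeWrite(outbox chan<- lazyReader, lr lazyReader) error {
	select {
	case outbox <- lr:
		return nil
	default:
		return errOutboxFull
	}
}

func main() {
	outbox := make(chan lazyReader, 2) // capped size of 2 for the demo
	for i := 0; i < 3; i++ {
		n := i
		err := safeWrite(outbox, func() []byte {
			return []byte(fmt.Sprintf("event %d", n))
		})
		fmt.Println(n, err)
	}
}
```

With a capacity of two and nothing draining the channel, the third send fails, which is the signal to end the stream.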

A separate goroutine processes the outbox, draining the queue and calling flush on the response writer only once all queued events have been written to the client. This avoids unnecessary flushes and regains the benefits of connection buffering. If the keep-alive timer has fired, a keep-alive message is sent before flushing, but only if no events have been written. Rather than using a ticker, the keep-alive is tracked by a timer which is reset after the client flush completes.

Runtime errors like the wrong type being on a particular feed are logged as server-side errors and no longer pushed to the client as messages.

The unit test now uses what appears to be the popular Go SSE client library (github.com/r3labs/sse/v2) to test that expected events are received.

This is still WIP; TODOs (before PR is merged):

fix other unit tests (only operations feed test fixed so far) and check if we want to add more coverage
rework the test response writer to internally use io.Pipe to more accurately mirror the way the sse library will scan the byte stream. Otherwise the test will be flaky due to the scanner hitting the end of an internal byte buffer and caching an io.EOF error.
self-review and cleanup
update CHANGELOG
do some testing by hand - maybe add a prysmctl command to stream remote events to stdout

Which issue(s) does this PR fix?

Slow readers of the event stream api can cause issues in the node when the queue backs up. The previous algorithm was for a select loop to read from event channels and write each event to the http ResponseWriter followed by a flush. A separate ticker would trigger http keep-alive without knowledge of what other writes or flushes were ongoing. This could result in unwanted backpressure on the event queue.

The stream would subscribe to both state feed and operation event channels, regardless of whether topics for those channels had been requested. It would perform a lot of duplicate error checking for conditions that would indicate bugs in the node code itself and treat them as runtime errors to be pushed to the client.

Fixes #

Other notes for review

Acknowledgements

I have read CONTRIBUTING.md.
I have made an appropriate entry to CHANGELOG.md.
I have added a description to this PR with sufficient context for reviewers to understand this PR.

james-prysm · 2024-09-03T15:16:25Z

beacon-chain/rpc/eth/events/events_test.go

@@ -26,15 +26,25 @@ import (
 	"github.com/prysmaticlabs/prysm/v5/testing/assert"
 	"github.com/prysmaticlabs/prysm/v5/testing/require"
 	"github.com/prysmaticlabs/prysm/v5/testing/util"
+	sse "github.com/r3labs/sse/v2"


we're just using this for tests, is it worth using for the event itself? i avoided adding this in the initial implementation to avoid more dependencies.

I think the benefit for testing is pretty clear as it is testing conformance with the way clients read values. LMK if there's a particular piece of the lib that you think would improve the server side.

I'd have to double check myself. I was talking to someone about this library and they mentioned that it did a few too many copies to be performant for their use case; I don't know if that applies to our use case.

james-prysm · 2024-09-03T15:22:48Z

beacon-chain/rpc/eth/events/events.go

 				return
 			}
-		case <-ctx.Done():
-			return
+			if tp == PayloadAttributesTopic || (tp == HeadTopic && requestedTopics[PayloadAttributesTopic]) {


why isn't this part of the lazy reader? will we have to consider any other cases for this down the road?

I don't know if there will be other instances where we need to tack on payload attributes. It is an odd special case in the first place, in that it is handled in the stream event and not generalized into the state feed.

In most cases of the lazy reader we defer the serialization, but in this case there's a bunch of work we want to do up front - because we want to grab the head right when we get the event - and this has a bunch of error cases to handle. I shunted it here in a hurry because it was the odd case that broke the pattern of how the other lazy readers work as I refactored lazyReaderForEvent, but I do think it could make sense to move it to the cases for these topics under lazyReaderForEvent. I'll try that out and see how it looks.

james-prysm · 2024-09-03T22:35:33Z

beacon-chain/rpc/eth/events/events.go

+	go func() {
+		var err error
+		kaT := time.NewTimer(es.kaDur)
+		defer func() {


maybe we can merge defers here

Could you please add a reason for your suggestion?

The code would be functionally the same since defers execute in LIFO order. The reason I wrote it this way is that I like to immediately follow the setup of a thing in need of cleanup with the defer that cleans it up, i.e., make a timer, then set up the timer's cleanup.
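For anyone following along, a tiny sketch of the ordering in question. The function name and the recorded strings are made up for illustration; the point is only that each defer sits next to the thing it cleans up and that they run in reverse order.

```go
package main

import "fmt"

// cleanupOrder records the order in which two deferred cleanups run.
func cleanupOrder() []string {
	var order []string
	record := func(s string) { order = append(order, s) }
	func() {
		// Set up the first resource, then immediately defer its cleanup.
		defer record("release first resource")
		// Set up the second resource, then immediately defer its cleanup.
		defer record("release second resource")
	}()
	return order
}

func main() {
	// The second resource is released before the first.
	fmt.Println(cleanupOrder())
}
```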

james-prysm · 2024-09-03T22:41:27Z

beacon-chain/rpc/eth/events/events.go

+			case <-kaT.C:
+				err = es.writeOutbox(nil)
+				if err != nil {
+					return


if we aren't going to do anything with the errs here should we have a debug log at least behind a log level check? or is it not worth it here?

agreed we should log

oops I forgot that it is actually being logged in Cleanup - that's why var err error is predeclared above and passed to Cleanup. I'll add a comment to make that obvious. It's also currently logged as Error, but that should get bumped down to Debug.

updated comments

james-prysm · 2024-09-03T22:42:07Z

beacon-chain/rpc/eth/events/events.go

-				},
-				SignatureSlot: fmt.Sprintf("%d", updateData.Data.SignatureSlot),
-			},
+func (es *eventStreamer) writeOutbox(first lazyReader) error {


will need to reread, but can you explain why it's named first here? and how this differs from the rf on the outbox

first as in the first event to be written from the outbox to the client. The reason that we loop over the outbox within writeOutbox until it empties is that, in the time between writing the first event (or any subsequent event in the loop), additional events may have been enqueued while we are blocked waiting for the client to read. The goal here is to avoid unnecessary flushes or keep-alives, and the way we do that is process all the events as buffered writes, only issuing one flush at the end of the complete batch. Blocking in this method prevents the calling loop from running, preventing unnecessary keep-alive messages and flushes.

james-prysm · 2024-09-03T22:42:53Z

beacon-chain/rpc/eth/events/events.go

-		return send(w, flusher, LightClientFinalityUpdateTopic, update)
-	case statefeed.LightClientOptimisticUpdate:
-		if _, ok := requestedTopics[LightClientOptimisticUpdateTopic]; !ok {
+		written += 1


what do we do with this written counter? is it just for the written == check?

might be more clear to just keep a boolean

I had a thought to write a debug log in addition to treating it essentially as a boolean but yeah I'll just switch to boolean.

rkapka · 2024-09-06T18:42:41Z

api/server/structs/conversions.go

+				StateRoot:     hexutil.Encode(event.Data.AttestedHeader.StateRoot),
+				BodyRoot:      hexutil.Encode(event.Data.AttestedHeader.BodyRoot),
+			},
+			FinalizedHeader: &BeaconBlockHeader{


(you missed BodyRoot

It's actually missing in develop in the code I moved this construction from, but good catch!

rkapka · 2024-09-06T18:48:07Z

beacon-chain/rpc/eth/events/events.go

+	return topics
+}
+
+func validateTopics(topics []string) (bool, bool, map[string]bool, error) {


Can you use named return values to indicate what the two booleans represent?

I would rather not and rely on the calling code for context. Returning multiple booleans is a bit ugly though, let me see if there's a nicer way to do this.

rkapka · 2024-09-06T18:49:34Z

beacon-chain/rpc/eth/events/events.go

+		if topicsForStateFeed[topic] {
+			subState = true
+			requested[topic] = true
+			continue
+		}
+		if topicsForOpsFeed[topic] {
+			subOps = true
+			requested[topic] = true
+			continue
+		}


Suggested change

-		if topicsForStateFeed[topic] {
-			subState = true
-			requested[topic] = true
-			continue
-		}
-		if topicsForOpsFeed[topic] {
-			subOps = true
-			requested[topic] = true
-			continue
-		}
+		if topicsForStateFeed[topic] {
+			subState = true
+		}
+		if topicsForOpsFeed[topic] {
+			subOps = true
+		}
+		requested[topic] = true
+		continue

makes sense 👍

rkapka · 2024-09-06T18:59:31Z

beacon-chain/rpc/eth/events/events.go

 	}
-	return nil
+	f, ok := w.(StreamingResponseWriter)


why not just change the parameter type to StreamingResponseWriter?

That would be a runtime panic, whereas this way results in a runtime error that is safe to handle. This is sort of an odd case because we don't know what the http handler will give us until run time. Can you think of a way to force this to be a compile-time error?

rkapka · 2024-09-06T19:07:26Z

beacon-chain/rpc/eth/events/events.go

+				if err != nil {
+					return
+				}
+				// The timer has already fired here, so a call to Reset is safe.


This comment is in the wrong place, we call Reset in the next case

The reason for this comment is that, before Go 1.23, the Reset docs used to warn:
For a Timer created with NewTimer, Reset should be invoked only on stopped or expired timers with drained channels.

So I meant to convey that when Reset is called below, it's ok that the code in this case doesn't call Stop first, because we know the timer has fired. That's why in the other case (read from outbox) you see this comment:

We don't know if the timer fired concurrently to this case being ready, so we need to check the return of Stop and drain the timer channel if it fired.

I'm rewriting the comment to make it more clear.

Looks like in Go 1.23 they cleaned all this up (and timers without any references can be garbage collected even if the code forgot to call Stop, yay!).

rkapka · 2024-09-06T19:18:53Z

beacon-chain/rpc/eth/events/events.go

-		return send(w, flusher, LightClientFinalityUpdateTopic, update)
-	case statefeed.LightClientOptimisticUpdate:
-		if _, ok := requestedTopics[LightClientOptimisticUpdateTopic]; !ok {
+		written += 1


might be more clear to just keep a boolean

rkapka · 2024-09-06T20:12:05Z

beacon-chain/rpc/eth/events/events.go

+	}
+	for {
+		select {
+		case rf := <-es.outbox:


I am confused about how things are read from the outbox. You read from it in spawnWriteLoop in a for+select (case lr := <-es.outbox) and then again here in another for+select. Since this function is called from spawnWriteLoop, we are in a nested for loop where both the inner and outer loops read from the outbox (not concurrently, but still it's hard for me to wrap my head around what's going on).

Copying from my other comment in response to James asking a related question:

The reason that we loop over the outbox within writeOutbox until it empties is that, in the time between writing the first event (or any subsequent event in the loop), additional events may have been enqueued while we are blocked waiting for the client to read. The goal here is to avoid unnecessary flushes or keep-alives, and the way we do that is process all the events as buffered writes, only issuing one flush at the end of the complete batch. Blocking in this method prevents the calling loop from running, preventing unnecessary keep-alive messages and flushes.

it took a few times rereading to understand this

rkapka · 2024-09-11T19:12:11Z

beacon-chain/rpc/eth/events/events.go

+			}
+			// If the client can't keep up, the outbox will eventually completely fill, at which
+			// safeWrite will error, and we'll hit the below return statement, at which point the deferred
+			// Unsuscribe calls will be made and the event feed will stop writing to this channel.


Suggested change

-			// Unsuscribe calls will be made and the event feed will stop writing to this channel.
+			// unsuscribe calls will be made and the event feed will stop writing to this channel.

rkapka · 2024-09-11T19:12:20Z

beacon-chain/rpc/eth/events/events.go

+			// Unsuscribe calls will be made and the event feed will stop writing to this channel.
+			// Since the outbox and event stream channels are separately buffered, the event subscription
+			// channel should stay relatively empty, which gives this loop time to unsubscribe
+			// and cleanup before the event stream channel fills and disrupts other readers.


Suggested change

-			// and cleanup before the event stream channel fills and disrupts other readers.
+			// and clean up before the event stream channel fills and disrupts other readers.

james-prysm · 2024-09-17T22:27:42Z

beacon-chain/rpc/eth/events/events.go

+		httputil.HandleError(w, msg, http.StatusInternalServerError)
+		return
+	}
+	es, err := NewEventStreamer(eventFeedDepth, s.KeepAliveInterval)


nice looks pretty clean

james-prysm · 2024-09-17T22:30:08Z

beacon-chain/rpc/eth/events/events.go

+func (es *eventStreamer) StreamEvents(ctx context.Context, w StreamingResponseWriter, req *topicRequest, s *Server) error {
+	ctx, cancel := context.WithCancel(ctx)
+	defer cancel()
+	go es.recvEventLoop(ctx, cancel, req, s)


This was way easier to read 👍

james-prysm · 2024-09-19T16:21:03Z

beacon-chain/rpc/eth/events/events.go

@@ -189,15 +198,7 @@ type eventStreamer struct {
 	keepAlive time.Duration
 }

-func (es *eventStreamer) streamEvents(ctx context.Context, w StreamingResponseWriter, req *topicRequest, s *Server) error {


This was a little easier to read originally, I see that the recv Event loop is swapped with the outboxWriteLoop as well.

seems like this wrapper would be a nice to have IMO

james-prysm · 2024-09-19T16:22:31Z

beacon-chain/rpc/eth/events/events.go

+
+	ctx, cancel := context.WithCancel(ctx)
+	defer cancel()
+	api.SetSSEHeaders(w)


seems like it probably doesn't make a difference but maybe using sw will look more consistent.

prestonvanloon · 2024-09-26T15:15:17Z

WORKSPACE

+load("@bazel_gazelle//:deps.bzl", "gazelle_dependencies", "go_repository")
+
+go_repository(
+    name = "com_github_r3labs_sse_v2",
+    importpath = "github.com/r3labs/sse/v2",
+    sum = "h1:hFEkLLFY4LDifoHdiCN/LlGBAdVJYsANaLqNYa1l/v0=",
+    version = "v2.10.0",
+)
+
+go_repository(
+    name = "in_gopkg_cenkalti_backoff_v1",
+    importpath = "gopkg.in/cenkalti/backoff.v1",
+    sum = "h1:Arh75ttbsvlpVA7WtVpH4u9h6Zl46xuptxqLxPiSo4Y=",
+    version = "v1.1.0",
+)


Please revert this. Update your gazelle command to

bazel run //:gazelle -- update-repos -from_file=go.mod -to_macro=deps.bzl%prysm_deps -prune=true

and re-run gazelle please.

prestonvanloon · 2024-09-26T15:16:32Z

api/server/structs/conversions.go

+	}
+}
+
+func EventChainReorgFromV1(event *ethv1.EventChainReorg) *ChainReorgEvent {


Want to handle nil cases on these new functions? Any of them panic if the event is nil.

I don't think nil is possible (based on where it's used)

It does seem to be the pattern overall for these functions to rely on their callers for nil checks. I added a paranoid check to the top of lazyReaderForEvent which will at least ensure this does not happen for the sse streamer code that calls it.

james-prysm

Great PR really improves our event stream design

kasey requested a review from a team as a code owner September 3, 2024 13:39

kasey requested review from potuz, terencechain, rkapka and james-prysm September 3, 2024 13:39

james-prysm reviewed Sep 3, 2024

prestonvanloon self-requested a review September 5, 2024 15:35

rkapka reviewed Sep 6, 2024

kasey force-pushed the async-event-streamer branch 2 times, most recently from 0e2f0b1 to c2ae175 Compare September 9, 2024 20:03

kasey changed the title ~~WIP: sse implementation that sheds stuck clients~~ SSE implementation that sheds stuck clients Sep 9, 2024

rkapka reviewed Sep 11, 2024

james-prysm reviewed Sep 17, 2024

james-prysm reviewed Sep 19, 2024

kasey force-pushed the async-event-streamer branch from ba5fc43 to eb57054 Compare September 19, 2024 20:58

prestonvanloon reviewed Sep 26, 2024

kasey force-pushed the async-event-streamer branch 3 times, most recently from 0c045fe to 8fa168e Compare October 3, 2024 20:53

kasey added 3 commits October 4, 2024 15:47

sse implementation that sheds stuck clients

e1ab5d3

Radek and James feedback

d6fc5bc

Refactor event streamer code for readability

bf65ddd

kasey added 7 commits October 4, 2024 15:47

less-flaky test signaling

93e474d

test case where queue fills; fixes

0cde09b

add changelog entry

f008b47

james and preston feedback

9440068

swap our Subscription interface with an alias

11a1fc7

event.Data can be nil for the payload attr event

6652af0

deepsource

5fbdc06

kasey force-pushed the async-event-streamer branch from 7dcdaca to 5fbdc06 Compare October 4, 2024 20:48

james-prysm approved these changes Oct 4, 2024

prestonvanloon approved these changes Oct 4, 2024

prestonvanloon added this pull request to the merge queue Oct 4, 2024

Merged via the queue into develop with commit c11e339 Oct 4, 2024
18 checks passed

prestonvanloon deleted the async-event-streamer branch October 4, 2024 21:25
