Gateway stops indexing new blocks #590

Closed
Tracked by #6539
j1010001 opened this issue Sep 25, 2024 · 1 comment · Fixed by #618
Labels
Bug Something isn't working

Comments

@j1010001
Member

Problem

Over the last few days we have had multiple occurrences of the EVM Gateway (GW) stopping block indexing.
Examples:

  1. https://flow-foundation.slack.com/archives/C014WBGR1J9/p1727268188292399
  2. https://flow-foundation.slack.com/archives/C014WBGR1J9/p1727227388238869

The Flow process is still running; restarting it on the Gateway temporarily resolves the problem.

@peterargue
Contributor

We captured some data from the most recent hang. Besides the halt in indexing, the key indicator of the issue is that the number of goroutines starts to climb:
Image

The goroutine profile shows that they are waiting to acquire a lock while creating new subscriptions on the blocks provider:
Image

This lock is also held when notifying subscriptions about published messages:

func (p *Publisher[T]) Publish(data T) {
	p.mux.RLock()
	defer p.mux.RUnlock()
	for s := range p.subscribers {
		s.Notify(data)
	}
}
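
One way to keep a slow subscriber from stalling the lock is to copy the subscriber set under the read lock and call Notify only after releasing it. The sketch below is not the gateway's code: the `Subscriber` interface, the `subscribers` map layout, `NewPublisher` and `countingSub` are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Subscriber is a hypothetical stand-in for the gateway's subscription type.
type Subscriber[T any] interface {
	Notify(data T)
}

// Publisher mirrors the shape of the type above; the field layout is assumed.
type Publisher[T any] struct {
	mux         sync.RWMutex
	subscribers map[Subscriber[T]]struct{}
}

func NewPublisher[T any]() *Publisher[T] {
	return &Publisher[T]{subscribers: make(map[Subscriber[T]]struct{})}
}

func (p *Publisher[T]) Subscribe(s Subscriber[T]) {
	p.mux.Lock()
	defer p.mux.Unlock()
	p.subscribers[s] = struct{}{}
}

func (p *Publisher[T]) Unsubscribe(s Subscriber[T]) {
	p.mux.Lock()
	defer p.mux.Unlock()
	delete(p.subscribers, s)
}

// Publish copies the subscriber set under the read lock and releases the
// lock before calling Notify, so a blocked subscriber cannot stall
// Subscribe or Unsubscribe.
func (p *Publisher[T]) Publish(data T) {
	p.mux.RLock()
	snapshot := make([]Subscriber[T], 0, len(p.subscribers))
	for s := range p.subscribers {
		snapshot = append(snapshot, s)
	}
	p.mux.RUnlock()

	for _, s := range snapshot {
		s.Notify(data)
	}
}

// countingSub counts how many notifications it has received.
type countingSub struct{ n int }

func (c *countingSub) Notify(int) { c.n++ }

func main() {
	p := NewPublisher[int]()
	s := &countingSub{}
	p.Subscribe(s)
	p.Publish(1)
	p.Publish(2)
	p.Unsubscribe(s)
	p.Publish(3) // not delivered: s was removed before this call
	fmt.Println(s.n) // prints 2
}
```

The trade-off is that a subscriber can still receive a notification shortly after Unsubscribe returns, if a concurrent Publish took its snapshot first.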

The subscription's Notify method previously performed a blocking write of the error to the error channel:

func (b *Subscription[T]) Notify(data T) {
	err := b.callback(data)
	if err != nil {
		b.err <- err
	}
}

That has since been fixed (PR):

func (b *Subscription[T]) Notify(data T) {
	err := b.callback(data)
	if err != nil {
		select {
		case b.err <- err:
		default:
			// TODO: handle this better!
			panic(fmt.Sprintf("failed to send error to subscription %v", err))
		}
	}
}

Here is the code that deadlocks:

subs := models.NewSubscription(callback(notifier, rpcSub))
l := logger.With().
	Str("gateway-subscription-id", fmt.Sprintf("%p", subs)).
	Str("ethereum-subscription-id", string(rpcSub.ID)).
	Logger()
publisher.Subscribe(subs)
go func() {
	defer publisher.Unsubscribe(subs)
	for {
		select {
		case err := <-subs.Error():
			fmt.Println(err)
			return
		case err := <-rpcSub.Err():
			l.Debug().Err(err).Msg("client unsubscribed")
			return
		}
	}
}()

The deadlock can happen if the client disconnects and, before Unsubscribe() is called, the same subscription encounters an error. At that point there is no listener on the error channel, so the in-flight call to Notify() blocks while Publish() still holds the lock. That in turn blocks Unsubscribe(), preventing either goroutine from exiting. New subscriptions then block calling Subscribe(), which is the source of the goroutine leak. Finally, the block indexer uses this same publisher, so calls to Publish() within the block ingestion logic also block.
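
The chain described above can be reproduced in isolation. The following is a simplified sketch, not the gateway's code: `publisher`, `subscription`, `finishesWithin` and `reproduce` are hypothetical stand-ins, and Notify uses the original blocking send. Each call runs in its own goroutine with a timeout so the program can report which calls remain stuck.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type subscription struct{ errCh chan error }

// Notify mimics the original blocking send on the error channel.
func (s *subscription) Notify() { s.errCh <- errors.New("callback failed") }

type publisher struct {
	mux  sync.RWMutex
	subs map[*subscription]struct{}
}

func (p *publisher) Subscribe(s *subscription) {
	p.mux.Lock()
	defer p.mux.Unlock()
	p.subs[s] = struct{}{}
}

func (p *publisher) Unsubscribe(s *subscription) {
	p.mux.Lock()
	defer p.mux.Unlock()
	delete(p.subs, s)
}

func (p *publisher) Publish() {
	p.mux.RLock()
	defer p.mux.RUnlock()
	for s := range p.subs {
		s.Notify()
	}
}

// finishesWithin reports whether fn returns before d elapses.
// If fn never returns, its goroutine is deliberately left blocked.
func finishesWithin(d time.Duration, fn func()) bool {
	done := make(chan struct{})
	go func() { fn(); close(done) }()
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}

// reproduce returns whether Publish, Unsubscribe, Subscribe and a second
// Publish each completed within the timeout.
func reproduce() [4]bool {
	const timeout = 100 * time.Millisecond
	p := &publisher{subs: map[*subscription]struct{}{}}
	s := &subscription{errCh: make(chan error)}
	p.Subscribe(s)

	// Nobody reads s.errCh, as if the listener goroutine had already returned.
	var r [4]bool
	r[0] = finishesWithin(timeout, p.Publish)                  // stuck in Notify, holding the read lock
	r[1] = finishesWithin(timeout, func() { p.Unsubscribe(s) }) // waits for the read lock to be released
	r[2] = finishesWithin(timeout, func() {                     // queues behind the pending writer
		p.Subscribe(&subscription{errCh: make(chan error)})
	})
	r[3] = finishesWithin(timeout, p.Publish) // new readers wait behind the pending writer
	return r
}

func main() {
	fmt.Println(reproduce()) // prints [false false false false]
}
```

The last call shows why indexing halts as well: with sync.RWMutex, once a writer is waiting, new RLock calls block until that writer has acquired and released the lock.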
