
MercureHub on Timeout #758

Closed · ihab95 opened this issue Apr 6, 2023 · 26 comments
ihab95 commented Apr 6, 2023

Hello Mercure team and the community,

We installed and integrated Mercure in a Docker container, and we are now in the testing stage, close to going live.

Everything was fine until we hit an incomprehensible bug, and we need your help to understand it:

Mercure completely stopped processing our data: it is impossible to push any message to the Mercure hub. Our publisher workers are hitting a timeout. The Mercure container is still up, and the Caddy healthz route responds with a 200.

The only log we got is:
{"level":"debug","ts":1680709312.2403877,"logger":"http.handlers.mercure","msg":"write timeout: close the connection","subscriber":{"id":"urn:uuid:5fd605c2-7762-4819-839d-a3eca847488a","last_event_id":""...

We have no way of knowing why our Mercure hub is unreachable.

ihab95 changed the title from "Timeout" to "MercureHub on Timeout" on Apr 6, 2023
dunglas (Owner) commented Apr 6, 2023

That's weird. Could it be a deadlock? What happens if you restart the hub?

If you can reproduce, could you enable the debug mode and start a profile to see what's going on? https://mercure.rocks/docs/hub/debug

ihab95 (Author) commented Apr 7, 2023

We already restarted the hub, which solved the problem, but it will probably come back...
We're not able to reproduce it at the moment; we are waiting for another occurrence of the bug.

Thanks for the tip, we just enabled debug mode; it should help when the bug happens again.

mxmp210 commented Apr 9, 2023

I can confirm something similar as well: if the hub has been running for a long time (> 1 week), it slows down significantly and requests take forever to process; restarting the hub resets everything. This has happened quite a few times recently on a chat app I'm working on. We have quite a lot of updates happening all the time.

The local transport is used, so it is just a relay machine; no disk is involved in the config (i.e. no Bolt transport). In our case publishing and subscribing still happen, but request / update times are > 6 s to > 20 s, which suggests either Caddy or Mercure is having internal resource issues, or we are missing something in the config that keeps the hub busy (which is what I'm trying to investigate).

I was reluctant to report the issue until I had figured out whether it was the config, but it seems other users are seeing the same outcome.
It looks like there is an internal memory leak that slows things down: rebooting Mercure reconnects all users (via client-side reconnection), and you can see how much less memory they use with the same amount of load and the same number of connections. This pattern is also unpredictable, because it happens after either a certain number of connections or a certain number of updates as the hub gets used; I'm not sure which one yet.

image

dunglas (Owner) commented Apr 9, 2023

@mxmp210 could you take a memory profile when it happens? This would help a lot: https://mercure.rocks/docs/hub/debug

dunglas (Owner) commented Apr 9, 2023

Could you also enable the metrics and share the data you get before and after the restart please?

mxmp210 commented Apr 10, 2023

@dunglas We are already collecting metrics, which show the correlation between connections, memory, and subscriptions.

Enabling debug in production would be a bit hard; also, the original hub was moved yesterday without my knowledge, so it is going to take some time before it starts to congest again after several days of running.

Here are memory profiles from the metrics collected over the past 15 days, with 3 hub restarts when it became too slow.
Ignore the last day, as we moved the hub to a new location to see if the network is an issue.

Memory Profile :
image

Go Allocated Memory:
image

Go Idle Bytes [this is an interesting one, as we can see there's a lot of memory sitting around while subscribers come and go]:
image

Subscriber Counter:
image

Updates Counter:
image

Connections :
image

Goroutines [for the number of updates processed - the peaks are the system trying to push all pending updates]
image

Some other stats:
Next GC Bytes - this is also interesting, as Go might not be getting enough breathing room to run the GC?
image

If you need more detailed data - let me know.

mxmp210 commented Apr 10, 2023

One note: there are times when some connections do not time out (usually 10~50 connections). There is no proxy involved; Caddy runs natively on the machine and communicates directly with the outside world. Then there are times when this number keeps increasing without notice.

The server has had the same amount of connections for the past few months, peaking for several hours and settling down to an average number of connections that constantly hits Mercure with changing subscriptions. It probably started happening after the bundled Caddy version was changed, though we compile from source so the latest Caddy is always used.

dunglas (Owner) commented Apr 10, 2023

Thanks for the details! Could you tell us the exact version of Caddy you use (the output of mercure version)?

cc @mholt @francislavoie as it looks related to a recent change in Caddy.

mholt commented Apr 10, 2023

Thanks; could we have some more information:

  • What Caddy version?
  • Did this start happening after an upgrade (from which version)?
  • Are these websockets? (I know Mercure uses SSE -- but I'm asking anyway.)
  • What is the Caddy config?
  • How can we minimally reproduce this?

mxmp210 commented Apr 10, 2023

@mholt Current deployed Caddy version: v2.6.4 h1:2hwYqiRwk1tf3VruhMpLcYTg+11fCdr8S3jhNAdnPy8=
Is it WebSocket?: Yes - SSE over websocket from the Mercure hub
Did this start happening after an upgrade?: This is recent behavior. We had similar traffic earlier with the same config and detected no memory issues / slowdowns. Earlier we had legacy builds deployed in production, followed by the first Caddy-based build [no memory issues] and then the second one released in February 2023.

The deployment has all standard Caddy modules, with Mercure as a custom module:

...
  Standard modules: 100
http.handlers.mercure
  Non-standard modules: 1
  Unknown modules: 0

Caddy config, loaded in JSON format to omit INFO messages:

{
    "logging": {
        "logs": {
            "default": {
                "exclude": [
                    "http.log.access.logFiltered",
                    "http.handlers.mercure"
                ]
            },
            "logFiltered": {
                "level": "ERROR",
                "include": [
                    "http.log.access.logFiltered"
                ]
            }
        }
    },
    "apps": {
        "http": {
            "servers": {
                "srv0": {
                    "listen": [
                        ":443"
                    ],
                    "routes": [
                        {
                            "match": [
                                {
                                    "host": [
                                        "<domain_name>"
                                    ]
                                }
                            ],
                            "handle": [
                                {
                                    "handler": "subroute",
                                    "routes": [
                                        {
                                            "handle": [
                                                {
                                                    "handler": "subroute",
                                                    "routes": [
                                                        {
                                                            "handle": [
                                                                {
                                                                    "cors_origins": [
                                                                        "<domain_name>"
                                                                    ],
                                                                    "handler": "mercure",
                                                                    "publish_origins": [
                                                                        "*"
                                                                    ],
                                                                    "publisher_jwt": {
                                                                        "key": "!ChangeMe!"
                                                                    },
                                                                    "subscriber_jwt": {
                                                                        "key": "!ChangeMe!"
                                                                    },
                                                                    "transport_url": "local://local",
                                                                    "subscriptions": true,
                                                                    "write_timeout": 0
                                                                }
                                                            ]
                                                        },
                                                        {
                                                            "handle": [
                                                                {
                                                                    "handler": "subroute",
                                                                    "routes": [
                                                                        {
                                                                            "handle": [
                                                                                {
                                                                                    "handler": "metrics"
                                                                                }
                                                                            ]
                                                                        }
                                                                    ]
                                                                }
                                                            ],
                                                            "match": [
                                                                {
                                                                    "path": [
                                                                        "/metrics"
                                                                    ]
                                                                }
                                                            ]
                                                        },
                                                        {
                                                            "handle": [
                                                                {
                                                                    "handler": "static_response",
                                                                    "status_code": 200
                                                                }
                                                            ],
                                                            "match": [
                                                                {
                                                                    "path": [
                                                                        "/healthz"
                                                                    ]
                                                                }
                                                            ]
                                                        },
                                                        {
                                                            "handle": [
                                                                {
                                                                    "body": "Not Found",
                                                                    "handler": "static_response",
                                                                    "status_code": 404
                                                                }
                                                            ]
                                                        }
                                                    ]
                                                },
                                                {
                                                    "handler": "file_server",
                                                    "hide": [
                                                        "/etc/caddy/CaddyFile.json"
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ],
                            "terminal": true
                        }
                    ],
                    "logs": {
                        "logger_names": {
                            "<domain_name>": "logFiltered"
                        }
                    }
                }
            }
        }
    }
}

Note: in an earlier build the hub failed to start with a larger write_timeout, so it was set to zero. The new one follows the same suit as given in the config, which keeps connections open unless interrupted. This could be a possible issue with open & stalled connections, but we have always had a difference of < 100 connections as traffic scales.

How to minimally reproduce it?

  • Deploy the hub normally and keep a constant flow of subscribers connecting and disconnecting.
  • Keep the update rate high.
  • Let it run for a long period of time.
  • Over time the hub's publishing rate slows down significantly, i.e. Caddy cannot process POST requests within the millisecond range.
  • The subscription rate usually remains the same.

For a Gatling test this can translate to the following config [not tested for a long duration]:

  /** Number of concurrent subscribers initially connected */
  val InitialSubscribers =
    Properties.envOrElse("INITIAL_SUBSCRIBERS", "10").toInt

  /** Additional subscribers rate (per second) */
  val SubscribersRateFrom =
    Properties.envOrElse("SUBSCRIBERS_RATE_FROM", "2").toInt
  val SubscribersRateTo =
    Properties.envOrElse("SUBSCRIBERS_RATE_TO", "5").toInt

  /** Publishers rate (per second) */
  val PublishersRateFrom =
    Properties.envOrElse("PUBLISHERS_RATE_FROM", "8").toInt
  val PublishersRateTo = Properties.envOrElse("PUBLISHERS_RATE_TO", "10").toInt

  /** Duration of injection (in seconds) */
  val InjectionDuration =
    Properties.envOrElse("INJECTION_DURATION", "600").toInt

  /** How long a subscriber can stay connected at max (in seconds) */
  val ConnectionDuration =
    Properties.envOrElse("CONNECTION_DURATION", "7200").toInt

The tests are not that smart at creating real-world load situations, though, as they only publish to one topic with multiple subscribers listening to the same resource. Since this happens over a very long time period, after several million messages and a few thousand subscriptions, something similar is required to reproduce accurate results.
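
For what it's worth, here is a rough Go sketch of such a multi-topic load pattern (the hub URL, topic URIs, JWT environment variables and the rates are placeholders; it only uses the public Mercure publish/subscribe API and is a starting point, not a tuned benchmark):

package main

import (
	"bufio"
	"context"
	"fmt"
	"net/http"
	"net/url"
	"os"
	"strings"
	"time"
)

var (
	hubURL        = "http://localhost:3000/.well-known/mercure" // placeholder
	publisherJWT  = os.Getenv("PUBLISHER_JWT")
	subscriberJWT = os.Getenv("SUBSCRIBER_JWT")
)

// subscribe opens an SSE connection for one topic and drains it until the
// context or the server closes it.
func subscribe(ctx context.Context, topic string) {
	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, hubURL+"?topic="+url.QueryEscape(topic), nil)
	req.Header.Set("Authorization", "Bearer "+subscriberJWT)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
	}
}

// publish POSTs one update to a topic using the standard form-encoded body.
func publish(topic, data string) {
	form := url.Values{"topic": {topic}, "data": {data}}
	req, _ := http.NewRequest(http.MethodPost, hubURL, strings.NewReader(form.Encode()))
	req.Header.Set("Authorization", "Bearer "+publisherJWT)
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")

	if resp, err := http.DefaultClient.Do(req); err == nil {
		resp.Body.Close()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()

	// Spread subscribers across many topics instead of a single shared one.
	for i := 0; i < 500; i++ {
		go subscribe(ctx, fmt.Sprintf("https://example.com/chat/%d", i%50))
	}

	// Sustained publish rate, also spread across the topics.
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()

	for i := 0; ; i++ {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			publish(fmt.Sprintf("https://example.com/chat/%d", i%50), `{"msg":"ping"}`)
		}
	}
}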

mholt commented Apr 10, 2023

@mxmp210 Thanks, that's helpful. That's a lot of time and work to repro, but I'll try to go with the information provided so far:

> Did this start happening after an upgrade?: This is recent behavior. We had similar traffic earlier with the same config and detected no memory issues / slowdowns. Earlier we had legacy builds deployed in production, followed by the first Caddy-based build [no memory issues] and then the second one released in February 2023.

I'm not quite sure I understand; do you mean you were using 2.6.3 and it was working fine, but then in 2.6.4 the issue started happening?

Also, do any of these requests use a Chunked encoding? (Transfer-Encoding: chunked)

Or, are the backends that the events are coming from using php/fastcgi?

dunglas (Owner) commented Apr 10, 2023

@mxmp210 Are you sure that you're using WebSockets? It's very unlikely: Mercure uses SSE, which is a totally different technology from WebSockets.

mxmp210 commented Apr 11, 2023

@mholt It might not be related to the latest updates but to other parts of Caddy or Mercure that handle the connection pool. We are not using any compression, as you can see from the negotiation headers between client and server (i.e. no content-encoding header in the screenshot). The Mercure hub sits between the client and the backend: the backend POSTs updates to Mercure topics, and the hub then takes care of sending updates to clients (subscribers) by writing data in the event stream format using the MiddlewareHandler interface. I'll try to grab a memory snapshot when this happens again; the leak is so slow that it cannot be reproduced with a few thousand requests.

The last crash had about 3M updates and more than 2.5M subscriptions handled before it started to act slow.

@dunglas Calling it WebSockets would be a bit confusing; it is an event stream sent over a raw connection. It is indeed SSE and not to be confused with other socket-based tech.

Note that this happens to POST requests only; they become too slow (2 s to 6 s per request). Once Caddy is able to take the request, sending to the clients is done instantly (i.e. the Mercure part dispatches updates to topics instantly). We have grabbed metrics and other endpoints, which are returned without delays.

Attached headers:
Screenshot 2023-04-11 124059

mxmp210 commented Apr 16, 2023

@mholt @dunglas I have an update to share, with debug profiles. As expected, Mercure tripped again after several days of usage.
One thing changed in the config: "write_timeout": "600s". This allowed us to sustain a bit more traffic before the incident, and we can confirm it closes the zombie connections left by clients, which we had hoped were causing this issue. The server slowed down a bit when the incident started, but it recovered once the peak connections closed and the count went down, leaving the server with an increased memory footprint.

Once the hub starts to leak memory, it creates stalled connections; the subscription API does not report them, but the metrics API still counts those hidden connections, so we know they are there in memory.

This morning we let it run to occupy more memory, and connections were not slowed down yet. It happens when we get near 1K connections and a significant number of updates.

Here is the common pattern observed from the metrics [posting it here so others can comment on it]:

  1. It happens roughly every 8 days for us, when the connection threshold is reached
  2. About 3.1M updates, just as in the last 2 resets
  3. Going beyond a 250K connections-served counter [365K, 268K and the last one around 457K]
  4. There are no zombie "active" connections after the write timeout period when we look through the subscription API; it displays the correct number of active users / subscriptions
  5. There is a difference between the counter value and the subscription API value for connections [at least 270+ connections as of writing, which are kind of lost]
  6. When this happens we see memory usage spike in increments of 20~35 MB per metrics sample, until all users drop their connections and reconnect. This closes some connections, but the allocated memory keeps its incremented level.
  7. This time the server recovered after users reconnected, and the slow processing has vanished as of writing

Note for consideration: we have an identical traffic pattern every day, with a similar number of connections. [See metrics below]

Debug profile after 1 day of the new deployment:

pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz

Debug profile after a few hours of the memory leak:

pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.005.pb.gz

Debug profile after 18 hours of the memory leak:

pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.007.pb.gz

I'm not sure if this is correct, but the following part keeps increasing after the leak occurs:

This morning [about 500 MB occupied after the leak started]:
image

As of writing, after about 18 hours [860 MB+ occupied, and we are not yet at peak hours]:
image

========================================================

Metrics for comparison: GREEN is the OLD hub with no write_timeout, YELLOW is the new hub with write_timeout set to 5 minutes as per @dunglas's suggestion:

Memory :
image

Subscriptions Handled [ HTTP Requests in terms of SSE handled ]
image

Updates Done by mercure
image

Connections Pattern - i.e. Active HTTP connections on server
image

Full metrics chart for better understanding
image

mholt commented Apr 17, 2023

Excellent, thank you. I will give this a closer look when I'm in the office tomorrow 👍

mxmp210 commented Apr 18, 2023

Thanks, I'll wait for the insights.
Meanwhile, it seems there's a lock situation that might be causing those stalled connections.

Some topics are stuck [probably waiting for a lock to be released], i.e. not processing any writes to clients, and the subscription endpoint is returning empty results (i.e. Mercure isn't appending the connection to the subscriber list of certain topics) even when clients are connected. I'm appending one more snapshot with the latest values, this time with a goroutine profile as well. Hopefully this will tell us more about what's going on.

As of writing the stats are:
Memory occupied: 1.15 GB
Connections: 700K
Updates: 5.03M
Average connections: 700~800

Routines that are waiting [seems unrealistic, as there are not 2.7k active connections or even half of that]:
image

Go Routines :
pprof.caddy.goroutine.006.pb.gz

Memory Snapshot :
pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.024.pb.gz

Last Metrics :
image

mholt commented Apr 18, 2023

@mxmp210 Do you have a "full goroutine dump"? It'd be useful to get the stack traces for all goroutines during the leak. The binary profile you've provided is helpful, but AFAIK it only generates the chart (maybe it has a full stack dump, but I'm not sure how to read it -- I'd be surprised if it did, given the size of the file). I'd expect something in plain text in this case.

The profile is still helpful though: it points to lock contention in the mercure package's Dispatch() method, but which implementation of it, I'm not sure, as it's an interface method.

mholt commented Apr 18, 2023

If I were to guess, maybe the lock contention is in here:

mercure/subscriber.go

Lines 73 to 81 in a5d1ef8

s.outMutex.Lock()
defer s.outMutex.Unlock()

if atomic.LoadInt32(&s.disconnected) > 0 {
	return false
}

s.out <- u

return true

If so, perhaps the channel is full/blocked while the lock is held.
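
A minimal sketch of that interleaving (illustrative only; the names and types are invented for the example and this is not Mercure's actual code): one goroutine takes the mutex and then blocks on the channel send, while the goroutine that would close the channel waits forever for the same mutex.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var outMutex sync.Mutex
	out := make(chan int) // unbuffered: a send blocks until a receiver is ready

	// "Dispatch" path: takes the lock, then blocks on the send because nothing
	// is draining the channel anymore.
	go func() {
		outMutex.Lock()
		defer outMutex.Unlock()
		out <- 1
	}()

	time.Sleep(100 * time.Millisecond)

	// "Disconnect" path: needs the same lock before it can close the channel,
	// so it waits forever behind the blocked dispatch.
	done := make(chan struct{})
	go func() {
		outMutex.Lock()
		close(out)
		outMutex.Unlock()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("disconnected cleanly")
	case <-time.After(time.Second):
		fmt.Println("deadlock: the dispatcher holds the lock while blocked on the channel send")
	}
}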

mxmp210 commented Apr 19, 2023

@mholt Found a possible problem [stack trace attached at the end]:
The interface implementation is local_transport.go; we are not using bolt_transport, though both have similar implementations.

It might be due to the infamous HTTP timeouts and how they were implemented here to work around old bugs prior to Go 1.20. It seems the hardcoded timeout causes problems, although there might be more to the implementation, as it also happens before the 600 s timeout in some cases, where the writer trying to write to the connection gets lost (probably never returns) due to a poor connection:

mercure/hub.go

Line 295 in a5d1ef8

opt := &opt{writeTimeout: 600 * time.Second}

This can possibly lead to the server closing the connection to the current subscriber, calling the shutdown method and essentially closing the channel. While it's doing that, other updates may be dispatched, because that is how subscriptions work: they are trying to write to the channel, or are already writing.

There are three ways the connection might exit (a sketch of the loop shape follows below):

Write timeout [hardcoded to 600 s - always closes the loop - not configurable]:

case <-writeTimerC:

Heartbeat timeout [keeps the connection alive as long as it can write - line 203 always returns true / stays in that state without returning due to the zero-value problem - never closes the loop]:

case <-heartbeatTimerC:

Channel closed by the shutdown method [never closed if there's a write error, as the function never returns]:

case update, ok := <-s.Receive():
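
For reference, a minimal sketch of that loop shape (an illustration of the three exit paths listed above, with invented names and a fake write callback; it is not Mercure's actual subscribe code):

package main

import (
	"fmt"
	"time"
)

// serveSubscriber mimics the write loop: it exits on the write timeout, on a
// failed heartbeat write, or when the updates channel is closed by the
// disconnect path. If a failed write does not end the loop, the goroutine
// (and its connection) leaks.
func serveSubscriber(updates <-chan string, writeTimeout, heartbeat time.Duration, write func(string) bool) {
	writeTimer := time.NewTimer(writeTimeout)
	defer writeTimer.Stop()

	heartbeatTicker := time.NewTicker(heartbeat)
	defer heartbeatTicker.Stop()

	for {
		select {
		case <-writeTimer.C:
			fmt.Println("write timeout: close the connection")
			return
		case <-heartbeatTicker.C:
			if !write(":\n") { // SSE comment used as a heartbeat
				return
			}
		case u, ok := <-updates:
			if !ok {
				return // channel closed by the disconnect path
			}
			if !write(u) {
				return
			}
		}
	}
}

func main() {
	updates := make(chan string, 1)
	updates <- "data: hello\n\n"
	close(updates)

	serveSubscriber(updates, 2*time.Second, 500*time.Millisecond, func(s string) bool {
		fmt.Print(s)
		return true
	})
}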

If the subscriber is the same one, the exiting code tries to disconnect it by taking the lock:

mercure/subscriber.go

Lines 111 to 121 in a5d1ef8

func (s *Subscriber) Disconnect() {
	if atomic.LoadInt32(&s.disconnected) > 0 {
		return
	}

	s.outMutex.Lock()
	defer s.outMutex.Unlock()

	atomic.StoreInt32(&s.disconnected, 1)
	close(s.out)
}

And at the same time a scheduled goroutine is trying to write:

mercure/subscriber.go

Lines 57 to 82 in a5d1ef8

func (s *Subscriber) Dispatch(u *Update, fromHistory bool) bool {
	if atomic.LoadInt32(&s.disconnected) > 0 {
		return false
	}

	if !fromHistory && atomic.LoadInt32(&s.ready) < 1 {
		s.liveMutex.Lock()
		if s.ready < 1 {
			s.liveQueue = append(s.liveQueue, u)
			s.liveMutex.Unlock()

			return true
		}
		s.liveMutex.Unlock()
	}

	s.outMutex.Lock()
	defer s.outMutex.Unlock()

	if atomic.LoadInt32(&s.disconnected) > 0 {
		return false
	}

	s.out <- u

	return true
}

These two are likely to block each other, as the write path has no clue whether the subscriber is being disconnected. And because of a write error, the close might never end the loop. This could probably be handled correctly with error handling.
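
One possible shape to avoid that mutual blocking, as a hedged sketch only (it assumes dropping an update for a slow or closing subscriber is acceptable, and it is not the actual patch): never block on the channel send while the mutex is held.

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Subscriber is a stripped-down stand-in for the real type; only the fields
// needed for the example are kept.
type Subscriber struct {
	out          chan int
	outMutex     sync.Mutex
	disconnected int32
}

func (s *Subscriber) Dispatch(u int) bool {
	s.outMutex.Lock()
	defer s.outMutex.Unlock()

	if atomic.LoadInt32(&s.disconnected) > 0 {
		return false
	}

	select {
	case s.out <- u: // succeeds only if the reader is keeping up
		return true
	default: // never block while the mutex is held
		return false
	}
}

func (s *Subscriber) Disconnect() {
	s.outMutex.Lock()
	defer s.outMutex.Unlock()

	if atomic.CompareAndSwapInt32(&s.disconnected, 0, 1) {
		close(s.out)
	}
}

func main() {
	s := &Subscriber{out: make(chan int, 1)}
	fmt.Println(s.Dispatch(1)) // true: buffer has room
	fmt.Println(s.Dispatch(2)) // false: buffer full, dropped instead of blocking
	s.Disconnect()
	fmt.Println(s.Dispatch(3)) // false: already disconnected
}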

My question is: will HTTP/3 affect this implementation in any way? As you can see, the code is using HTTP/2 writers. In the early days there were problems with read/write timeouts, which ended up in custom implementations on top of the core logic, so maybe we now have a better way to address this?

Another bug: for some reason it is branching to L203 instead of L211, which suggests the copy of the hub received by the write function has zero values instead of the values coming from the default or Viper config.
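
A tiny illustration of that zero-value suspicion (the struct and field names are hypothetical, not the real hub type): if the write path receives a copy whose timeout fields were never populated from the config, it silently takes the "no deadline" branch.

package main

import (
	"fmt"
	"time"
)

// opts stands in for the hub options; the field names are invented for the example.
type opts struct {
	writeTimeout    time.Duration
	dispatchTimeout time.Duration
}

// write branches exactly like a timeout check would: a zero duration means
// "no deadline", so a zero-valued copy behaves very differently from the
// configured one.
func write(o opts) string {
	if o.dispatchTimeout == 0 {
		return "no per-write deadline (zero value)"
	}

	return fmt.Sprintf("write guarded by a %s deadline", o.dispatchTimeout)
}

func main() {
	configured := opts{writeTimeout: 600 * time.Second, dispatchTimeout: 5 * time.Second}
	var zeroCopy opts // what the write path would see if the config never reached it

	fmt.Println(write(configured)) // deadline branch
	fmt.Println(write(zeroCopy))   // zero-value branch
}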

Quoting relevant trace :

goroutine 4265103 [select, 10 minutes]:
net/http.(*http2serverConn).writeDataFromHandler(0xc0095a2680, 0xc00bc5b760, {0xc009727000, 0xc3, 0x1000}, 0x0)
	net/http/h2_bundle.go:4843 +0x205
net/http.(*http2responseWriterState).writeChunk(0xc004257e80, {0xc009727000, 0xc3, 0x1000})
	net/http/h2_bundle.go:6391 +0x4fb
net/http.http2chunkWriter.Write({0xc3?}, {0xc009727000?, 0x504a66?, 0xc00944a578?})
	net/http/h2_bundle.go:6264 +0x27
bufio.(*Writer).Flush(0xc009dd71c0)
	bufio/bufio.go:628 +0x62
net/http.(*http2responseWriter).FlushError(0xc00d008930?)
	net/http/h2_bundle.go:6528 +0x3c
net/http.(*http2responseWriter).Flush(0xc00944a5f8?)
	net/http/h2_bundle.go:6518 +0x19
github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush(...)
	github.com/caddyserver/caddy/v2@v2.6.4/modules/caddyhttp/responsewriter.go:51
github.com/dunglas/mercure.(*Hub).write(0xc00c1b3520?, {0x7f8967174268?, 0xc00cfbb0b0}, {0x1f58f80?, 0xc00955ef00?}, {0xc00017cc30?, 0x2c33f80?})
	github.com/dunglas/mercure@v0.14.6/subscribe.go:203 +0x3da
github.com/dunglas/mercure.(*Hub).SubscribeHandler(0xc000293410, {0x7f89671741d8?, 0xc00cfbb0b0}, 0xc00c483d00)
	github.com/dunglas/mercure@v0.14.6/subscribe.go:77 +0x53e
net/http.HandlerFunc.ServeHTTP(0xc00c483c00?, {0x7f89671741d8?, 0xc00cfbb0b0?}, 0xc0001a3500?)
	net/http/server.go:2122 +0x2f
github.com/gorilla/mux.(*Router).ServeHTTP(0xc0004a26c0, {0x7f89671741d8, 0xc00cfbb0b0}, 0xc00c483b00)
	github.com/gorilla/mux@v1.8.0/mux.go:210 +0x1cf
github.com/gorilla/handlers.(*cors).ServeHTTP(0xc000040510, {0x7f89671741d8, 0xc00cfbb0b0}, 0xc00c483b00)
	github.com/gorilla/handlers@v1.5.1/cors.go:138 +0x6c2
[stacktrace-2023-04-19-0235.txt](https://github.com/dunglas/mercure/files/11271703/stacktrace-2023-04-19-0235.txt)

Stacktrace :
stacktrace-2023-04-19-0235.txt.gz

mholt commented Apr 19, 2023

Excellent, thank you. Looks like @dunglas pushed a patch -- actually two. Do you want to give those a try?

Nice work both!!

mxmp210 commented Apr 20, 2023

Yes, I'm deploying those and will report back in a few days, since it was a slow memory leak. If everything is fine after running with no memory leaks, we can close the issue, as the patches should also resolve any POST errors due to blocked resources.

The OP can reopen if it's still an issue after confirming the fixes.

@mholt Thank you for the inputs!

rbrigot commented Apr 20, 2023

Hello,
Nice work! Thank you!
Do you know when these fixes will be available as a Docker image?

dunglas (Owner) commented Apr 20, 2023

Thank you very much for the detailed report and the research everybody. This helped a lot. The main version now contains all the fixes. I'm working on some other slightly related fixes and improvements, and if everything goes smoothly I'll tag a new release soon.

mholt commented Apr 20, 2023

(Mercure's one of my favorite Caddy plugins, but don't tell anyone I said that)

mxmp210 commented Apr 24, 2023

@dunglas Update: it has been stable for 4 days and has reached the previous thresholds in updates & subscriptions.
There are no rising graphs or stuck goroutines slowing down the server in the stack trace.

Also, average memory usage has been halved thanks to proper connection cleanup, which is a plus.

image

dunglas (Owner) commented May 17, 2023

Fixed in Mercure 0.14.7.

dunglas closed this as completed on May 17, 2023.