MercureHub on Timeout #758
Comments
That's weird. That may (?) be a deadlock? What happens if you restart the hub? If you can reproduce, could you enable the debug mode and start a profile to see what's going on? https://mercure.rocks/docs/hub/debug |
We already restarted the hub, which solved the problem, but it will probably return... Thanks for the tip, we just enabled debug mode; it should help when it happens again. |
I can confirm something similar as well: if the hub has been running for a long time (> 1 week), it slows down significantly and requests take forever to process - restarting the hub resets everything. This has happened quite a few times recently on a chat app I'm working on. We have quite a lot of updates happening all the time. The local transport is used, so it is just a relay machine - no disk is involved in the config (i.e. no Bolt transport). In our case publishing and subscribing still work, but request/update times are > 6s to > 20s, which suggests either Caddy or Mercure is having internal resource issues, or we are missing something in the config that's keeping the hub busy (which is what I'm trying to investigate). I was reluctant to report the issue until I'd figured out whether it was the config, but it seems other users are seeing the same outcome. |
@mxmp210 could you take a memory profile when it happens? This would help a lot: https://mercure.rocks/docs/hub/debug |
Could you also enable the metrics and share the data you get before and after the restart please? |
@dunglas We are already collecting metrics, which show the correlation between connections, memory, and subscriptions. Enabling debug in production would be a bit hard - the original hub was moved yesterday without my knowledge, so it is going to take some time before it starts to congest again after several days of running. Here are memory profiles from the metrics collected over the past 15 days, with 3 hub restarts when it became too slow. Go idle bytes [this is an interesting one, as we see there is a lot of memory sitting around while subscriptions come and go]: Goroutines [for the number of updates processed - the peaks are the system trying to push all pending updates] Some other stats: If you need more detailed data, let me know. |
One note: there are times when some connections do not time out (usually 10-50 connections) - there's no proxy involved, and Caddy communicates directly with the outside world, running natively on the machine. Then there are times when this number keeps increasing without any notice. The server has had the same amount of connections for the past few months - peaking at certain hours and settling down to an average number of connections that constantly hit Mercure with changing subscriptions. It probably started happening after the internal Caddy version was changed - though we compile from source, so the latest Caddy is always used. |
Thanks for the details! Could you tell us the exact version of Caddy you use? (cc @mholt @francislavoie, as it looks related to a recent change in Caddy.) |
Thanks; could we have some more information: What Caddy version? Did this start happening after an upgrade (from which version)? Are these websockets (I know Mercure uses SSE -- but I'm asking anyway)? What is the Caddy config? How can we minimally reproduce this? |
@mholt Currently deployed Caddy version: the deployment has all standard Caddy modules, with Mercure as a custom module ...
Standard modules: 100
http.handlers.mercure
Non-standard modules: 1
Unknown modules: 0
Caddy config, loaded in JSON format to omit INFO messages:
{
"logging": {
"logs": {
"default": {
"exclude": [
"http.log.access.logFiltered",
"http.handlers.mercure"
]
},
"logFiltered": {
"level": "ERROR",
"include": [
"http.log.access.logFiltered"
]
}
}
},
"apps": {
"http": {
"servers": {
"srv0": {
"listen": [
":443"
],
"routes": [
{
"match": [
{
"host": [
"<domain_name>"
]
}
],
"handle": [
{
"handler": "subroute",
"routes": [
{
"handle": [
{
"handler": "subroute",
"routes": [
{
"handle": [
{
"cors_origins": [
"<domain_name>"
],
"handler": "mercure",
"publish_origins": [
"*"
],
"publisher_jwt": {
"key": "!ChangeMe!"
},
"subscriber_jwt": {
"key": "!ChangeMe!"
},
"transport_url": "local://local",
"subscriptions": true,
"write_timeout": 0
}
]
},
{
"handle": [
{
"handler": "subroute",
"routes": [
{
"handle": [
{
"handler": "metrics"
}
]
}
]
}
],
"match": [
{
"path": [
"/metrics"
]
}
]
},
{
"handle": [
{
"handler": "static_response",
"status_code": 200
}
],
"match": [
{
"path": [
"/healthz"
]
}
]
},
{
"handle": [
{
"body": "Not Found",
"handler": "static_response",
"status_code": 404
}
]
}
]
},
{
"handler": "file_server",
"hide": [
"/etc/caddy/CaddyFile.json"
]
}
]
}
]
}
],
"terminal": true
}
],
"logs": {
"logger_names": {
"<domain_name>": "logFiltered"
}
}
}
}
}
}
}
Note: in an earlier build the hub failed to start with a larger write_timeout, so it was set to zero. The new one follows the same suit as given in the config, which keeps the connection open unless interrupted. This could be a possible issue with open & stalled connections, but we have always had a difference of < 100 connections as traffic scales.
How to minimally reproduce it?
For a Gatling test it can translate to the following config [not tested for a long time]:
/** Number of concurrent subscribers initially connected */
val InitialSubscribers =
Properties.envOrElse("INITIAL_SUBSCRIBERS", "10").toInt
/** Additional subscribers rate (per second) */
val SubscribersRateFrom =
Properties.envOrElse("SUBSCRIBERS_RATE_FROM", "2").toInt
val SubscribersRateTo =
Properties.envOrElse("SUBSCRIBERS_RATE_TO", "5").toInt
/** Publishers rate (per second) */
val PublishersRateFrom =
Properties.envOrElse("PUBLISHERS_RATE_FROM", "8").toInt
val PublishersRateTo = Properties.envOrElse("PUBLISHERS_RATE_TO", "10").toInt
/** Duration of injection (in seconds) */
val InjectionDuration =
Properties.envOrElse("INJECTION_DURATION", "600").toInt
/** How long a subscriber can stay connected at max (in seconds) */
val ConnectionDuration =
Properties.envOrElse("CONNECTION_DURATION", "7200").toInt Though tests are not that smart creating real-world load situations as it only published to one topic and multiple subscribers are listening to same resource. As this happens over very long time period after several million messages and few thousand subscriptions - something similar is required to reproduce accurate results. |
@mxmp210 Thanks, that's helpful. That's a lot of time and work to repro, but I'll try to go with the information provided so far:
I'm not quite sure I understand; do you mean you were using 2.6.3 and it was working fine, but then in 2.6.4 the issue started happening? Also, do any of these requests use a Chunked encoding? (Transfer-Encoding: chunked) Or, are the backends that the events are coming from using php/fastcgi? |
@mxmp210 Are you sure that you're using WebSockets? That's very unlikely: Mercure uses SSE, which is a totally different technology from WebSockets. |
@mholt It might not be related to the latest updates but to other parts of Caddy or Mercure that handle the connection pool - we are not using any compression, as you can see from the negotiation headers between client and server (i.e. no ...). The last crash had about 3M updates and more than 2.5M subscriptions handled before it started to act slow. @dunglas Calling it WebSockets would be a bit confusing; it is an event stream sent over a raw connection - it is indeed SSE and not to be confused with other socket-based tech. Note that this happens to POST requests only; they become too slow (2s to 6s per request). Once Caddy is able to take the request, sending to the clients is done instantly (i.e. the Mercure part dispatches updates to topics instantly). We have also hit the metrics and other endpoints, which return without delays. |
@mholt @dunglas I have an update to share, with debug profiles - as expected, Mercure tripped again after several days of usage. Once the hub starts to leak memory it creates stalled connections; the subscription API does not report them, but the metrics API does count those hidden connections - so we know they are there in memory. This morning we let it run to occupy more memory, and connections had not slowed down yet. It happens when we get near 1K connections and a significant amount of updates. Here's the common pattern observed from the metrics [posting it here so others can comment on it]:
Note for consideration: we have an identical traffic pattern every day, with a similar amount of connections. [See metrics below.]
Debug profile after 1 day of the new deployment: pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz
Debug profile after a few hours of memory leak: pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.005.pb.gz
Debug profile after 18 hours of memory leak: pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.007.pb.gz
I'm not sure if this is correct, but the following part keeps increasing after the leak occurs:
This morning [about 500MB occupied after the leak started]:
As of writing, after about 18 hours [860MB+ occupied, and we are not yet at peak hours]:
========================================================
Metrics - for comparison, GREEN is the OLD hub with no write_timeout, YELLOW is the new hub with write_timeout set to 5 minutes as per @dunglas' suggestion:
Subscriptions handled [HTTP requests in terms of SSE handled]
Connection pattern - i.e. active HTTP connections on the server |
Excellent, thank you. I will give this a closer look when I'm in the office tomorrow 👍 |
Thanks, I'll wait for the insights. Some topics are stuck [probably waiting for a lock to be released], i.e. not processing any writes to clients, and the subscription endpoint is returning empty results (i.e. Mercure isn't appending the connection to the subscriber list of certain topics) even when clients are connected. I'm appending one more snapshot with the latest values, this time with a goroutine profile as well. Hope this will tell us more about what's going on. As of writing, the stats are:
Routines that are waiting [seems unrealistic, as there are not 2.7k active connections or even half of that]:
Goroutines:
Memory snapshot: |
@mxmp210 Do you have a "full goroutine dump"? It'd be useful to get the stack traces for all goroutines during the leak. The binary profile you've provided is helpful but only generates the chart AFAIK (maybe it has a full stack dump but I'm not sure how to read it -- I would be surprised if it does though given the size of the file). I'd expect something in plain text in this case. The profile is still helpful though, it points to lock contention in the mercure package's |
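For reference (not part of the original thread): the full goroutine dump @mholt is asking for is the plain-text output of the goroutine pprof endpoint with ?debug=2. A minimal sketch, assuming the hub's pprof handlers are reachable on Caddy's default admin address localhost:2019 (adjust the URL to your deployment):

// Hypothetical helper: save a plain-text dump of every goroutine's stack trace.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// debug=2 returns full stack traces for all goroutines as plain text.
	resp, err := http.Get("http://localhost:2019/debug/pprof/goroutine?debug=2")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	f, err := os.Create("goroutines.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := io.Copy(f, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("full goroutine dump written to goroutines.txt")
}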
If I were to guess, maybe the lock contention is in here: Lines 73 to 81 in a5d1ef8
If so, perhaps the channel is full/blocked while the lock is held. |
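To make the suspected pattern concrete (simplified, made-up types rather than the actual Mercure code): if an update is sent on a subscriber's channel while a mutex is held and nothing is draining that channel any more, every goroutine that needs the same mutex - including the one trying to disconnect the subscriber - blocks behind the send.

// Illustrative sketch only. Dispatch blocks inside the critical section when the
// channel is full, so Disconnect can never acquire the lock and clean up.
package example

import "sync"

type subscriber struct {
	mu           sync.Mutex
	out          chan string // small buffer, e.g. make(chan string, 1)
	disconnected bool
}

func (s *subscriber) Dispatch(update string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.disconnected {
		return
	}
	s.out <- update // blocks while holding mu if nobody reads from out
}

func (s *subscriber) Disconnect() {
	s.mu.Lock() // stuck forever if Dispatch is blocked on the send above
	defer s.mu.Unlock()
	s.disconnected = true
	close(s.out)
}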
@mholt Found a possible problem [stack trace attached at the end]: it might be because of the infamous HTTP timeouts and how they were implemented here to take care of old bugs prior to Go 1.20. It seems like the hardcoded timeout causes problems - although there might be more to the implementation, as it also happens before the 600s timeout in some cases where the writer is trying to write to the connection and gets lost (probably never returning) due to a poor connection: Line 295 in a5d1ef8
This can lead to the server closing the connection to the current subscriber - calling the shutdown method and essentially closing the channel - and while it's doing that, other updates may be dispatched (because that is how subscriptions work), trying to write to the channel or already writing. There are three ways the connection might exit: Line 64 in a5d1ef8
Heartbeat timeout [keeps it alive as long as it can write - line 203 always returns true / stays in that state without returning because of the zero-value problem - so it never closes the loop]: Line 70 in a5d1ef8
Channel closed by the shutdown method [never closed if there's a write error, as the function never returns]: Line 76 in a5d1ef8
If the subscriber is the same one, the exiting code tries to disconnect it by locking: Lines 111 to 121 in a5d1ef8
And at the same time it is trying to write from the scheduled routine: Lines 57 to 82 in a5d1ef8
These are likely to block each other, as the write function has no clue whether the subscriber is disconnected or not, and because of the write error the close might never end the loop. This could be handled with proper error handling (a sketch of that direction follows the stack trace below). My question is: will HTTP/3 affect this implementation in any way? As you can see, the code is using HTTP/2 writers - in the early days there were problems with read/write timeouts, which ended up in custom implementations over the core logic, so maybe we now have a better way to address this? Another bug: for some reason it is branching to L203 instead of L211, which suggests the copy of the hub received by the write function has zero values instead of the one coming from the default or Viper config. Quoting the relevant trace:
Stacktrace : |
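Not the actual patch that followed, just a sketch of the error-handling direction described above, with hypothetical names: the write loop returns on the first failed write instead of staying in the loop, and disconnection is signalled through a dedicated done channel so dispatching to a dead subscriber can never block forever.

// Illustrative sketch: exit the write loop on write errors and guard dispatch
// with a done channel, so a broken connection cannot wedge the hub.
package example

import (
	"net/http"
	"sync"
	"time"
)

type subscriber struct {
	out       chan string
	done      chan struct{}
	closeOnce sync.Once
}

// Disconnect is safe to call from any goroutine, any number of times.
func (s *subscriber) Disconnect() {
	s.closeOnce.Do(func() { close(s.done) })
}

// Dispatch never blocks forever: it gives up once the subscriber is gone.
func (s *subscriber) Dispatch(update string) bool {
	select {
	case s.out <- update:
		return true
	case <-s.done:
		return false
	}
}

// writeLoop streams updates and heartbeats, and returns (triggering cleanup)
// on the first write error instead of looping on a dead connection.
func writeLoop(w http.ResponseWriter, f http.Flusher, s *subscriber, heartbeat time.Duration) {
	defer s.Disconnect()
	ticker := time.NewTicker(heartbeat)
	defer ticker.Stop()
	for {
		select {
		case <-s.done:
			return
		case <-ticker.C:
			if _, err := w.Write([]byte(":\n")); err != nil {
				return // connection is gone: stop instead of waiting forever
			}
			f.Flush()
		case update := <-s.out:
			if _, err := w.Write([]byte("data: " + update + "\n\n")); err != nil {
				return
			}
			f.Flush()
		}
	}
}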
Excellent, thank you. Looks like @dunglas pushed a patch -- actually two. Do you want to give those a try? Nice work both!! |
Yes, I'm deploying those and will report back in a few days, since it was a slow memory leak - if everything is fine after running with no memory leaks, we can close the issue, as it should also resolve any POST errors caused by blocked resources. The OP can reopen if it's still an issue after confirming the fixes. @mholt Thank you for the input! |
Hello, |
Thank you very much for the detailed report and the research everybody. This helped a lot. The |
(Mercure's one of my favorite Caddy plugins, but don't tell anyone I said that) |
@dunglas Update: it has been stable for 4 days and has reached the previous thresholds in updates & subscriptions. Also, average memory usage has been halved thanks to proper connection cleanup, which is a plus. |
Fixed in Mercure 0.14.7. |
Hello Mercure team and the community,
We installed and integrated it in a Docker container, and we are now at the testing stage, close to going live.
Everything was OK until we faced an incomprehensible bug, and we need your help to understand it:
Mercure completely stopped processing our data; it is impossible to push any message to the Mercure hub. Our publisher workers are hitting a timeout. The Mercure container is still up, and the Caddy healthz route responds with a 200.
The only log we got is:
{"level":"debug","ts":1680709312.2403877,"logger":"http.handlers.mercure","msg":"write timeout: close the connection","subscriber":{"id":"urn:uuid:5fd605c2-7762-4819-839d-a3eca847488a","last_event_id":""...
There is no way to know why our Mercure hub is not reachable.