Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scalability of beacons, max-timeouts, and distributed systems #45

Closed
oscarhermoso opened this issue May 11, 2024 · 6 comments
Closed

Scalability of beacons, max-timeouts, and distributed systems #45

oscarhermoso opened this issue May 11, 2024 · 6 comments
Labels
seen I've seen and read this issue and I'll try find some time soon to work on it.

Comments

@oscarhermoso
Copy link
Contributor

Issues

There are a few issues with the current sveltekit-sse beacon/timeout implementation that make it difficult to use in large production systems:

Must fix

  1. The server-side maps timeouts and locks in events.js will only grow in memory, as expired keys are never removed
  2. Beacon timeout does not work as soon as you have multiple SvelteKit Node instances behind a load balancer, as there is no gurantee the beacon hits the same backend to extend the timeout.

Nice to have

Below issues can be solved in user-land, so definitely not critical, but raising for discussion anyway:

  1. Would be helpful to add a new maxTimeout or similar due to proxies like Cloudflare expecting responses to complete within 100 seconds.
  2. Consider adding some randomness to the beacon and maxTimeout parameters to prevent thundering-herd scenarios?
  3. Consider increasing the default timeout and beacon values. I have not been able to find a definitive guide on what is "best practice", but something like beacon: 30, timeout: 45 seems to be appropriate?

Fix

Problems 1 and 2 can be solved by following the approach of sveltekit-rate-limiter and use @isaacs/ttlcache instead of Map, with an option to swap out the cache backend.

Alternatively, could use setTimeout etc. to clear expired map entries but the option to swap out the cache backend would still be necessary for distributed systems.

If you are happy for the library to support a maxTimeout parameter, it would be good if it used the same cache implementation too.

4/5 can be split into different GitHub issues for discussion but felt they were related enough to be raised in the same ticket.

Workaround

I can work around most of these problems for now by simply setting beacon: 0, timeout: 60 + Math.random() * 30, alongside a nightly restart on my server to reset the memory usage.

@razshare razshare added the seen I've seen and read this issue and I'll try find some time soon to work on it. label May 11, 2024
@razshare
Copy link
Owner

razshare commented May 11, 2024

The server-side maps timeouts and locks in events.js will only grow in memory, as expired keys are never removed

This is a bug, I'll fix it soon.

Would be helpful to add a new maxTimeout or similar due to proxies like Cloudflare expecting responses to complete within 100 seconds.

I'll be honest, I don't think something like maxTimeout belongs into this library's api.
Userland should know what these values are for their specific provider.
Also having both timeout and maxTimeout is very confusing, it's like having a maxMaxNumber of a maxNumber, it gets pretty ridiculous pretty fast.

I do agree with this though:

Consider adding some randomness to the beacon and maxTimeout parameters to prevent thundering-herd scenarios?

I think a new beaconVariance property is best fit for this.
It would vary the actual frontend beacon by beaconVariance milliseconds, to avoid the thundering-herd effect.

Consider increasing the default timeout and beacon values. I have not been able to find a definitive guide on what is "best practice", but something like beacon: 30, timeout: 45 seems to be appropriate?

Sure, will do.

Problems 1 and 2 can be solved by following the approach of sveltekit-rate-limiter and use @isaacs/ttlcache instead of Map, with an option to swap out the cache backend.

No need for more dependencies, we already know when the server side stream stops
https://github.com/tncrazvan/sveltekit-sse/blob/d942a72807e34e7d9c747e3ec89d8fd0a3e2891c/src/lib/events.js#L134-L140

We just need to delete the map entry.

Main issue

Beacon timeout does not work as soon as you have multiple SvelteKit Node instances behind a load balancer, as there is no gurantee the beacon hits the same backend to extend the timeout.

This is an architecture issue.
Cloudflare specifically doesn't allow you to be deterministic on which servers to hit up.
It's been a while, but I think they do have something similar to an affinity cookie like Azure web apps, but it's not deterministic, it only allows you to specify a group of nodes to hit, instead of one specific node, like Azure does for example.

Take a look at this https://learn.microsoft.com/en-us/azure/application-gateway/configuration-http-settings#cookie-based-affinity

I really don't understand why Cloudflare doesn't allow for such configuration, it's a no brainer.
I can understand why the default would be non deterministic, perhaps because it would require much more powerful hardware and what not, it wouldn't make sense for their pricing, but at least an option to be deterministic at a higher pricing would be nice.

I did bump into a similar issue in the past, although we were having a slightly different use case, not SSE.
It was honestly pretty easy to fix by using cookie based affinity, we even managed to stick specific users to specific machines in order to avoid persisting user sessions to the database, but keep them in memory instead.

You can even hop between sessions if you save your affinity cookie and swap the cookie by using your own "gateway" let's say.
I put it in quotes because all it takes is just to use http headers to redirect your users with your specific affinity cookie to the main application.

There's not much I can do to help you with this part, other than: try other providers.

That being said

I might have found a way to deal away completely with beacons.

It doesn't fix your main issue regarding nodes, but it will simplify the api a lot.

PS

No need to split this into multiple issues.

@razshare
Copy link
Owner

I just released patch 0.12.13, which addresses the beacon variance issue and the memory leak.

Breaking changes will come next in order to drop beacons.

@razshare
Copy link
Owner

razshare commented May 11, 2024

Hello @oscarhermoso ,

I just released version 0.13.0.

As mentioned before - beacons are gone.

You should check the docs, the api has changed a bit.

The new mechanism through which we detect if a client is still connected or not doesn't rely on the client itself at all anymore.
So your node lookup issue shouldn't be a problem anymore, because there is no client that has to find the right machine to hit.

The detection is done by the server alone now.

Other than the data of your nodes, which you should centralize somehow (maybe through a database) everything should be consistent in a serverless environment now.

I think this should fix your main issue, can you confirm?

@oscarhermoso
Copy link
Contributor Author

I have skimmed through the code, this looks like a really good change.

I am a little too familiar with Azure's session affinity on their App Service / App Gateway / Front Door resources 😅

It is useful/necessary for a lot of legacy apps, but in the long run I have found that it hurts more than it helps. It reduces the effectiveness of load balancing, slows down connection draining, and sometimes developers over-rely on it, resulting in broken user-experiences when backend nodes are swapped out or restarted...

It is a massive improvement to remove beacons and have sveltekit-sse work in a serverless environment. Bravo 👏

I will be able to test the implementation next weekend, will leave the ticket open until then.

@razshare
Copy link
Owner

I'm closing this issue due to inactivity.
Feel free to open a new one if you're encountering more issues or have some feedback.

@oscarhermoso
Copy link
Contributor Author

Thanks @razshare, much appreciated again

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
seen I've seen and read this issue and I'll try find some time soon to work on it.
Projects
None yet
Development

No branches or pull requests

2 participants