
implement clustering for horizontal scalability #1532

Open
slingamn opened this issue Feb 10, 2021 · 15 comments

@slingamn
Member

Motivation

The original model of IRC as a distributed system was as an open federation of symmetrical, equally privileged peer servers. This model failed almost immediately with the 1990 EFnet split. Modern IRC networks are under common management: they require all the server operators to agree on administration and policy issues. Similarly, the model of IRC as an AP system (available and partition-tolerant) failed with the introduction of services frameworks. In modern IRC networks, the services framework is a SPOF: if it fails, the network remains available but is dangerously degraded (in particular, its security properties have been silently weakened).

The current Oragono architecture is a single process. A single Oragono instance can scale comfortably to 10,000 clients and 2,000 clients per channel: you can push those limits if you have bigger hardware, but ultimately the single instance is a serious bottleneck. The largest IRC network of all time was 2004-era Quakenet, with 240,000 concurrent clients. Biella Coleman reports that the largest IRC channel of all time had 7,000 participants (#operationpayback on AnonOps in late 2010). This gives us our initial scalability targets: 250,000 concurrent clients with 10,000 clients per channel.

Oragono's single-process architecture offers compelling advantages in terms of flexibility and pace of development; it's significantly easier to prototype new features without having to worry about distributed systems issues. The architecture that balances all of these considerations --- acknowledging the need for centralized management, acknowledging the indispensability of user accounts, providing the horizontal scalability we need, and minimizing implementation complexity --- is a hub-and-spoke design.

Design

The oragono executable will accept two modes of operation: "root" and "leaf". The root mode of operation will be much like the present single-process mode. In leaf mode, the process will not have direct access to a config file or a buntdb database: it will take the IP (typically a virtual IP of some kind, as in Kubernetes) of the root node, then connect to the root node and receive a serialized copy of the root node's authoritative configuration over the network.
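
To make the leaf bootstrap concrete, here is a minimal sketch of a leaf fetching the root's serialized configuration at startup. Everything in it (the JSON-over-TCP handshake, the LeafConfig fields, the fetchRootConfig name, the example address) is a hypothetical illustration, not a decided format:

// Hypothetical sketch: a leaf node dials the root's virtual IP and
// receives a serialized copy of the authoritative configuration.
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net"
)

// LeafConfig is an illustrative stand-in for whatever subset of the
// root's config a leaf actually needs.
type LeafConfig struct {
    NetworkName string   `json:"network-name"`
    ISupport    []string `json:"isupport"`
    MOTD        []string `json:"motd"`
}

func fetchRootConfig(rootAddr string) (*LeafConfig, error) {
    conn, err := net.Dial("tcp", rootAddr)
    if err != nil {
        return nil, fmt.Errorf("connecting to root: %w", err)
    }
    defer conn.Close()

    // the root is assumed to push its config as a single JSON document
    var config LeafConfig
    if err := json.NewDecoder(conn).Decode(&config); err != nil {
        return nil, fmt.Errorf("decoding root config: %w", err)
    }
    return &config, nil
}

func main() {
    // in Kubernetes this would typically be a Service's virtual IP
    config, err := fetchRootConfig("10.0.0.1:6800")
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("joined network %s", config.NetworkName)
}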

Clients will then connect to the leaf nodes. Most commands received by the leaf nodes will be serialized and passed through to the root node, which will process them and return an answer. A few commands, like capability negotiation, will be handled locally. (This corresponds loosely to the Session vs. Client distinction in the current codebase: the leaf node will own the Session and the root node will own the Client. Anything affecting global server state is processed by the root; anything affecting only the client's connection is processed by the leaf.)
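
As an illustration of the local-versus-forwarded split, the leaf-side routing decision might look something like this; the set of locally handled commands and all names here are assumptions made for the example, not the actual codebase:

// Hypothetical sketch of the leaf-side routing decision: commands that
// only touch connection-local state (the Session) are handled on the
// leaf; anything touching global server state is serialized and
// forwarded to the root (which owns the Client).
package main

import "fmt"

// commands the leaf can answer without consulting the root
var handledLocally = map[string]bool{
    "CAP":  true, // capability negotiation
    "PING": true,
    "PONG": true,
}

func routeCommand(command string, params []string) {
    if handledLocally[command] {
        fmt.Printf("leaf handles %s locally\n", command)
        return
    }
    // serialize and pass through to the root, which processes the
    // command against global state and returns an answer
    fmt.Printf("forwarding %s %v to root\n", command, params)
}

func main() {
    routeCommand("CAP", []string{"LS", "302"})
    routeCommand("PRIVMSG", []string{"#chat", "hello"})
}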

The root node will unconditionally forward copies of all IRC messages (PRIVMSG, NOTICE, KICK, etc.) to each leaf node, which will then determine which sessions are eligible to receive them. This provides the crucial fan-out that generates the horizontal scalability: the traffic (and TLS) burden is divided evenly among the n leaf nodes.
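
A rough sketch of that leaf-side fan-out, under the stated assumptions (the root broadcasts every channel message to every leaf; the leaf filters against its own local session table); the types and method names are illustrative only:

// Hypothetical sketch of leaf-side fan-out: the root sends every
// message to every leaf unconditionally, and the leaf delivers it
// only to its locally connected, eligible sessions.
package main

import "fmt"

// Session is a stand-in for a locally connected client connection.
type Session struct {
    nick string
    send chan string
}

// Leaf tracks which of its local sessions are in which channels.
type Leaf struct {
    channelMembers map[string][]*Session // channel name -> local members
}

// deliverToChannel is invoked for every channel message the root
// broadcasts; only local sessions actually in the channel receive it.
func (l *Leaf) deliverToChannel(channel, line string) {
    for _, session := range l.channelMembers[channel] {
        select {
        case session.send <- line:
        default:
            // the session's send buffer is full; a real implementation
            // would apply its usual backpressure policy here
        }
    }
}

func main() {
    alice := &Session{nick: "alice", send: make(chan string, 1)}
    leaf := &Leaf{channelMembers: map[string][]*Session{"#chat": {alice}}}
    leaf.deliverToChannel("#chat", ":bob PRIVMSG #chat :hi")
    fmt.Println(<-alice.send)
}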

Unresolved questions

The intended deployment strategy for this system is Kubernetes. However, I don't currently have a complete picture of which Kubernetes primitives will be used. The key assumption of the design is that the network can be virtualized such that the leaf nodes only need to know a single, consistent IP address for the root node.

I'm not sure how to do history. The simplest architecture is for HISTORY and CHATHISTORY requests to be forwarded to the root, but this seems like it will result in a scalability bottleneck. In a deployment that uses MySQL, the leaf nodes can connect directly to MySQL; this reduces the problem to deploying a highly available and scalable MySQL, which is nontrivial. The logical next step would seemingly be to abstract away the historyDB interface, then provide a Cassandra implementation --- the problem with this is that we seem to be going in the direction of expecting read-after-write consistency from the history store (see this comment on #393 in particular).
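
For concreteness, abstracting the history store might mean carving out an interface along these lines, behind which the existing MySQL backend and a hypothetical Cassandra backend could both sit. The method set shown is an assumption for illustration, not the current historyDB API:

// Hypothetical sketch of an abstracted history store, so that the
// in-memory and MySQL backends and a possible Cassandra backend
// could all sit behind the same interface, queried directly by leaves.
package history

import "time"

// Item is a stand-in for a stored history entry.
type Item struct {
    Nick    string
    Target  string
    Message string
    Time    time.Time
}

// Store is the seam the leaf nodes would talk to directly for
// CHATHISTORY queries, bypassing the root.
type Store interface {
    // Add appends an item to a target's history.
    Add(target string, item Item) error
    // Between returns up to limit items for a target within a time
    // window, newest first.
    Between(target string, start, end time.Time, limit int) ([]Item, error)
}

A backend with only eventual consistency (Cassandra at relaxed consistency levels, for instance) would satisfy such an interface but not the read-after-write expectation discussed in #393, which is exactly the tension described above.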

See also

This supersedes #343, #1000, and #1265; it's related to #747.

@slingamn slingamn added this to the v2.6 milestone Feb 10, 2021
@DanielOaks
Member

DanielOaks commented Feb 11, 2021

I like the sound of most of that proposal.

So, what sort of info usually differs for each server on the network, in traditional IRC nets?

  • server name (blah.oragono.io, aus.oragono.io, etc) - I do like this, if nothing else because it helps users work out where they are, and when they start running into any trouble they can explicitly know which server they're on and try others.
  • listening ports/setups? the exposed ports shouldn't change ofc, but I'm considering it for more in-depth or precise routing setups. this probably isn't necessary at all?

anything else?


Optimally we wouldn't send an IRC-based line protocol on the wire and would do something more efficient; if we do need to, that's fine. Optimally we'll be able to use the same fairly simple command handlers we use now to parse incoming client messages.


The root node will unconditionally forward copies of all IRC messages (PRIVMSG, NOTICE, KICK, etc.) to each leaf node, which will then determine which sessions are eligible to receive them. This provides the crucial fan-out that generates the horizontal scalability: the traffic (and TLS) burden is divided evenly among the n leaf nodes.

I'm not super familiar with how scalability is affected by different factors, but is there any reason why a leaf that has e.g. 20 clients on it would be more stable if it gets the full 50k-client traffic piped to it vs getting a smaller amount of traffic if the hub has some logic to work out which leaves should get which traffic? Or is it something like 'reducing the burden on the hub (not needing to keep in mind which leaves have which clients, not needing to spend time directing messages) provides more benefit to the network as a whole than reducing the burden on specific leaf connections'?


Just to make sure that we're on the same page - if Ora is in hub mode it'll accept connections from oragono servers but not clients, and if Ora is in leaf mode it'll accept connections from clients only, yeah? The standard IRC paradigm has this feature where users can connect to hubs, which is sometimes useful for opers sitting on those hubs to e.g. fix the network or stuff like that, and sometimes you'll see servers that both accept clients (y'know, sitting there with a few thousand normal clients connected) and also act as the hub for 2-3 other leaf servers. It probably makes sense for us to just have a clean split: either hub or leaf, and that means either accept servers or accept clients, done.


Thoughts on the 'server name' stuff again: this'd probably make sense as part of the connection authentication stuff that leaves do to auth to the hub, e.g. something along the lines of this in the hub config:

leaves:
    "password-hash-here":
        name: aus.oragono.io
        connects-from: 184.5.6.78/24
        ... blahblahblah ...

or just a more normal-looking

leaves:
    aus.oragono.io:
        password: "password-hash-here"
        connects-from: 184.5.6.78/20
        ... blahblahblah ...

Or are we gonna go the no-passwords-to-connect route and tell everyone to set up internal VPNs and junk to expose the hub to the leaves or something instead?

@slingamn
Member Author

  1. I wasn't planning to expose server names for the leaf nodes --- they're "cattle", not "pets", in k8s parlance. (In particular, they might be automatically spun up and down by an autoscaler.)
  2. I'm considering serialized wire formats that aren't IRC for the leaf-to-root links, yeah. In particular there is a need to attach metadata (like a session UUID) to the line; this could be done with tags, but it seems like an abuse. (A rough sketch follows this list.)
  3. The objective of unconditionally forwarding all messages to all leaves is to reduce the burden on the hub, yeah. (Also in terms of implementation complexity: it removes a potential class of bugs where the hub is confused about what clients are where.)
  4. I think it would be potentially OK for the hub to accept C2S connections, but we'll have to see how the implementation shakes out.
  5. Yeah, I was planning to leave all authentication out of scope and rely on internal cloud networking or VPNs. I think if we don't do that, the next option is probably MTLS with a private CA.
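
On point 2, a non-IRC envelope could be as simple as wrapping the raw client line together with its routing metadata. The field names and the use of JSON below are assumptions for illustration, not a decided wire format:

// Hypothetical sketch of a leaf-to-root envelope: the raw IRC line
// plus the metadata the root needs to route the response, instead of
// abusing IRCv3 message tags for that purpose.
package main

import (
    "encoding/json"
    "fmt"
)

type Envelope struct {
    SessionID string `json:"session_id"` // UUID of the originating session
    LeafID    string `json:"leaf_id"`    // which leaf the session lives on
    Line      string `json:"line"`       // the raw IRC line as received
}

func main() {
    env := Envelope{
        SessionID: "5b8c0f6e-0000-0000-0000-000000000000",
        LeafID:    "leaf-3",
        Line:      "PRIVMSG #chat :hello",
    }
    encoded, _ := json.Marshal(env)
    fmt.Println(string(encoded))
}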

@slingamn
Member Author

Something I hadn't considered adequately before: this design requires the leaf nodes to have an up-to-date view of channel memberships. This shouldn't be too difficult but it introduces a potential class of desync bugs.
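
To make the desync risk concrete, the root would presumably have to stream membership deltas that each leaf applies to its local view; any missed or reordered update is exactly the class of bug in question. A minimal, purely illustrative sketch:

// Hypothetical sketch of the membership updates a leaf would apply to
// keep its view of channel memberships current.
package main

import "fmt"

type MembershipUpdate struct {
    Channel string
    Nick    string
    Joined  bool // true for join, false for part/kick/quit
}

// memberSet maps channel name -> set of member nicks, as seen by the leaf.
type memberSet map[string]map[string]bool

func (m memberSet) apply(u MembershipUpdate) {
    if m[u.Channel] == nil {
        m[u.Channel] = make(map[string]bool)
    }
    if u.Joined {
        m[u.Channel][u.Nick] = true
    } else {
        delete(m[u.Channel], u.Nick)
    }
}

func main() {
    members := make(memberSet)
    members.apply(MembershipUpdate{Channel: "#chat", Nick: "alice", Joined: true})
    members.apply(MembershipUpdate{Channel: "#chat", Nick: "alice", Joined: false})
    fmt.Println(len(members["#chat"])) // 0: alice joined and then left
}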

@slingamn
Member Author

In the MVP of this, all password checking will happen on the hub, but later on we'll probably want a mechanism to defer it to the leaf (an in-memory cache of accounts and hashed passwords?).
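
As a sketch of what deferring verification to the leaf might look like, assuming the hub pushes account names with bcrypt passphrase hashes down to the leaves (both of which are assumptions, not part of the MVP):

// Hypothetical sketch of a leaf-side credential cache: the hub pushes
// account names and their bcrypt passphrase hashes to the leaf, which
// can then check passwords without a round trip to the hub.
package main

import (
    "fmt"
    "sync"

    "golang.org/x/crypto/bcrypt"
)

type credentialCache struct {
    sync.RWMutex
    hashes map[string][]byte // account name -> bcrypt hash
}

func (c *credentialCache) update(account string, hash []byte) {
    c.Lock()
    defer c.Unlock()
    c.hashes[account] = hash
}

func (c *credentialCache) verify(account, passphrase string) bool {
    c.RLock()
    hash, ok := c.hashes[account]
    c.RUnlock()
    if !ok {
        return false // unknown account: fall back to asking the hub
    }
    return bcrypt.CompareHashAndPassword(hash, []byte(passphrase)) == nil
}

func main() {
    cache := &credentialCache{hashes: make(map[string][]byte)}
    hash, _ := bcrypt.GenerateFromPassword([]byte("hunter2"), bcrypt.DefaultCost)
    cache.update("alice", hash)
    fmt.Println(cache.verify("alice", "hunter2")) // true
}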

@DanielOaks
Member

  1. ah, 'cattle not pets' makes sense then, mad.
  2. yeah, not-IRC for the s2s format makes sense for us. traditional s2s protocols do weird things that change the original line a fair bit and all; if we can avoid that then that'd be cool.
  3. 👍
  4. yeah fair enough, we'll see how impl shakes out. might be useful for an oper or whatever to connect directly and not via a leaf potentially.
  5. 👍

@tacerus
Contributor

tacerus commented Feb 11, 2021

Hi! This all sounds very interesting, and we can't wait to test it out. Did I understand correctly that all IRC clients will connect exclusively to leaf nodes, and none will connect to the root directly? If yes, will it be feasible to have one leaf on the same server as the root?

@slingamn
Member Author

If yes, will it be feasible to have one leaf on the same server as the root?

Yes, but what would this accomplish?

@slingamn
Member Author

Oh --- I didn't mean to suggest that the old single-process mode of operation will no longer be supported.

@DanielOaks
Member

DanielOaks commented Feb 11, 2021 via email

@tacerus
Contributor

tacerus commented Feb 12, 2021

I was thinking more that one would ideally have at least two leaves in such a configuration (otherwise one could just run the single mode of operation) - if the resources of the root server are good enough, could someone just run one of their leaves on it, or are there any drawbacks to doing so?

@tacerus
Contributor

tacerus commented Feb 12, 2021

@DanielOaks' point is a good one too, though - assuming that your root server is for some reason more reliable than the others.

@slingamn
Member Author

if the resources of the root server are good enough, could someone just run one of their leaves on it, or are there any drawbacks to doing so?

My intuition is that in this setup, the leaf will compete with the root for resources and you might be better off running a single node. But I'm not sure.

@slingamn
Member Author

Answering some other questions received:

  1. Replacing the root server will be fairly disruptive, unfortunately; the network will go down for a period on the order of a minute. It should be possible to minimize the extent to which this occurs (StatefulSet seems potentially relevant).
  2. Upgrading the cluster will also cause downtime on the order of minutes.

@slingamn slingamn removed the release blocker Blocks release label Mar 11, 2021
@slingamn slingamn modified the milestones: v2.6, v2.7 Apr 7, 2021
@slingamn slingamn modified the milestones: v2.7, v2.8 May 30, 2021
@slingamn slingamn modified the milestones: v2.8, v2.9 Aug 26, 2021
@slingamn slingamn modified the milestones: v2.9, v2.10 Jan 2, 2022
@janc13

janc13 commented Jan 9, 2022

If I understand correctly this means that if the root node goes down (e.g. for a reboot) or becomes unreachable (e.g. network issues), then the whole IRC network goes down with it?

The original IRC federation design was not (only) to be able to scale horizontally, but also to be resilient to such infrastructure outages (and we’ve seen over the years that even the biggest and most experienced internet companies aren’t immune to such calamities…).

@slingamn
Member Author

slingamn commented Jan 9, 2022

As discussed in the issue description, this is in fact the status quo for conventional IRC networks: the services node is a SPOF for the entire network.
