
Clustering: Add clustering support with ToCloud9 integration. #504

Closed
wants to merge 4 commits into from

Conversation

walkline
Contributor

@walkline walkline commented Feb 20, 2024

🍰 Pull request

The purpose of this PR is to evaluate possible required changes to the core and discuss possible integration with ToCloud9 project. The changes don't provide full clustering support and would probably need refactoring, possibly with hiding the implementation behind a CMake flag.

A demo with these changes is available here - https://www.youtube.com/watch?v=XnehuGA7Jjc

So basically, TC9 integration brings a bunch of microservices (currently 9) that run alongside mangosd and enable clustering support, allowing the distribution of players between several mangosd instances.

How does everything work? In short, microservices (and mangosd using libsidecar) use gRPC + NATS to communicate with each other. Players connect to game-load-balancer(s) instead of directly to mangosd. This game-load-balancer offloads encryption, analyzes packets, and in some cases, uses other services to handle packets or sends raw packets to mangosd via a TCP connection (the same connection that player <-> mangosd uses without clustering).
There is a lot to tell about the architecture, but unfortunately, at the moment there is only this reading (skip to the "ToCloud9 Architecture" section), which provides some architecture details (sorry, I'm a bad writer). But feel free to ask any questions.

How2Test

The installation guide will be available later.

Windows not supported (only with WSL)

You can still build the project and use it on Windows with the clustering mode disabled as you do now. However, once you enable it in config, you will receive the corresponding error message.
But you still can use WSL/Docker/etc to run it.


How to add support for Windows?

Three options are available:

  1. Build CMaNGOS with gcc (MinGW) and update the CMake files.
  2. Rewrite this lib from Go to C++.
  3. Wait for one of these features to be implemented in Go: cmd/cgo: support clang on Windows golang/go#17014, cmd/link: support msvc object files golang/go#20982.

Implement overriding of configuration from the .conf file with environment variables.
Environment variable keys are autogenerated based on the keys defined in the .conf file.
Usage example:
$ export CM_DATA_DIR=/usr
$ CM_WORLD_SERVER_PORT=8080 ./mangosd
The new env key format:
Mangosd_Rate_Health
Mangosd_DataDir
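To illustrate the idea, here is a minimal C++ sketch of how such key derivation and override lookup might work. MakeEnvKey and GetConfigValue are hypothetical names invented for this example (not the PR's actual API), and the Mangosd_ prefix handling is an assumption based on the key format shown above (the usage example also shows a CM_* form):

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: derive an environment variable key from a .conf key.
// Per the key format above, "Rate.Health" would become "Mangosd_Rate_Health".
std::string MakeEnvKey(const std::string& prefix, const std::string& confKey)
{
    std::string key = prefix + "_" + confKey;
    for (char& c : key)
        if (c == '.')
            c = '_';          // dots in conf keys are not valid in env names
    return key;
}

// Hypothetical lookup: prefer the environment variable when it is set,
// otherwise fall back to the value read from the .conf file.
std::string GetConfigValue(const std::string& confKey, const std::string& confValue)
{
    const char* env = std::getenv(MakeEnvKey("Mangosd", confKey).c_str());
    return env ? std::string(env) : confValue;
}
```

A config loader built this way keeps the .conf file authoritative while letting container or systemd deployments override individual keys without editing files.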
@walkline walkline marked this pull request as draft February 20, 2024 10:55
@insunaa
Contributor

insunaa commented Feb 20, 2024

I'll edit this comment as I go along. First question I have:
Why gRPC? It's much slower than other messaging tools as volume increases. The main place I can see gRPC being used is when the communication goes over the internet, it being a protocol built on top of HTTP, but within the same local network?


player->SaveToDB();

// TODO: unfortunately mangos doesn't provide API to track saving progress :(
Contributor


We have async db queries, we can probably make some changes to allow for SaveToDB to run a callback on completion
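As a rough sketch of what such a save-completion callback could look like (AsyncSaveQueue, QueueSave, and ProcessPending are hypothetical names for illustration, not CMaNGOS's actual database API):

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical sketch: an async save queue that invokes a completion callback
// once a character save has been flushed, so a cluster sidecar could be told
// when it is safe to hand the player over to another node.
class AsyncSaveQueue
{
public:
    using SaveCallback = std::function<void(bool /*success*/)>;

    void QueueSave(uint32_t playerGuid, SaveCallback onDone)
    {
        m_pending.push(SaveJob{playerGuid, std::move(onDone)});
    }

    // In a real core this would run on the DB worker thread after the
    // transaction commits; here we drain synchronously to show the shape.
    void ProcessPending()
    {
        while (!m_pending.empty())
        {
            SaveJob& job = m_pending.front();
            bool success = true;        // pretend the DB write succeeded
            if (job.onDone)
                job.onDone(success);    // completion callback fires here
            m_pending.pop();
        }
    }

private:
    struct SaveJob
    {
        uint32_t guid;
        SaveCallback onDone;
    };
    std::queue<SaveJob> m_pending;
};
```

The point of the callback is that the caller never has to poll: whatever code queued the save (teleport, logout, node transfer) learns exactly when the write finished.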

Contributor Author


Yeah, it would be nice, but at the moment there is a workaround - walkline/ToCloud9@fd906b5#diff-cf09120953fd248dcecc81b97b5a684f7ecba5ec076f31f6cf776dac99c9b27cR130

@killerwife
Contributor

This warrants a longer discussion but I will try to give at least a brief response.

First, I need to state that I am inherently against clustering in cmangos in principle, due to the debugging overhead and complication it brings until most systems are finalized (which they aren't by any means). Luckily for everyone involved, I can be outvoted.

Now that that is out of the way, we know the proper official clustering structure we are supposed to follow, and sadly, this is not it. The official structure is this:

  1. Login server
  2. Session server
  3. Realm server
  4. World servers (one for each continent)
  5. Instance server (and or battlegroup server)

To achieve this, a bigger restructuring of the codebase is needed, and it is actually something I have been working towards, thread-context-wise, for several years now. And it would be insane to maintain it as an open-source project where I am doing 50% of the core work on my own (hence the hard pass from me).

@cyberium and I have been considering adding a protobuf link between realmd and mangosd for a long time, possibly even gRPC + protobuf, but in the initial step just to send session keys instead of doing it through the DB. However, that is an incredibly scaled-down version of this proposal.

And now onto the PR.

It contains some ARM stuff that should probably be separate.

Appending packets on top of the existing packet handlers is heresy; it would make maintaining several versions difficult. Each communication space needs to have its own opcode set to keep the blizzlike stuff perfectly pristine. (Unlike Acore, this is a reverse-engineering project with that as its #1 priority, not a commercial core.)

Lastly, there is a lot of stuff that this PR ignores, for example the group code, which needs to be properly re-engineered to be thread safe to begin with.

That being said, your work to pull this off is commendable, and don't let my negativity discourage you from what you are trying to achieve.

@insunaa
Contributor

insunaa commented Feb 20, 2024

Just out of curiosity, in this structure, what do each of the components do?

  1. Login server
  2. Session server
  3. Realm server
  4. World servers (one for each continent)
  5. Instance server (and or battlegroup server)

Login server I guess is what realmd currently does, and world server + instance server is what mangosd is doing, but what do Session and Realm servers do?

Also: boo gRPC >:(

@cyberium
Member

Congrats for this awesome work and results!

However, I am also negative on this PR.
The idea of adding a Go layer to cmangos is not what I have in mind.
Nor is the idea of not following something closer to the architecture KW described.

Your work may benefit those who want it now, or even those who want to rewrite it in C++ using something closer to the microservices architecture we have in mind.

So, to be clear, I do not expect this form of the PR to be merged any time soon. It would require a complete rethink that I suppose you are not ready for.

@killerwife
Contributor

Just out of curiosity, in this structure, what do each of the components do?

  1. Login server
  2. Session server
  3. Realm server
  4. World servers (one for each continent)
  5. Instance server (and or battlegroup server)

Login server I guess is what realmd currently does, and world server + instance server is what mangosd is doing, but what do Session and Realm servers do?

Also: boo gRPC >:(

The Session server would hold our session object, and all the other nodes would send it packets to be forwarded to the client. (My LFG code in wotlk has to emulate this because the session is basically not thread-safely accessible from the async LFG thread.)

Realm is our World object, World is our Map object, and Instance is also our Map objects, but those of type Instanceable (not static world maps like Kalimdor, EK, Outland, Northrend and the DK zones).

@insunaa
Contributor

insunaa commented Feb 20, 2024

Ahhh, OK. So the Session server is basically the router / main interface to the internet, and then the Realm server is basically the singleton for GUIDs/consistency, etc.?

@walkline
Contributor Author

I'll edit this comment as I go along. First question I have: Why gRPC? It's much slower than other messaging tools as volume increases. The main place I can see gRPC being used is when the communication goes over the internet, it being a protocol built on top of HTTP, but within the same local network?

I think gRPC is the first option that comes to mind when you think about synchronous communication in distributed systems (not over the internet).
It's based on the HTTP/2 protocol (not on HTTP/1, which is the default meaning when you mention HTTP), and it's much better than vanilla HTTP.
It has a nice ecosystem that allows easy integration with other components (like writing a website for an auction house that makes gRPC calls to the auction service) and using tools like service mesh, load balancers, etc.
The code generation helps a lot to avoid typos/mistakes when using your protocol.
Also, the places where gRPC is used (though I think it's fast) are not performance-sensitive - things like inviting to a guild/group, sending a chat message, sending mail, etc.

@walkline
Contributor Author

we know the proper official clustering structure we are supposed to follow, and sadly, this is not it. The official structure is this:

  1. Login server
  2. Session server
  3. Realm server
  4. World servers (one for each continent)
  5. Instance server (and or battlegroup server)

That doesn't necessarily mean that we should follow it.
Blizzard's architecture was created >20 years ago; I think they would make some corrections if they started this project nowadays.
Also, we should take into account our current architecture. I agree with your words regarding a big codebase restructuring. But with the TC9 architecture, major changes to the core aren't required, and it provides similar benefits to Blizzard's architecture.

It contains some arm stuff that should be probably separate.

Yeah, I agree that this PR is messy, but the goal of this PR is not to be merged, but to give you an idea of how the integration works. Nonetheless, I agree that changes like the ARM ones don't help with that, sorry 😅

Appending packets on top the existing packet handlers is heresy, would make maintaining several versions difficult. Each communication space needs to have its own opcodeset to keep the blizzlike stuff perfectly pristine. (this is unlike Acore #1 priority reverse engineering project, not commercial core)

Are you referring to these changes: walkline@23d8a1a#diff-1c6279373626ee45cc739c22846da733265493a1f45f702d59140bae1f3d3abcR1349 ?
If so, I totally agree; I added this before implementing gRPC support on the worldserver side, so I'm planning to move it there.

Lastly, there is a lot of stuff that this PR ignores, like for example the group stuff, which needs to be properly re-engineered to be thread safe to begin with.

Indeed, some implementations are a little bit naive and don't cover all cases, like groups (also, I didn't touch instance binding/saves; the code is completely different from TC). But I don't think a big re-engineering of groups is required to make it work.

Also, please don't forget that the goal is to make this clustering optional (with a config or CMake flag option). Making it optional gives you the ability to not care much about clustering, since >90% of users will continue to use the current setup.
In addition to that, I think that integration with TC9 would require a comparatively small effort, and the same would be true of ripping the clustering code out of the core.

@killerwife
Contributor

Well, currently the best-case scenario for our infrastructure is a high-cache, medium-core-count bare-metal CPU, like, let's say, a 5950X, which will make it run better than any clustering without rewrites, or any cloud provider. (We are single-thread bound, not multithreading bound.)

And here is the thing: I run my own server on vengeance, but the things I want on vengeance are kept in a separate repo, because this project is dedicated exactly to recreating the original vision of the game with its own bugs. For example, having LFG clustered like official did reproduces the exact level-up bugs Blizzard had. I have no problem with someone patching them in their own repo in 5 minutes; however, this is a blizzlike project, not a commercial venture, as I said. So on this point, even if I agreed with you that this is good for server hosting (which I do not, and I strongly disagree with even the leader of acore on this), that still would not mean I would want it in cmangos. (Something something purity of principle.)

(Also, a note to insunaa: gRPC can run on HTTP/3 (QUIC), which is even more efficient.)

@walkline
Contributor Author

Congrats for this awesome work and results!

However, I am also negative on this PR. The idea of adding a Go layer to cmangos is not what I have in mind. Nor is the idea of not following something closer to the architecture KW described.

Your work may benefit those who want it now, or even those who want to rewrite it in C++ using something closer to the microservices architecture we have in mind.

So, to be clear, I do not expect this form of the PR to be merged any time soon. It would require a complete rethink that I suppose you are not ready for.

@cyberium Thanks for the feedback.
I see some benefits in rewriting libsidecar from Go to C++, but having the microservices in C++ wouldn't bring a lot of benefits to the table.

Well, currently the best-case scenario for our infrastructure is a high-cache, medium-core-count bare-metal CPU, like, let's say, a 5950X, which will make it run better than any clustering without rewrites, or any cloud provider. (We are single-thread bound, not multithreading bound.)

@killerwife That’s true, the current setup works fine for most users. However, there are several things to consider:

  1. The current setup has a limit. With the suggested architecture, you have the option of horizontal scalability, which raises that limit.
  2. The suggested architecture enables high availability.
  3. Cross-realm support. For example, implementing cross-realm BGs with the suggested architecture is not that big a deal.

Depending on the person, it can bring some new knowledge and fun (maybe this is only relevant for me 😁).

But I understand your concerns regarding maintenance and complexity (maybe with a bit of an overreaction), and since there are no maintainers interested in this (which is totally fine), I think it's better to close this PR (though the discussion may continue).

Anyway, if someone is interested in this (or changes their mind), I will be glad to discuss/cooperate.

@walkline walkline closed this Feb 20, 2024
@cyberium
Member

cyberium commented Feb 20, 2024

Thank you for your understanding and for considering the feedback. While we may have different views on the architecture and direction of the project, I truly appreciate your contribution and the effort you've put into this PR.

Although this specific PR may not align with our current plans for the repository, your insights and suggestions are valuable for our future development efforts. We're always open to hearing your thoughts and ideas, and I hope we can continue to benefit from your expertise and contributions moving forward.

Thank you again for your work on this, and I look forward to collaborating with you on future endeavors.

Best regards
