
Clustering: Add clustering support with ToCloud9 integration. #504

Closed
wants to merge 4 commits into from

Conversation

walkline
Contributor

@walkline walkline commented Feb 20, 2024

🍰 Pull request

The purpose of this PR is to evaluate possible required changes to the core and discuss possible integration with ToCloud9 project. The changes don't provide full clustering support and would probably need refactoring, possibly with hiding the implementation behind a CMake flag.

A demo with these changes is available here - https://www.youtube.com/watch?v=XnehuGA7Jjc

So basically, TC9 integration brings a bunch of microservices (currently 9) that run alongside mangosd and enable clustering support, allowing the distribution of players between several mangosd instances.

How does everything work? In short, microservices (and mangosd using libsidecar) use gRPC + NATS to communicate with each other. Players connect to game-load-balancer(s) instead of directly to mangosd. This game-load-balancer offloads encryption, analyzes packets, and in some cases, uses other services to handle packets or sends raw packets to mangosd via a TCP connection (the same connection that player <-> mangosd uses without clustering).
There is a lot to tell about the architecture, but unfortunately, at the moment there is only this reading (skip to the "ToCloud9 Architecture" section), which provides some architecture details (sorry, I'm a bad writer). But feel free to ask any questions.

How2Test

The installation guide will be available later.

Windows not supported (only with WSL)

You can still build the project and use it on Windows with the clustering mode disabled as you do now. However, once you enable it in config, you will receive the corresponding error message.
But you still can use WSL/Docker/etc to run it.


How to add support for Windows?

Three options are available:

  1. Build CMaNGOS with gcc (MinGW) and update the CMake files.
  2. Rewrite this lib from Go to C++.
  3. Wait for one of these features to be implemented in Go: cmd/cgo: support clang on Windows golang/go#17014, cmd/link: support msvc object files golang/go#20982.

Implement overriding of configuration from the .conf file with environment variables.
Environment variable keys are autogenerated based on the keys defined in the .conf file.
Usage example:
$ export CM_DATA_DIR=/usr
$ CM_WORLD_SERVER_PORT=8080 ./mangosd
The new env key format:
Mangosd_Rate_Health
Mangosd_DataDir
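To illustrate the idea, here is a minimal C++ sketch of how such key derivation and override lookup might work. MakeEnvKey and GetConfigValue are hypothetical names invented for this example (not the PR's actual API), and the Mangosd_ prefix handling is an assumption based on the key format shown above (the usage example also shows a CM_* form):

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: derive an environment variable key from a .conf key.
// Per the key format above, "Rate.Health" would become "Mangosd_Rate_Health".
std::string MakeEnvKey(const std::string& prefix, const std::string& confKey)
{
    std::string key = prefix + "_" + confKey;
    for (char& c : key)
        if (c == '.')
            c = '_';          // dots in conf keys are not valid in env names
    return key;
}

// Hypothetical lookup: prefer the environment variable when it is set,
// otherwise fall back to the value read from the .conf file.
std::string GetConfigValue(const std::string& confKey, const std::string& confValue)
{
    const char* env = std::getenv(MakeEnvKey("Mangosd", confKey).c_str());
    return env ? std::string(env) : confValue;
}
```

A config loader built this way keeps the .conf file authoritative while letting container or systemd deployments override individual keys without editing files.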
@walkline walkline marked this pull request as draft February 20, 2024 10:55
@insunaa
Contributor

insunaa commented Feb 20, 2024

I'll edit this comment as I go along. First question I have:
Why gRPC? It's much slower than other messaging tools as volume increases. The main place I can see gRPC being used is when the communication goes over the internet, it being a protocol built on top of HTTP, but within the same local network?


player->SaveToDB();

// TODO: unfortunately mangos doesn't provide API to track saving progress :(
Contributor


We have async db queries, we can probably make some changes to allow for SaveToDB to run a callback on completion
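As a rough sketch of what such a save-completion callback could look like (AsyncSaveQueue, QueueSave, and ProcessPending are hypothetical names for illustration, not CMaNGOS's actual database API):

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical sketch: an async save queue that invokes a completion callback
// once a character save has been flushed, so a cluster sidecar could be told
// when it is safe to hand the player over to another node.
class AsyncSaveQueue
{
public:
    using SaveCallback = std::function<void(bool /*success*/)>;

    void QueueSave(uint32_t playerGuid, SaveCallback onDone)
    {
        m_pending.push(SaveJob{playerGuid, std::move(onDone)});
    }

    // In a real core this would run on the DB worker thread after the
    // transaction commits; here we drain synchronously to show the shape.
    void ProcessPending()
    {
        while (!m_pending.empty())
        {
            SaveJob& job = m_pending.front();
            bool success = true;        // pretend the DB write succeeded
            if (job.onDone)
                job.onDone(success);    // completion callback fires here
            m_pending.pop();
        }
    }

private:
    struct SaveJob
    {
        uint32_t guid;
        SaveCallback onDone;
    };
    std::queue<SaveJob> m_pending;
};
```

The point of the callback is that the caller never has to poll: whatever code queued the save (teleport, logout, node transfer) learns exactly when the write finished.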

Contributor Author


Yeah, it would be nice, but at the moment there is a workaround - walkline/ToCloud9@fd906b5#diff-cf09120953fd248dcecc81b97b5a684f7ecba5ec076f31f6cf776dac99c9b27cR130

@killerwife
Contributor

This warrants a longer discussion but I will try to give at least a brief response.

First, I need to state that I am inherently against clustering in cmangos in principle, due to the debugging overhead and complication it brings until most systems are finalized (which they aren't by any means). Luckily for everyone involved, I can be outvoted.

Now that that is out of the way, we know the proper official clustering structure we are supposed to follow, and sadly, this is not it. The official structure is this:

  1. Login server
  2. Session server
  3. Realm server
  4. World servers (one for each continent)
  5. Instance server (and or battlegroup server)

To achieve this, a bigger restructuring of the codebase is needed, and it is actually something I have been working towards, thread-context-wise, for several years now. And it would be insane to maintain it as an open-source project where I am doing 50% of the core work on my own (hence the hard pass from me).

@cyberium and I have been considering adding a protobuf link between realmd and mangosd for a long time, possibly even gRPC + protobuf, but in the initial step just to send session keys instead of doing it through the DB. However, that is an incredibly scaled-down version of this proposal.

And now onto the PR.

It contains some ARM stuff that should probably be separate.

Appending packets on top of the existing packet handlers is heresy; it would make maintaining several versions difficult. Each communication space needs to have its own opcode set to keep the blizzlike stuff perfectly pristine. (Unlike Acore, this is a reverse-engineering project with that as its #1 priority, not a commercial core.)

Lastly, there is a lot of stuff that this PR ignores, for example the group code, which needs to be properly re-engineered to be thread safe to begin with.

That being said, your work to pull this off is commendable, and don't let my negativity discourage you from what you are trying to achieve.

@insunaa
Contributor

insunaa commented Feb 20, 2024

Just out of curiosity, in this structure, what do each of the components do?

  1. Login server
  2. Session server
  3. Realm server
  4. World servers (one for each continent)
  5. Instance server (and or battlegroup server)

Login server I guess is what realmd currently does, and world server + instance server is what mangosd is doing, but what do Session and Realm servers do?

Also: boo gRPC >:(

@cyberium
Member

Congrats for this awesome work and results!

However, I am also negative on this PR.
The idea of adding a Go layer to cmangos is not what I have in mind.
Nor is the idea of not following something closer to the architecture KW described.

Your work may benefit those who want it now, or even those who want to rewrite it in C++ using something closer to the microservices architecture we have in mind.

So, to be clear, I do not expect this form of the PR to be merged any time soon. It would require a complete rethink that I suppose you are not ready for.

@killerwife
Contributor

Just out of curiosity, in this structure, what do each of the components do?

  1. Login server
  2. Session server
  3. Realm server
  4. World servers (one for each continent)
  5. Instance server (and or battlegroup server)

Login server I guess is what realmd currently does, and world server + instance server is what mangosd is doing, but what do Session and Realm servers do?

Also: boo gRPC >:(

The Session server would hold our session object, and all the other nodes would send it packets to be forwarded to the client. (My LFG code in wotlk has to emulate this because the session is basically not thread-safely accessible from the async LFG thread.)

Realm is our World object, World is our Map object, and Instance is also our Map objects, but those of type Instanceable (not static world maps like Kalimdor, EK, Outland, Northrend and the DK zones).

@insunaa
Contributor

insunaa commented Feb 20, 2024

Ahhh, OK. So the Session server is basically the router / main interface to the internet, and then the Realm server is basically the singleton for GUIDs/consistency, etc.?

@walkline
Contributor Author

I'll edit this comment as I go along. First question I have: Why gRPC? It's much slower than other messaging tools as volume increases. The main place I can see gRPC being used is when the communication goes over the internet, it being a protocol built on top of HTTP, but within the same local network?

I think gRPC is the first option that comes to mind when you think about synchronous communication in distributed systems (not over the internet).
It's based on the HTTP/2 protocol (not on HTTP/1, which is the default meaning when you mention HTTP), and it's much better than vanilla HTTP.
It has a nice ecosystem that allows easy integration with other components (like writing a website for an auction house that makes gRPC calls to the auction service) and using tools like service mesh, load balancers, etc.
The code generation helps a lot to avoid typos/mistakes when using your protocol.
Also, the places where gRPC is used (though I think it's fast) are not performance-sensitive - things like inviting to a guild/group, sending a chat message, sending mail, etc.

@walkline
Contributor Author

we know the proper official clustering structure we are supposed to follow, and sadly, this is not it. The official structure is this:

  1. Login server
  2. Session server
  3. Realm server
  4. World servers (one for each continent)
  5. Instance server (and or battlegroup server)

That doesn't necessarily mean that we should follow it.
Blizzard's architecture was created >20 years ago; I think they would make some corrections if they started this project nowadays.
Also, we should take into account our current architecture. I agree with your words regarding a big codebase restructuring. But with the TC9 architecture, major changes to the core aren't required, and it provides similar benefits to Blizzard's architecture.

It contains some arm stuff that should be probably separate.

Yeah, I agree that this PR is messy, but the goal of this PR is not to be merged, but to give you an idea of how the integration works. Nonetheless, I agree that changes like the ARM ones don't help with that, sorry 😅

Appending packets on top the existing packet handlers is heresy, would make maintaining several versions difficult. Each communication space needs to have its own opcodeset to keep the blizzlike stuff perfectly pristine. (this is unlike Acore #1 priority reverse engineering project, not commercial core)

Are you referring to these changes: walkline@23d8a1a#diff-1c6279373626ee45cc739c22846da733265493a1f45f702d59140bae1f3d3abcR1349 ?
If so, I totally agree; I added this before implementing gRPC support on the worldserver side, so I'm planning to move it there.

Lastly, there is a lot of stuff that this PR ignores, like for example the group stuff, which needs to be properly re-engineered to be thread safe to begin with.

Indeed, some implementations are a little bit naive and don't cover all cases, like groups (also, I didn't touch instance binding/saves; the code is completely different from TC). But I don't think a big re-engineering of groups is required to make it work.

Also, please don't forget that the goal is to make this clustering optional (with a config or CMake flag option). Making it optional gives you the ability to not care much about clustering, since >90% of users will continue to use the current setup.
In addition to that, I think that integration with TC9 would require a comparatively small effort, and the same would be true of ripping the clustering code out of the core.

@killerwife
Contributor

Well, currently the best-case scenario for our infrastructure is a high-cache, medium-core-count bare-metal CPU, like, let's say, a 5950X, which will make it run better than any clustering without rewrites, or any cloud provider. (We are single-thread bound, not multithreading bound.)

And here is the thing: I run my own server on vengeance, but the things I want on vengeance are kept in a separate repo, because this project is dedicated exactly to recreating the original vision of the game with its own bugs. For example, having LFG clustered like official did reproduces the exact level-up bugs Blizzard had. I have no problem with someone patching them in their own repo in 5 minutes; however, this is a blizzlike project, not a commercial venture, as I said. So on this point, even if I agreed with you that this is good for server hosting (which I do not, and I strongly disagree with even the leader of acore on this), that still would not mean I would want it in cmangos. (Something something purity of principle.)

(Also, a note to insunaa: gRPC can run on HTTP/3 (QUIC), which is even more efficient.)

@walkline
Contributor Author

Congrats for this awesome work and results!

However, I am also negative on this PR. The idea of adding a Go layer to cmangos is not what I have in mind. Nor is the idea of not following something closer to the architecture KW described.

Your work may benefit those who want it now, or even those who want to rewrite it in C++ using something closer to the microservices architecture we have in mind.

So, to be clear, I do not expect this form of the PR to be merged any time soon. It would require a complete rethink that I suppose you are not ready for.

@cyberium Thanks for the feedback.
I see some benefits in rewriting libsidecar from Go to C++, but having the microservices in C++ wouldn't bring a lot of benefits to the table.

Well, currently the best-case scenario for our infrastructure is a high-cache, medium-core-count bare-metal CPU, like, let's say, a 5950X, which will make it run better than any clustering without rewrites, or any cloud provider. (We are single-thread bound, not multithreading bound.)

@killerwife That’s true, the current setup works fine for most users. However, there are several things to consider:

  1. The current setup has a limit. With the suggested architecture, you have the option of horizontal scalability, which raises that limit.
  2. The suggested architecture enables high availability.
  3. Cross-realm support. For example, implementing cross-realm BGs with the suggested architecture is not that big a deal.

Depending on the person, it can bring some new knowledge and fun (maybe this is only relevant for me 😁).

But I understand your concerns regarding maintenance and complexity (maybe with a bit of an overreaction), and since there are no maintainers interested in this (which is totally fine), I think it's better to close this PR (though the discussion may continue).

Anyway, if someone is interested in this (or changes their mind), I will be glad to discuss/cooperate.

@walkline walkline closed this Feb 20, 2024
@cyberium
Member

cyberium commented Feb 20, 2024

Thank you for your understanding and for considering the feedback. While we may have different views on the architecture and direction of the project, I truly appreciate your contribution and the effort you've put into this PR.

Although this specific PR may not align with our current plans for the repository, your insights and suggestions are valuable for our future development efforts. We're always open to hearing your thoughts and ideas, and I hope we can continue to benefit from your expertise and contributions moving forward.

Thank you again for your work on this, and I look forward to collaborating with you on future endeavors.

Best regards
