[Docs] Adding proposal for exposing metrics per GameServer #1845

Closed
domgreen opened this issue Oct 14, 2020 · 14 comments
Labels
  • area/operations: Installation, updating, metrics etc
  • kind/design: Proposal discussing new features / fixes and how they should be implemented
  • kind/feature: New features for Agones
  • wontfix: Sorry, but we're not going to do that.

Comments

@domgreen
Contributor

Is your feature request related to a problem? Please describe.
I would like to be able to get metrics such as connected users (currently via player tracking), as well as wider arbitrary metrics from game servers, such as frame rate.

Describe the solution you'd like
Outlined proposals below

Describe alternatives you've considered
Outlined in proposal below

Additional context
Relates to:
#1035
#1036
#1037

Game Metrics from Dedicated Game Servers

Summary

Engineers often want to observe "metrics", which can be defined as "raw measurements, generally with the intent to produce continuous summaries of those measurements", from within their Dedicated Game Servers (DGS).

A proposal to do this would be to allow the DGS to send a small number (around 100) of metrics via the SDK to the running Agones sidecar. This requires a simple addition to the sidecar to allow it to expose these metrics via the already-used OpenCensus integrations. We can update the SDK to have a metrics sub-API (much like the alpha and beta stages) to support metric work.

Metrics from dedicated servers often break down into two categories: firstly, metrics that Agones already knows about (count of players, free capacity etc.), and secondly, arbitrary metrics from within the DGS itself, such as frame rate, number of sessions, total rings collected etc.

The main concerns with exposing arbitrary metrics are to not expose too many, to not impact the running DGS, and to not reinvent the wheel but choose the correct technologies to work together.

We could theoretically break this into two proposals:

  • expose Agones-related metrics (things the sidecar knows about)
  • expose DGS metrics (things Agones doesn't need to know about)

Related Links

Goals

  • Expose metrics around the current connected players
  • Expose arbitrary metrics from within the dedicated game server
    • eg. frame rate, number of concurrent sessions, number of failed connections
    • to confirm if this is a desire of the Agones project
  • Agnostic to the choice of running engine
    • eg. supported by all engines from C#, Rust through to Unreal and Unity
  • Support the basics of Counters, Gauge (UpDownCounter) and labels on these time series
  • Minimal impact on the Mem/CPU of the running DGS

Non-Goals

  • Expose metrics around the infrastructure usage of the DGS
    • this can be achieved with existing projects
  • Reinvent an existing solution
  • Support an infinite number of arbitrary metrics
  • Additional aggregation of metrics
  • Storing metrics

Proposal: Limited number of metrics exposed via sidecar

The initial proposal would be to allow the sidecar to expose metrics itself. Initially this would be specific to the PlayerTracking data, such as the current number of connected users, the free space that can be allocated per DGS, and other metrics that are already known to Agones. This would use the existing OpenCensus (and later OTel) implementation used within Agones to allow these metrics to be scraped by other projects such as Prometheus. Once these metrics are exposed, we can simply add another ServiceMonitor, as is currently done with the controller, to tell the Prometheus Operator that metrics are available to scrape. This would probably require the addition of an agones-sidecar service to route the traffic, or falling back to a PodMonitor.

With the metrics known to Agones this is simple, as stated above. However, for metrics that are not natively known by Agones (frame rate, sessions etc.) we would need a way to expose them via the SDK to the sidecar. Here, we could simply add a small API that supports the basics of metrics (Counters, Gauges, Labels) and sends the data to the sidecar, where it is exposed in the same way as the Agones native metrics.

It would be up to the Game Engineers to call this API when data they wish to record changes, which would allow them to record the metrics that they are interested in from within the context of their own games.

I would envisage this API being a subset of the Agones SDK, as the Alpha and Beta sections currently are. I would also look to limit the number of metrics a game could send to the sidecar (this could be configurable); this would reduce the burden on the sidecar and reduce the explosion of metrics that are reported.
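As a rough illustration of the kind of capped metrics store the sidecar might keep, here is a minimal sketch in Python. Everything in it is hypothetical: the class name `SidecarMetrics`, the `add`/`set` methods, and the default cap of 100 series are assumptions made for illustration only, and none of them exist in the Agones SDK.

```python
from typing import Dict, Optional, Tuple

# Hypothetical sketch: none of these names exist in the Agones SDK.
SeriesKey = Tuple[str, Tuple[Tuple[str, str], ...]]


class MetricLimitExceeded(Exception):
    """Raised when a new time series would exceed the configured cap."""


class SidecarMetrics:
    """In-memory store for counters and gauges, capped at max_series."""

    def __init__(self, max_series: int = 100) -> None:
        self.max_series = max_series
        self._values: Dict[SeriesKey, float] = {}

    @staticmethod
    def _key(name: str, labels: Optional[Dict[str, str]]) -> SeriesKey:
        # Sort labels so the same label set always maps to the same series.
        return name, tuple(sorted((labels or {}).items()))

    def _slot(self, name: str, labels: Optional[Dict[str, str]]) -> SeriesKey:
        key = self._key(name, labels)
        # Only brand-new series count against the cap.
        if key not in self._values and len(self._values) >= self.max_series:
            raise MetricLimitExceeded(f"series limit {self.max_series} reached")
        return key

    def add(self, name: str, delta: float,
            labels: Optional[Dict[str, str]] = None) -> None:
        """Counter / UpDownCounter style update."""
        key = self._slot(name, labels)
        self._values[key] = self._values.get(key, 0.0) + delta

    def set(self, name: str, value: float,
            labels: Optional[Dict[str, str]] = None) -> None:
        """Gauge style update: the last written value wins."""
        key = self._slot(name, labels)
        self._values[key] = value

    def get(self, name: str, labels: Optional[Dict[str, str]] = None) -> float:
        return self._values[self._key(name, labels)]
```

Each distinct combination of name and labels counts as one series here, so the cap also bounds label cardinality. Whether a real implementation should reject, drop, or log over-limit series is left open.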

One advantage of this approach is that we end up storing metrics outside of the DGS and in the sidecar, causing much less of a memory impact on the running game. We also create a basic API so that all games running in the Agones project can report metrics in the same way.

The main problem, however, is whether it is actually the responsibility of Agones to report on metrics from within a running game server and, if so, whether we then also expose future APIs around events and other aspects of observability. I would argue that a mature platform would allow the engineers running their game server to pick up the key observability points that they are interested in, but I could be swayed that this is the responsibility of another project, much like CPU/Mem can be obtained directly from Kubernetes APIs without the need to interfere with the DGS itself.

Pro:

  • Small extension to the Agones SDK
  • SDK means all game engines can report in a uniform way
  • Native Agones and arbitrary metrics can be collected in the same way
  • low impact on the memory of the DGS
  • Prometheus scraping is already documented and advised as part of the Agones project
    • no need for extra tech as with the alternatives

Con:

  • Adding to the SDK surface area
  • Push based from the DGS to sidecar (not sure this is a con tbh)

Alternatives considered

Expose metrics via logs

This would be my proposal for getting arbitrary metrics from a DGS if we were to separate the two concerns of Agones vs non-Agones metrics. In that case I would expose /metrics in the sidecar as standard for the data known to Agones (connected users, spaces free etc.).

In this alternative we would use the inbuilt loggers from the engines to log in a given format that then log shippers would be able to turn into metrics. An example of this is the Prometheus sink for <vector.dev> which would allow you to transform logs and expose them to Prometheus.

The main concern around this is that we probably would not be able to form a decent standard for how to log metrics, so it would be up to the engineers running the servers and their game teams to discuss the best approach for each individual case.
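To make the log-based idea concrete, the sketch below shows roughly what a log shipper does when it turns log lines into metrics. The line format `METRIC <type> <name> <value> [key=value ...]` is invented purely for this example (no engine or shipper defines it), which is exactly the kind of per-team convention that would have to be agreed on.

```python
from typing import Dict, Iterable, Tuple

Series = Tuple[str, Tuple[Tuple[str, str], ...]]


def metrics_from_logs(lines: Iterable[str]) -> Dict[Series, float]:
    """Fold hypothetical 'METRIC ...' log lines into counters and gauges."""
    totals: Dict[Series, float] = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 4 or parts[0] != "METRIC":
            continue  # ordinary human-readable log line, not a metric
        kind, name, raw_value = parts[1], parts[2], parts[3]
        try:
            value = float(raw_value)
        except ValueError:
            continue  # malformed value, skip rather than crash the shipper
        # Remaining tokens of the form key=value become labels.
        labels = tuple(sorted(
            tuple(p.split("=", 1)) for p in parts[4:] if "=" in p
        ))
        key = (name, labels)
        if kind == "counter":
            totals[key] = totals.get(key, 0.0) + value  # counters accumulate
        elif kind == "gauge":
            totals[key] = value  # gauges keep the latest value
    return totals
```

A real deployment would more likely configure an existing shipper than write this by hand; the point is only that the game side needs nothing beyond its normal logger.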

Pros:

  • small amount of work in sidecar to expose Agones metrics
  • Agones deals with only Agones related metrics

Cons:

  • more work for engineers wanting to get metrics from within DGS
  • no nice API to program against
  • would need to form a standard within each company to send metrics via logs
  • some game engines (Unreal) have logs that are designed to be human readable and are therefore not JSON compatible

Expose an OpenTelemetry/Prometheus endpoint from within the DGS

This approach would be a more standardised way of exposing metrics via a running service and would most probably be supported in engines such as vanilla Rust, Go, JavaScript etc., but other game servers (Unreal, for example) do not support the idea of running a web server within the game to expose these metrics.

This also doesn't take into consideration that the sidecar has metrics such as the number of connected players, free space on the DGS etc. This means we would probably end up implementing a /metrics endpoint within both the DGS and its sidecar.
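For game servers that can run an HTTP listener, an in-process /metrics endpoint can be quite small. The following is a minimal sketch using only the Python standard library and the Prometheus text exposition format; the metric names, the `GAUGES` dictionary, and the helper names are assumptions for illustration, and a real server would normally use a Prometheus or OTel client library instead.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict

# Hypothetical in-process state that the game loop would update.
GAUGES: Dict[str, float] = {"frame_rate": 60.0, "concurrent_sessions": 0.0}


def render(gauges: Dict[str, float]) -> str:
    """Render gauges in the Prometheus text exposition format."""
    lines = []
    for name in sorted(gauges):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {gauges[name]}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render(GAUGES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # keep scrape requests out of the game's own logs


def serve_in_background(port: int) -> HTTPServer:
    """Start the endpoint on a daemon thread so the game loop is not blocked."""
    # Bound to localhost for the sketch; a scraped pod would listen on 0.0.0.0.
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The extra listener thread and the in-process metric state are where the memory impact listed under the cons comes from.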

Pros:

  • pull based
  • supported out of the box in some "engines"
  • would leave the engineers to choose technologies (Prom, OTel etc.)
  • no work on SDK needed

Cons:

  • unsupported in game engines such as Unity and Unreal
  • memory impact on the running DGS

Expose metrics via an agreed file format

Instead of sending metrics over the wire, this approach would write metrics to a specific file mounted on the pod in a well-known format (see the link above). This could then be picked up by something that exposes or ships the metrics to wherever they are needed for aggregation.
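A sketch of the file-based variant, assuming the agreed format is the Prometheus text exposition format and that the path is a volume shared with whatever collects the file. The function name and the metric names are hypothetical. The file is written to a temporary name and then renamed so that a reader never sees a half-written file.

```python
import os
import tempfile
from typing import Dict


def write_metrics_file(path: str, gauges: Dict[str, float]) -> None:
    """Atomically write gauges to `path` in the Prometheus text format."""
    lines = []
    for name in sorted(gauges):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {gauges[name]}")

    # Create the temp file in the same directory so the rename stays on one
    # filesystem, which is what makes os.replace atomic.
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    os.replace(tmp_path, path)
```

The game would call this on a timer rather than every frame; how often is a trade-off between freshness and disk writes.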

Pros:

  • less memory impact (assumed)

Cons:

  • would be down to the engineer to expose it in the needed format
  • above link only covers the Prometheus format, may not suit all needs
  • sidecar would be exposing a /metrics endpoint anyway
@domgreen domgreen added the kind/feature New features for Agones label Oct 14, 2020
@markmandel markmandel added the kind/design Proposal discussing new features / fixes and how they should be implemented label Oct 14, 2020
@markmandel
Collaborator

markmandel commented Oct 14, 2020

I am bad at metrics, so take my comment with a large grain of salt 😄 🧂

We currently expose some gameserver metrics (maybe just in aggregate) through the Agones controller looking at GameServer events.

Is there a good reason we shouldn't expose these metrics through the controller, rather than through the sidecar? (I think I know the answer, but figured I would ask the question)

@markmandel markmandel added the area/operations Installation, updating, metrics etc label Oct 14, 2020
@markmandel
Collaborator

I had another potentially fun question.

I feel like we've conflated two things here:

  1. Exposing Player Tracking data via metrics
  2. Exposing arbitrary game server metrics

Since Player Tracking data is stored on the CRD, we could use a similar pattern to what we have now for metrics -- wherein it comes from the controller for Player Tracking, but the Metrics being exposed directly on a GameServer for arbitrary metrics.

Just wondering if these two things should be designed separately? Not sure - just asking the question.

@domgreen
Contributor Author

TLDR; I think we both are coming to the same conclusion... 2 separate issues.

@markmandel this is something that I was hoping would come up ... my design calls this out that it probably is two separate problems. Was going to bring this up at the next community meeting.

PlayerTracking / Agones Metrics

I feel that there are Agones metrics (Player Tracking etc.) and arbitrary metrics (per DGS). As you mention, we could go with the approach of using the existing event based metrics for things that are in the CRD ... not sure why we don't just expose it via a /metrics endpoint on the sidecar, but if it's not broke 🤷

Which brings us onto the previous question why expose on the sidecar?

  • standard prometheus approach
  • allows much finer grained querying
  • reduces memory needed for the controller to serve metrics
  • allows extra metrics to be added with ease from the sidecar
  • IMO reduces complexity

Arbitrary Metrics

With arbitrary metrics this really comes down to the game server, which is why I have been struggling with the question "Is this even an Agones issue" ... Unreal especially is bad at giving an engineer metrics about what is inside the container, it just wasn't really designed in this way ... game servers written in Rust/Go/node etc. would be able to expose metrics in a standard way using OTel/Prom etc. so why add complexity just for Unreal/Unity?

TBH I suppose most AAAs will be using either Unreal/Unity or a self made engine, in this case if it is down to the engine I would go with my first alternative and aggregate metrics from logs using a log shipper like vector.

It might be down to a number of complementary projects to help here (that could be linked in Engine readmes), however everyone running Agones will eventually bump into this problem ... how to observe running DGSs.

I am in favour of splitting this issue ...

  • Agones metrics ... specifically PlayerTracking
    • with the events approach I think this is okay, not my personal preference but it already works 👍
    • can we enrich these events ... would be nice to have the ID of the player that joined (need to double check this) rather than just the current CCU is 2
  • Arbitrary Observability ... looking inside the black box that is a "modern" DGS
    • personally will look into events from logs that get shipped to OTEL tools and Prom via a shipper

@markmandel
Collaborator

That makes sense. I think for Arbitrary metrics, it 100% makes sense to expose it via the Sidecar, and have a limited set of metrics capabilities exposed. Not only to limit our SDK size, but also to gather feedback. It's also just a nice experience for users to have a metrics SDK pre-baked.

The line I would draw in the sand would be: If we're pushing out metrics based on CRD data, make it come from the controller, since it's tracking all that info anyway, and we already have the infra to manage that.

If we're pushing out metrics that are not based on CRD data (which arbitrary ones certainly aren't), then it should come through the sidecar (from above, and from our usual patterns, it sounds like /metrics is the way to go) -- this will also allow us to provide metrics on the sidecar itself, which could be useful as well, and operators only need to configure the one capture endpoint. I'm also thinking if we have predefined metrics that people can populate (frame rate seems like an obvious one), then this also falls into this bucket.

How does that sound?

As per why it's that way in the first place, I wonder if @cyriltovena will come back around and see us 😄 (also be interested in your take on the above).

I think it has a lot to do with the fact that the controller knows about all the gameservers - where the sidecar only knows about itself -- so it's easier to calculate aggregate data -- but like I said previously, I'm not very good at metrics, so I'm the wrong person to ask, and definitely take what I'm saying with a grain of salt.

@domgreen
Contributor Author

Main thing I don't fully grok yet is arbitrary metrics: is it something that Agones would want to take on, or is it a separate tool / project? Where does the scope of Agones end?

@markmandel
Collaborator

Oh, a note I should make - we should stick to OpenCensus, since that's our current library of choice for metrics (so we can support multiple backends) -- we may want to upgrade it, and/or it may also be time to look to move to OpenTelemetry if it's ready (research required).

> Main thing I don't fully grok yet is arbitrary metrics: is it something that Agones would want to take on, or is it a separate tool / project? Where does the scope of Agones end?

That's a good question. I feel like there are some out of the box metrics we can define that most people will want/need (frame rate seems like an obvious choice).

Say we do all that work -- we then have the ability to share (some?) of that interface with the user, which is pretty handy.

But I get your hesitation. What I would probably suggest - let's focus on player counts, and frame rate (or other specific ones if you have strong opinions), and see where we end up after that. That will give us some exploration of the area, and may provide more input into making the decision on arbitrary metrics.

It's possible you are right -- maybe we decide complex custom metrics are out of scope, and down to the user? Or maybe we look at it and realise it's a small amount of work to add it in and therefore "why not". Won't know until we try?

How does that sound?

@domgreen
Contributor Author

100% sticking with OpenCensus, no need to change till OTel is stable.

Okay, let's get a list of arbitrary metrics together (would be good to get input here as I'm struggling to think of too many that would be across the board) and mock out the proto to see what the API surface area might look like.

Will look at getting some kind of proto up next week, will be busy prepping for KubeCon so will be latter part of the week.

@markmandel
Collaborator

Sounds like a plan! 👍

@markmandel
Collaborator

This came up in Slack, but I wanted to capture my thoughts here:

The more we talk about this, the more I'm convinced this should be a standalone OSS project that has a sidecar (probably OpenTelemetry) and a gRPC/REST interface (maybe REST+proto?), and proper game engine SDKs that do appropriate async/threaded operations to send data to keep it off the main loop 👍

I don't think it has to be integrated with Agones at all. In fact it's probably more applicable if it isn't.

I think if the project existed, it would be an awesome addition to Third Party Libraries and Tools though.
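The "keep it off the main loop" part of that idea can be sketched as a bounded queue drained by a worker thread. This is only an illustration in Python: the class name `AsyncMetricSender`, the `send` callback, and the drop-when-full policy are all assumptions, and a real engine SDK would use the engine's own threading primitives.

```python
import queue
import threading
from typing import Callable, Optional, Tuple

Sample = Tuple[str, float]


class AsyncMetricSender:
    """Hypothetical sketch: hand samples to a worker thread without blocking."""

    def __init__(self, send: Callable[[Sample], None],
                 max_pending: int = 256) -> None:
        self._send = send
        self._queue: "queue.Queue[Optional[Sample]]" = queue.Queue(maxsize=max_pending)
        self.dropped = 0  # samples discarded because the queue was full
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def record(self, name: str, value: float) -> None:
        """Called from the game loop; never blocks, drops when full."""
        try:
            self._queue.put_nowait((name, value))
        except queue.Full:
            self.dropped += 1

    def _run(self) -> None:
        while True:
            sample = self._queue.get()
            if sample is None:  # sentinel pushed by close()
                break
            self._send(sample)  # slow network I/O happens here, off the main loop

    def close(self) -> None:
        """Flush everything already queued, then stop the worker."""
        self._queue.put(None)
        self._worker.join()
```

Dropping samples under back-pressure favours frame time over metric completeness; a counter-heavy workload might prefer to aggregate locally and send deltas instead.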

@github-actions

github-actions bot commented Aug 1, 2023

This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add 'awaiting-maintainer' label or add a comment. Thank you for your contributions

@github-actions github-actions bot added the stale Pending closure unless there is a strong objection. label Aug 1, 2023
@roberthbailey
Member

@domgreen - Is this something that you are still interested in helping drive forward?

@github-actions github-actions bot removed the stale Pending closure unless there is a strong objection. label Sep 1, 2023
@markmandel
Collaborator

I'm still of mind that this project shouldn't be part of Agones, so we should really close this issue.

@domgreen
Contributor Author

domgreen commented Sep 6, 2023

> I'm still of mind that this project shouldn't be part of Agones, so we should really close this issue.

Fully agree should have closed it earlier, my bad.

@markmandel
Collaborator

> Fully agree should have closed it earlier, my bad.

No worries at all! I'll close the issue now then.

Still an interesting project idea -- although I feel like everyone I know does log based metrics and calls it a day 😄

@markmandel markmandel added the wontfix Sorry, but we're not going to do that. label Sep 6, 2023
3 participants