
Poor packing with agones-allocator and Counter for high-capacity game servers #3992

Open
dmorgan-blizz opened this issue Sep 20, 2024 · 9 comments

@dmorgan-blizz

dmorgan-blizz commented Sep 20, 2024

What happened:

Allocated game servers are not efficiently packed, even though they have plenty of Counter space:
[screenshot: game_players]

As you can see, when making allocation requests, Agones starts to fill up the first server, but then, before that server is anywhere close to its 300 Counter capacity, it starts sending players to the Ready buffer server. Now there are multiple game servers active, but even though those servers also have plenty of space, Agones will still occasionally allocate more Ready buffer servers.

This also occurred when scaling agones-allocator replicas down to 1 (from 3), and actually, the packing was worse in this scenario, even with an allocation rate as low as 5/s.

What you expected to happen:
Agones should always pick an Allocated game server with available Counter capacity before choosing a Ready buffer server.

How to reproduce it (as minimally and precisely as possible):
Game server settings like the following:

kind: FleetAutoscaler
spec:
  policy:
    type: Buffer
    buffer:
      bufferSize: 1

kind: Fleet
spec:
  scheduling: Packed
  template:
    spec:
      counters:
        players:
          capacity: 300

where a game server runs for a period of time and players continually cycle in and out (using the SDK's DecrementCounterAsync(string key, long amount) when a game ends);

and an allocation request like the following:

GameServerSelectors = {
    new GameServerSelector {
        GameServerState = GameServerSelector.Types.GameServerState.Allocated,
        MatchLabels = {
            { "version", "1.2.3" },
        },
        Counters = {
            {
                "players", new CounterSelector {
                    MinAvailable = 1,
                }
            },
        },
    },
    new GameServerSelector {
        GameServerState = GameServerSelector.Types.GameServerState.Ready,
        MatchLabels = {
            { "version", "1.2.3" },
        },
        Counters = {
            {
                "players", new CounterSelector {
                    MinAvailable = 1,
                }
            },
        },
    },
},
Counters = {
    {
        "players", new CounterAction {
            Action = "Increment",
            Amount = 1,
        }
    },
},

using https://github.com/googleforgames/agones/blob/release-1.41.0/proto/allocation/allocation.proto and gRPC

agones:
  agones:
    allocator:
      service:
        http:
          enabled: false
        grpc:
          enabled: true
        serviceType: ClusterIP

Anything else we need to know?:
agones-allocator game server state does not seem to be shared between replicas (which seems like it should also be fixed, maybe Redis or something?), but as mentioned, even with only one replica, the packing was as bad or worse

Environment:

  • Agones version: 1.41.0
  • Kubernetes version (use kubectl version): 1.28.7-gke.1026001
  • Cloud provider or hardware configuration: GCP
  • Install method (yaml/helm): helm
  • Troubleshooting guide log(s):
  • Others:
dmorgan-blizz added the kind/bug label Sep 20, 2024
@igooch
Collaborator

igooch commented Sep 20, 2024

Could you try testing with only one node? That would rule out whether it's hitting these lines and returning before it gets to the CountsAndLists logic:

// prefer nodes that have the most Ready gameservers on them - they are most likely to be
// completely filled and least likely target for scale down.
if c1.Ready < c2.Ready {
	return false
}
if c1.Ready > c2.Ready {
	return true
}
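(As a possible shortcut if resizing the cluster down to one node is painful: pinning the test Fleet's game servers to a single node with a plain Kubernetes nodeSelector in the GameServer pod template should approximate the same condition, assuming the per-node comparison above only considers nodes that are actually hosting game servers. Sketch only; the hostname value is a placeholder.)

kind: Fleet
spec:
  scheduling: Packed
  template:
    spec:
      template:
        spec:
          nodeSelector:
            kubernetes.io/hostname: <single-test-node>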

@Jensaarai

Could you try testing with only one node? That would rule out whether it's hitting these lines and returning before it gets to the CountsAndLists logic

We can try; that will take longer to set up, though, so it might be a bit before I can report results.

@markmandel
Member

markmandel commented Oct 24, 2024

I am wondering if this has more to do with the way we cache during allocation.

i.e.

if err := c.allocationCache.RemoveGameServer(gs); err != nil {
	// this seems unlikely, but lets handle it just in case
	req.response <- response{request: req, gs: nil, err: err}
	continue
}

Where we drop the allocated gameserver from the cache, and it may take a while to come back (but 5s seems way too long).

I wonder what would happen if we took an Allocated GameServer and put the allocated state back into the cache after we got it 🤔

Edit:

Something to try is here:

gs, err := c.applyAllocationToGameServer(ctx, res.request.gsa.Spec.MetaPatch, res.gs, res.request.gsa)

Putting it back into allocationCache on successful allocation or re-allocation would, I think, make a lot of sense.
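Very roughly, something like this (just a sketch, not the actual Agones implementation; AddGameServer here is a hypothetical cache method):

// Sketch only: put a successfully (re)allocated GameServer straight back into
// the allocation cache, so the next batch can keep packing onto it while it
// still has Counter capacity, instead of waiting for the cache to repopulate.
gs, err := c.applyAllocationToGameServer(ctx, res.request.gsa.Spec.MetaPatch, res.gs, res.request.gsa)
if err == nil {
	// hypothetical upsert; the real cache API may differ
	c.allocationCache.AddGameServer(gs)
}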

@markmandel
Member

Another thought on a potential workaround (or at least something to make things better): try a shorter batch time on allocation, as that will refresh the list of potential Allocated gameservers that the allocator looks at more often (at the cost of higher CPU usage).
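For anyone trying this, the batch window is exposed through Helm values, something like the snippet below (I believe the key is allocationBatchWaitTime with a 500ms default, but double-check against the chart for your Agones version):

agones:
  allocator:
    # shorter window = fresher view of Allocated game servers, at the cost of more CPU
    allocationBatchWaitTime: 100ms
  controller:
    allocationBatchWaitTime: 100ms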

igooch self-assigned this Nov 19, 2024
igooch added a commit to igooch/agones that referenced this issue Dec 5, 2024
@igooch
Collaborator

igooch commented Dec 6, 2024

@dmorgan-blizz I was able to reproduce this issue when using more than one allocator client. When using only one allocator client, the game servers were packed as expected. Can you confirm that this is what you're seeing as well?

[screenshot: multiple_allocator_clients]

(Details of the test metric, fleet, and allocator client request are in #4059.)

igooch added a commit to igooch/agones that referenced this issue Dec 6, 2024
@dmorgan-blizz
Author

Interesting... and one allocation client means one execution of tmp/allocate.sh (https://github.com/googleforgames/agones/pull/4059/files#diff-2863a956029e7b0d4dbada6deccdef2e148548b5c545df26950c05e93ee347f1), which is making serial requests?

And if you introduce another execution (thus, now concurrent requests), the issue reproduces?

Many concurrent requests are definitely our use case, but it will be easier to set up a test with a single server and a semaphore forcing serial requests, so I'll try that. I will say that is not an actual solution for us, though, as we need the throughput of concurrent requests 😅 but I can at least help confirm whether I see the same thing!

@igooch
Collaborator

igooch commented Dec 6, 2024

Interesting... and one allocation client means one execution of tmp/allocate.sh (https://github.com/googleforgames/agones/pull/4059/files#diff-2863a956029e7b0d4dbada6deccdef2e148548b5c545df26950c05e93ee347f1), which is making serial requests?

And if you introduce another execution (thus, now concurrent requests), the issue reproduces?

Exactly

Many concurrent requests are definitely our use case, but it will be easier to set up a test with a single server and a semaphore forcing serial requests, so I'll try that. I will say that is not an actual solution for us, though, as we need the throughput of concurrent requests 😅 but I can at least help confirm whether I see the same thing!

Totally understand that having one client is not a scalable model! At this point we're trying to pinpoint where the issue is happening, so we can come up with a practical fix.

@dmorgan-blizz
Author

dmorgan-blizz commented Dec 7, 2024

My results mostly match up, though there were a few oddities.

Overall, it seems that yes, when only allowing our server to make one AllocationRequest at a time, the packing is better (note this is still with 3 agones-allocator backend services).

I had to work through some initial issues (pre-17:40) where I was requesting at a rate faster than Agones could handle serially (average allocation latency was ~500ms), so our server eventually got backed up with deadlined requests and I had to clear them out.

However, after that, in one of our game sites, allocation went as expected: when the rate was low enough that Agones could keep up (~1.5/s), the first game server filled up (green), and then a second one was allocated (yellow). Then later a third (blue) as the first one shut down, etc.

[screenshot]

In another site, though, when I restarted my test, Agones allocated another server (green line), despite the original (blue line) still being available with capacity. You can see it preferred green until around 17:50 when it filled up, and then blue started getting players again (note: the dip around 18:00 is when I stopped my testing too early and restarted).

[screenshot]

Another thing that was different was that around 18:10, I introduced a second server making serial allocation requests and roughly doubled the rate of allocation requests being made to the servers, and I didn't notice the bad behavior. I'm not sure if there were other factors at play here, e.g. the still-low rate (~3/s or less), but that was surprising.

In case it helps, here are the allocation rates matching the times above.

[screenshot]

@igooch
Collaborator

igooch commented Dec 19, 2024

To check whether this is a Counter or an Allocator issue, I added a new metric in #4059 to track which game server was allocated, and removed the Counter. The graphs below are from the same test run; the second one is just grouped by game server name so it's a bit easier to see. The test run used 5 allocator clients, with each client making 100s of serial requests. Each line on the first graph represents a gRPC client connection.

[screenshots: five_allocator_clients_with_many_allocations_each, 5_allocator_client_group_by_gs]

Even without the Counter, the first Allocated game server is not always returned. So it looks like the issue is somewhere in the allocator logic (most likely the caching of game servers) rather than explicitly a Counters issue.

We might be able to create some custom logic for Counters / Lists to change the way our allocation cache works so that it only removes at-capacity game servers (perhaps guarding with some kind of flag as well). We'll need to be careful that all the allocator pods have a similar view of the world, so that we don't end up with 2+ allocator pods trying and failing to allocate to a game server after it's already been filled to capacity by a different allocator pod.
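To make that concrete, a very rough sketch of the "only evict at-capacity game servers" idea (illustrative only, not real Agones code; the helper name and the way the Counter key is chosen are assumptions):

// Sketch only: keep an allocated GameServer in the allocation cache while the
// relevant Counter still has room, and only evict once it is actually full.
// agonesv1 is "agones.dev/agones/pkg/apis/agones/v1".
func shouldEvictFromCache(gs *agonesv1.GameServer, counterKey string) bool {
	counter, ok := gs.Status.Counters[counterKey]
	if ok && counter.Count < counter.Capacity {
		// Still has capacity: leave it cached so later batches (and, ideally,
		// other allocator pods) can keep packing onto it.
		return false
	}
	// No such Counter, or it is full: evict as the allocator does today.
	return true
}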
