What happened:
Over time, my HTTPS server which hosts the FleetAutoscaler webhook goes OOM. This is caused by thousands of never-dying sockets on the server. This does NOT happen when I call the endpoint with cURL or a browser; it only happens when Agones calls it.
/app $ lsof -p $PID | grep socket
...
1 /app/zeus-rest socket:[289294771]
1 /app/zeus-rest socket:[289294783]
1 /app/zeus-rest socket:[289292336]
1 /app/zeus-rest socket:[289291653]
1 /app/zeus-rest socket:[289291654]
1 /app/zeus-rest socket:[289293769]
1 /app/zeus-rest socket:[289294780]
...
/app $ lsof -p $PID | grep socket | wc -l
6397
/app $ lsof -p $PID | grep socket | wc -l
6403
/app $ lsof -p $PID | grep socket | wc -l
6418
What you expected to happen:
I expect that when the FleetAutoscaler webhook is called by Agones, it either reuses the TLS connection or disconnects it. Keeping a connection alive and then opening a new one for every call leaks sockets.
How to reproduce it (as minimally and precisely as possible):
Create a TLS FleetAutoscaler webhook endpoint with keep-alive turned on and no idle timeout configured on the server, and watch the sockets multiply.
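Something along these lines should be enough to reproduce it. This is only a sketch: the certificate paths, port, and handler path are placeholders, and the actual FleetAutoscaleReview handling is elided.

package main

import (
	"io"
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()
	// Placeholder handler: a real webhook would decode the FleetAutoscaleReview
	// request and return a populated response; here we just drain and reply.
	mux.HandleFunc("/scale", func(w http.ResponseWriter, r *http.Request) {
		io.Copy(io.Discard, r.Body)
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`{}`))
	})

	srv := &http.Server{
		Addr:    ":8443",
		Handler: mux,
		// Keep-alives are enabled by default and no idle timeout is set, which
		// is the configuration that leaks sockets when the caller keeps opening
		// new connections without reusing or closing the old ones.
	}
	// tls.crt / tls.key paths are hypothetical.
	log.Fatal(srv.ListenAndServeTLS("/certs/tls.crt", "/certs/tls.key"))
}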
Anything else we need to know?:
I suspect that this could be fixed by adding the following to pkg/fleetautoscalers/fleetautoscalers.go:
var client = http.Client{
    Timeout: 15 * time.Second,
+++ Transport: &http.Transport{
+++     DisableKeepAlives: true,
+++ },
}
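For what it's worth, DisableKeepAlives makes the client send Connection: close and use each connection for only a single request, so every webhook call tears its connection down instead of leaving it idle on the server; the cost is an extra TLS handshake per autoscaler sync.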
I fixed it by disabling KeepAlive on the server side. But it took me several hours to figure out the problem because I could not reproduce it with any clients of my own.
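In case it helps anyone else, the server-side workaround amounts to something like this (again only a sketch, reusing the placeholder names from the reproduction snippet above; SetKeepAlivesEnabled on http.Server is the relevant knob):

srv := &http.Server{Addr: ":8443", Handler: mux}
// With keep-alives disabled the server closes each connection after writing
// the response, so idle sockets cannot accumulate even if the caller never
// reuses or closes them.
srv.SetKeepAlivesEnabled(false)
log.Fatal(srv.ListenAndServeTLS("/certs/tls.crt", "/certs/tls.key"))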
Environment:
- Agones version: 1.16
- Kubernetes version (use kubectl version): 1.21
- Cloud provider or hardware configuration: EKS and Minikube
- Install method (yaml/helm): helm
- Troubleshooting guide log(s):
- Others: