API requests are slow at times (>1s) #85
Comments
Yup, I can confirm this. My immediate suspicion would be that it's related to GC pauses, but multi-second pauses for ~10 bots seem completely unreasonable. The memory usage also seems unreasonably stable. Next up I started suspecting that some periodic job was keeping some lock, or blocking too much of our thread pool. The periodic jobs that we run (AFAIK) are:
However, none of these should take any significant amount of time (and they don't when I try calling them locally). It also seems unlikely that the client-directed jobs would line up well enough to cause significant issues. Next up I suspected that a Slowloris-style attack was going on (even if that seems less likely with all the jitter going on). However, all connections go through Nginx, which should (mostly) mitigate this. All in all, I'm sorry to say that I have no idea what's going on. @joakim-olesen Do you know how the thread pool setup looks? Are there any good tools for request-level profiling? I could probably make Nginx spit out those numbers, but it would be nice not to include the time spent waiting in ASP.NET's request queue.
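If it helps, and assuming the service is ASP.NET Core, a small piece of timing middleware should give per-request numbers measured only after the request has left the queue. This is just a sketch; the 500 ms threshold and the Console logging are placeholders:

```csharp
// Minimal sketch, assuming ASP.NET Core. Because middleware only runs after a
// request has been dequeued, the measured time excludes queue wait.
using System;
using System.Diagnostics;
using Microsoft.AspNetCore.Builder;

public static class RequestTimingExtensions
{
    public static IApplicationBuilder UseRequestTiming(this IApplicationBuilder app)
    {
        return app.Use(async (context, next) =>
        {
            var stopwatch = Stopwatch.StartNew();
            await next();
            stopwatch.Stop();

            // Placeholder threshold; log only the slow requests.
            if (stopwatch.ElapsedMilliseconds > 500)
            {
                Console.WriteLine(
                    $"{context.Request.Method} {context.Request.Path} " +
                    $"took {stopwatch.ElapsedMilliseconds} ms");
            }
        });
    }
}
```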
Okay... weird. Suddenly the server became completely unable to connect to Mongo, eventually just timing out. It doesn't even appear in the Mongo logs anymore. I tried restarting the containers. Nope. I tried rebuilding the containers. No dice. I tried rebooting the server. Still nothing. I'll continue looking into it tomorrow.
Aaand, we're back! (though the jitter persists...) For the record, the issue was https://jira.mongodb.org/browse/CSHARP-1895. The driver does manage to connect, but the task that is trying to select a server gets starved by the flood of incoming connections (which are all busy waiting for the connection to finish connecting). To get it working, we need to start the server while there is a light load, and then allow external clients to start hammering it (again). I wonder if the jitter issues are just a manifestation of the same bug...
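As a stopgap until the driver fix lands, one option might be to force server selection to complete during startup, before external traffic is let in. A rough sketch, assuming the MongoDB .NET driver; the names here are illustrative, not the project's:

```csharp
// Rough sketch: ping MongoDB at startup so server selection completes under
// light load, before external clients start hammering the API.
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public static class MongoWarmup
{
    public static async Task WarmupAsync(string connectionString, string databaseName)
    {
        var client = new MongoClient(connectionString);
        var database = client.GetDatabase(databaseName);

        // A ping forces the driver to select a server and open a connection.
        await database.RunCommandAsync<BsonDocument>(new BsonDocument("ping", 1));
    }
}
```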
After fighting .NET for a while I managed to get the following list of expensive API calls:
[collapsed: List of most expensive API calls]
This makes [...]
[collapsed: List of most expensive API calls (after init is done)]
This suggests that BoardsController::Post is probably a useful starting point for decoding the flame graph. Sadly, most of this overhead seems to be in the BSON codec, which we can't really affect much. There are long-term plans to move most of the board state in-memory (which should cut away most of the DB communication), but we don't really want to make such major changes while the competition is still going on.
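For future reference, the rough shape of that change could look something like the sketch below. Everything here (Board, the flush interval, the store itself) is made up for illustration and is not how the project is actually structured:

```csharp
// Purely illustrative sketch: keep board state in memory and persist it
// periodically, so BoardsController::Post no longer pays the BSON round trip
// on every move.
using System;
using System.Collections.Concurrent;
using System.Threading;

public class InMemoryBoardStore
{
    private readonly ConcurrentDictionary<string, Board> _boards =
        new ConcurrentDictionary<string, Board>();
    private readonly Timer _flushTimer;

    public InMemoryBoardStore()
    {
        // Flush boards to MongoDB in the background instead of on every move.
        _flushTimer = new Timer(_ => FlushToDatabase(), null,
            TimeSpan.FromSeconds(5), TimeSpan.FromSeconds(5));
    }

    // Reads and writes during a game never touch the database or the BSON codec.
    public Board Get(string boardId) => _boards[boardId];

    public void Update(Board board) => _boards[board.Id] = board;

    private void FlushToDatabase()
    {
        // Write the current snapshots back to MongoDB here (driver calls omitted).
    }
}

public class Board
{
    public string Id { get; set; }
    // ... remaining board state
}
```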
We made all method calls to MongoDB asynchronous (except for getting the high-score list), but it didn't fix the problem. Our changes are pushed. The issue even exists on one of our local computers with an NVMe SSD and only two bots playing. We can try storing the high-score list in memory so we don't have to sort the list in each call to MongoDB: when the app starts, we fetch the high-score list from MongoDB and sort it in memory, and every time we update the high-score list we sort it again (or use a SortedList). That way the GET operation won't be so slow. We also noticed that the RAM usage increased while a bot was playing, and when the bot stopped playing the RAM usage didn't decrease. We're unsure if this is an issue or not.
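A minimal sketch of that high-score cache, with a made-up HighScore type and field names just to illustrate the idea:

```csharp
// Minimal sketch of an in-memory high-score cache. HighScore and its fields
// are placeholders, not the project's actual documents.
using System.Collections.Generic;
using System.Linq;

public class HighScoreCache
{
    private readonly object _lock = new object();
    private List<HighScore> _scores = new List<HighScore>();

    // Called once at startup with the list fetched from MongoDB.
    public void Initialize(IEnumerable<HighScore> fromDatabase)
    {
        lock (_lock)
        {
            _scores = fromDatabase.OrderByDescending(s => s.Score).ToList();
        }
    }

    // Called whenever a score is written to MongoDB; re-sorts the cached copy.
    public void Update(HighScore score)
    {
        lock (_lock)
        {
            _scores.RemoveAll(s => s.Name == score.Name);
            _scores.Add(score);
            _scores = _scores.OrderByDescending(s => s.Score).ToList();
        }
    }

    // GET reads from memory instead of sorting in MongoDB.
    public IReadOnlyList<HighScore> Top(int count)
    {
        lock (_lock)
        {
            return _scores.Take(count).ToList();
        }
    }
}

public class HighScore
{
    public string Name { get; set; }
    public int Score { get; set; }
}
```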
That's interesting. I was unable to reproduce the issue locally, even with a double-digit number of bots. How did the (relative) RAM utilization look?
Probably not. Unless it continued to increase, I'd chalk this up to 1) lazy initialization tasks (connecting to the database, code caching, etc.) that then stay resident, as well as 2) the runtime keeping an internal memory pool rather than releasing it back to the system.
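If you want to double-check, comparing the managed heap against the process working set should show whether that memory is live or just held by the runtime. A small sketch, not tied to the project:

```csharp
// If the managed heap is much smaller than the working set, the memory is
// being held by the runtime's pools rather than by live objects.
using System;
using System.Diagnostics;

public static class MemoryReport
{
    public static void Print()
    {
        long managedBytes = GC.GetTotalMemory(forceFullCollection: true);
        long workingSetBytes = Process.GetCurrentProcess().WorkingSet64;

        Console.WriteLine($"Managed heap: {managedBytes / (1024 * 1024)} MB");
        Console.WriteLine($"Working set:  {workingSetBytes / (1024 * 1024)} MB");
    }
}
```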
Not sure if this is related to Etimo/etimo-slack-scraper#1, but requests to the API quite often take a lot of time (it goes from ~45ms to 1s, 3s or 7s). I've noticed this for the entire day, and it prevents anyone from reaching a score above 190 right now.
I notice the behaviour not only when running the bot from the home network, but also when watching the board (which occasionally freezes for a while) and when `curl`-ing the API from a completely different location. It seems to occur about every 10 seconds. It might be that it only happens when there are a lot of bots online; I don't know.
Currently the board has 10 bots constantly moving around, so maybe that plays a part. Everyone is running their bot during the night, so I don't see the situation getting any better during the nights either, if the bot load plays a part in the issue.