Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API requests are slow at times (>1s) #85

Open
alexanderhenne opened this issue Apr 8, 2019 · 6 comments
Open

API requests are slow at times (>1s) #85

alexanderhenne opened this issue Apr 8, 2019 · 6 comments

Comments

@alexanderhenne
Copy link
Contributor

alexanderhenne commented Apr 8, 2019

Not sure if this is related to Etimo/etimo-slack-scraper#1, but requests to the API are taking a lot of time quite often (it goes from ~45ms to 1s, 3s or 7s). I've noticed this the entire day, and it prevents anyone from reaching >190 score right now.

The behaviour is noticed not only when running the bot from the home network, but also when watching the board, where it occasionally freezes for a while, and also when curling the API from a completely different location.

It seems to occur about every 10 seconds. It might be that it only happens when there are a lot of bots online, I don't know.

Currently, the board has 10 bots constantly moving around, maybe that plays a part. Everyone is running their bot during the night, so I don't see the situation getting any better during the nights either IF the bot load plays a part in the issue.

@nightkr
Copy link
Contributor

nightkr commented Apr 8, 2019

Yup, I can confirm this. My immediate suspicion would be that it's related to GC pauses, but multi-second pauses for ~10 bots seems completely unreasonable. The memory usage also seems unreasonably stable.

Next up I started suspecting that some periodic job was keeping some lock, or blocking too much of our thread pool. The periodic jobs that we run (AFAIK) are:

  • Fetching the game board (each web client does this every 250ms)
  • Fetching the high score lists (each web client does this every 5s)
  • Clearing expired bots (done every minute by a cronjob)

However, none of these should take any significant amount of time (and they don't, when I try calling them locally). It also seems unreasonable that the client-directed jobs would line up well enough to cause significant issues.

Next up I suspected that a Slowloris-style attack was going on (even if that seems less likely with all the significant jitter going on). However, all connections go through Nginx which should (mostly) mitigate this.

All in all, I'm sorry to say that I have no idea what's going on.

@joakim-olesen Do you know how the threadpool setup looks? Are there any good tools for request-level profiling? I could probably make Nginx spit out those numbers, but it would be nice to not include the time spent waiting in ASP.NET's request queue.

@nightkr
Copy link
Contributor

nightkr commented Apr 8, 2019

Okay.. weird. Suddenly the server became completely unable to connect to Mongo, eventually just timing out. It doesn't even appear in the Mongo logs anymore. I tried restarting the containers. Nope. I tried rebuilding the containers. No dice. I tried rebooting the server. Still nothing. I'll continue looking into it tomorrow.

@nightkr
Copy link
Contributor

nightkr commented Apr 8, 2019

Aaand, we're back! (though the jitter persists...)

For the record, the issue was https://jira.mongodb.org/browse/CSHARP-1895. It does manage to connect, but the task that is trying to select a server gets starved by the flood of incoming connections (which are all busy waiting for the connection to finish connecting). To get it working we need to start it while there is a light load, and then allow external clients to start hammering it (again).

I wonder if the jitter issues are just a manifestation of the same bug...

@nightkr
Copy link
Contributor

nightkr commented Apr 11, 2019

After fighting .NET for a while I managed to get the following list of expensive API calls:

List of most expensive API calls
143.0156ms      Diamonds.Rest.Controllers.BoardsController.Post
143.1672ms      Diamonds.Rest.Controllers.BoardsController.Post
143.3778ms      Diamonds.Rest.Controllers.BoardsController.Post
143.4219ms      Diamonds.Rest.Controllers.BoardsController.Post
143.4345ms      Diamonds.Rest.Controllers.HighscoreController.Get
143.5793ms      Diamonds.Rest.Controllers.BoardsController.Post
143.5855ms      Diamonds.Rest.Controllers.BoardsController.Post
143.9136ms      Diamonds.Rest.Controllers.BoardsController.Post
144.4577ms      Diamonds.Rest.Controllers.BoardsController.Post
145.0713ms      Diamonds.Rest.Controllers.BoardsController.Post
146.2932ms      Diamonds.Rest.Controllers.BoardsController.Post
147.321ms       Diamonds.Rest.Controllers.BoardsController.Post
147.8077ms      Diamonds.Rest.Controllers.BoardsController.Post
147.8295ms      Diamonds.Rest.Controllers.BoardsController.Post
147.9595ms      Diamonds.Rest.Controllers.BoardsController.Post
148.0689ms      Diamonds.Rest.Controllers.BoardsController.Post
148.1413ms      Diamonds.Rest.Controllers.BoardsController.Post
148.159ms       Diamonds.Rest.Controllers.BoardsController.Post
148.4733ms      Diamonds.Rest.Controllers.BoardsController.Post
148.6854ms      Diamonds.Rest.Controllers.BoardsController.Post
148.9223ms      Diamonds.Rest.Controllers.BoardsController.Post
149.1391ms      Diamonds.Rest.Controllers.BoardsController.Post
149.16ms        Diamonds.Rest.Controllers.BoardsController.Post
149.2365ms      Diamonds.Rest.Controllers.BoardsController.Post
149.3884ms      Diamonds.Rest.Controllers.BoardsController.Post
149.3896ms      Diamonds.Rest.Controllers.BoardsController.Post
149.4757ms      Diamonds.Rest.Controllers.BoardsController.Post
149.7526ms      Diamonds.Rest.Controllers.BoardsController.Post
150.3971ms      Diamonds.Rest.Controllers.BoardsController.Post
151.2751ms      Diamonds.Rest.Controllers.BoardsController.Post
151.5122ms      Diamonds.Rest.Controllers.BoardsController.Post
151.9819ms      Diamonds.Rest.Controllers.BoardsController.Post
152.5169ms      Diamonds.Rest.Controllers.BoardsController.Post
153.1764ms      Diamonds.Rest.Controllers.BoardsController.Post
153.2053ms      Diamonds.Rest.Controllers.BoardsController.Post
153.5472ms      Diamonds.Rest.Controllers.BoardsController.Post
154.0087ms      Diamonds.Rest.Controllers.BoardsController.Post
154.6729ms      Diamonds.Rest.Controllers.BoardsController.Post
156.6395ms      Diamonds.Rest.Controllers.BoardsController.Post
157.4294ms      Diamonds.Rest.Controllers.BoardsController.Post
157.5268ms      Diamonds.Rest.Controllers.BoardsController.Post
157.7266ms      Diamonds.Rest.Controllers.BoardsController.Post
158.3464ms      Diamonds.Rest.Controllers.BoardsController.Post
159.4829ms      Diamonds.Rest.Controllers.BoardsController.Post
160.2222ms      Diamonds.Rest.Controllers.BoardsController.Post
160.2463ms      Diamonds.Rest.Controllers.BoardsController.Post
160.3425ms      Diamonds.Rest.Controllers.BoardsController.Post
160.5099ms      Diamonds.Rest.Controllers.BoardsController.Post
160.6267ms      Diamonds.Rest.Controllers.BoardsController.Post
161.8287ms      Diamonds.Rest.Controllers.BoardsController.Post
161.8733ms      Diamonds.Rest.Controllers.BoardsController.Post
162.0224ms      Diamonds.Rest.Controllers.BoardsController.Get
162.9573ms      Diamonds.Rest.Controllers.BoardsController.Post
162.9799ms      Diamonds.Rest.Controllers.BoardsController.Post
163.5156ms      Diamonds.Rest.Controllers.BoardsController.Post
164.2048ms      Diamonds.Rest.Controllers.HighscoreController.Get
164.333ms       Diamonds.Rest.Controllers.BoardsController.Get
164.5685ms      Diamonds.Rest.Controllers.BoardsController.Post
166.2879ms      Diamonds.Rest.Controllers.BoardsController.Get
167.2445ms      Diamonds.Rest.Controllers.BoardsController.Post
167.8147ms      Diamonds.Rest.Controllers.BoardsController.Post
167.9129ms      Diamonds.Rest.Controllers.BoardsController.Post
167.9707ms      Diamonds.Rest.Controllers.BoardsController.Get
168.1273ms      Diamonds.Rest.Controllers.BoardsController.Post
168.1686ms      Diamonds.Rest.Controllers.BoardsController.Post
171.8143ms      Diamonds.Rest.Controllers.BoardsController.Post
173.1774ms      Diamonds.Rest.Controllers.BoardsController.Post
174.0265ms      Diamonds.Rest.Controllers.BoardsController.Post
174.4525ms      Diamonds.Rest.Controllers.BoardsController.Post
176.1967ms      Diamonds.Rest.Controllers.BoardsController.Post
178.7707ms      Diamonds.Rest.Controllers.BoardsController.Post
179.9275ms      Diamonds.Rest.Controllers.BoardsController.Post
181.5058ms      Diamonds.Rest.Controllers.BoardsController.Post
182.1153ms      Diamonds.Rest.Controllers.BoardsController.Post
183.2223ms      Diamonds.Rest.Controllers.BoardsController.Post
183.3646ms      Diamonds.Rest.Controllers.BoardsController.Post
184.2486ms      Diamonds.Rest.Controllers.BoardsController.Post
188.8742ms      Diamonds.Rest.Controllers.BoardsController.Get
196.1252ms      Diamonds.Rest.Controllers.BoardsController.Get
200.6248ms      Diamonds.Rest.Controllers.BoardsController.Post
201.7787ms      Diamonds.Rest.Controllers.BoardsController.Post
204.8389ms      Diamonds.Rest.Controllers.BoardsController.Post
216.5302ms      Diamonds.Rest.Controllers.BoardsController.Post
218.6329ms      Diamonds.Rest.Controllers.BoardsController.Post
232.9728ms      Diamonds.Rest.Controllers.HighscoreController.Get
250.6434ms      Diamonds.Rest.Controllers.BoardsController.Post
262.8438ms      Diamonds.Rest.Controllers.BoardsController.Post
281.0738ms      Diamonds.Rest.Controllers.BoardsController.Post
760.8194ms      Diamonds.Rest.Controllers.BoardsController.Post
766.249ms       Diamonds.Rest.Controllers.BoardsController.Post
774.0006ms      Diamonds.Rest.Controllers.BoardsController.Post
926.962ms       Diamonds.Rest.Controllers.BoardsController.Post
2238.9778ms     Diamonds.Rest.Controllers.BotsController.Get
3240.5162ms     Diamonds.Rest.Controllers.BotsController.Get
3620.6332ms     Diamonds.Rest.Controllers.HighscoreController.Get
3738.0953ms     Diamonds.Rest.Controllers.BotsController.Get
4740.6418ms     Diamonds.Rest.Controllers.BotsController.Get
5736.6806ms     Diamonds.Rest.Controllers.BotsController.Get
6838.827ms      Diamonds.Rest.Controllers.BotsController.Get
6904.2665ms     Diamonds.Rest.Controllers.BoardsController.Post

This makes BotsController look really bad, but this seems to be due to performance issues connecting to the database. After excluding the first 10 calls the list looks like this instead:

List of most expensive API calls (after init is done)
141.3751ms      Diamonds.Rest.Controllers.BoardsController.Post
141.4284ms      Diamonds.Rest.Controllers.BoardsController.Post
141.6339ms      Diamonds.Rest.Controllers.BoardsController.Post
141.727ms       Diamonds.Rest.Controllers.BoardsController.Post
142.5243ms      Diamonds.Rest.Controllers.BoardsController.Post
142.582ms       Diamonds.Rest.Controllers.BoardsController.Post
142.6848ms      Diamonds.Rest.Controllers.BoardsController.Post
142.7893ms      Diamonds.Rest.Controllers.BoardsController.Post
142.8944ms      Diamonds.Rest.Controllers.BoardsController.Post
143.0156ms      Diamonds.Rest.Controllers.BoardsController.Post
143.1672ms      Diamonds.Rest.Controllers.BoardsController.Post
143.3778ms      Diamonds.Rest.Controllers.BoardsController.Post
143.4219ms      Diamonds.Rest.Controllers.BoardsController.Post
143.4345ms      Diamonds.Rest.Controllers.HighscoreController.Get
143.5793ms      Diamonds.Rest.Controllers.BoardsController.Post
143.5855ms      Diamonds.Rest.Controllers.BoardsController.Post
143.9136ms      Diamonds.Rest.Controllers.BoardsController.Post
144.4577ms      Diamonds.Rest.Controllers.BoardsController.Post
145.0713ms      Diamonds.Rest.Controllers.BoardsController.Post
146.2932ms      Diamonds.Rest.Controllers.BoardsController.Post
147.321ms       Diamonds.Rest.Controllers.BoardsController.Post
147.8077ms      Diamonds.Rest.Controllers.BoardsController.Post
147.8295ms      Diamonds.Rest.Controllers.BoardsController.Post
147.9595ms      Diamonds.Rest.Controllers.BoardsController.Post
148.0689ms      Diamonds.Rest.Controllers.BoardsController.Post
148.1413ms      Diamonds.Rest.Controllers.BoardsController.Post
148.159ms       Diamonds.Rest.Controllers.BoardsController.Post
148.4733ms      Diamonds.Rest.Controllers.BoardsController.Post
148.6854ms      Diamonds.Rest.Controllers.BoardsController.Post
148.9223ms      Diamonds.Rest.Controllers.BoardsController.Post
149.1391ms      Diamonds.Rest.Controllers.BoardsController.Post
149.16ms        Diamonds.Rest.Controllers.BoardsController.Post
149.2365ms      Diamonds.Rest.Controllers.BoardsController.Post
149.3884ms      Diamonds.Rest.Controllers.BoardsController.Post
149.3896ms      Diamonds.Rest.Controllers.BoardsController.Post
149.4757ms      Diamonds.Rest.Controllers.BoardsController.Post
149.7526ms      Diamonds.Rest.Controllers.BoardsController.Post
150.3971ms      Diamonds.Rest.Controllers.BoardsController.Post
151.2751ms      Diamonds.Rest.Controllers.BoardsController.Post
151.5122ms      Diamonds.Rest.Controllers.BoardsController.Post
151.9819ms      Diamonds.Rest.Controllers.BoardsController.Post
152.5169ms      Diamonds.Rest.Controllers.BoardsController.Post
153.1764ms      Diamonds.Rest.Controllers.BoardsController.Post
153.2053ms      Diamonds.Rest.Controllers.BoardsController.Post
153.5472ms      Diamonds.Rest.Controllers.BoardsController.Post
154.0087ms      Diamonds.Rest.Controllers.BoardsController.Post
154.6729ms      Diamonds.Rest.Controllers.BoardsController.Post
156.6395ms      Diamonds.Rest.Controllers.BoardsController.Post
157.4294ms      Diamonds.Rest.Controllers.BoardsController.Post
157.5268ms      Diamonds.Rest.Controllers.BoardsController.Post
157.7266ms      Diamonds.Rest.Controllers.BoardsController.Post
158.3464ms      Diamonds.Rest.Controllers.BoardsController.Post
159.4829ms      Diamonds.Rest.Controllers.BoardsController.Post
160.2222ms      Diamonds.Rest.Controllers.BoardsController.Post
160.2463ms      Diamonds.Rest.Controllers.BoardsController.Post
160.3425ms      Diamonds.Rest.Controllers.BoardsController.Post
160.5099ms      Diamonds.Rest.Controllers.BoardsController.Post
160.6267ms      Diamonds.Rest.Controllers.BoardsController.Post
161.8287ms      Diamonds.Rest.Controllers.BoardsController.Post
161.8733ms      Diamonds.Rest.Controllers.BoardsController.Post
162.9573ms      Diamonds.Rest.Controllers.BoardsController.Post
162.9799ms      Diamonds.Rest.Controllers.BoardsController.Post
163.5156ms      Diamonds.Rest.Controllers.BoardsController.Post
164.2048ms      Diamonds.Rest.Controllers.HighscoreController.Get
164.333ms       Diamonds.Rest.Controllers.BoardsController.Get
164.5685ms      Diamonds.Rest.Controllers.BoardsController.Post
167.2445ms      Diamonds.Rest.Controllers.BoardsController.Post
167.8147ms      Diamonds.Rest.Controllers.BoardsController.Post
167.9129ms      Diamonds.Rest.Controllers.BoardsController.Post
167.9707ms      Diamonds.Rest.Controllers.BoardsController.Get
168.1273ms      Diamonds.Rest.Controllers.BoardsController.Post
168.1686ms      Diamonds.Rest.Controllers.BoardsController.Post
171.8143ms      Diamonds.Rest.Controllers.BoardsController.Post
173.1774ms      Diamonds.Rest.Controllers.BoardsController.Post
174.0265ms      Diamonds.Rest.Controllers.BoardsController.Post
174.4525ms      Diamonds.Rest.Controllers.BoardsController.Post
176.1967ms      Diamonds.Rest.Controllers.BoardsController.Post
178.7707ms      Diamonds.Rest.Controllers.BoardsController.Post
179.9275ms      Diamonds.Rest.Controllers.BoardsController.Post
181.5058ms      Diamonds.Rest.Controllers.BoardsController.Post
182.1153ms      Diamonds.Rest.Controllers.BoardsController.Post
183.2223ms      Diamonds.Rest.Controllers.BoardsController.Post
183.3646ms      Diamonds.Rest.Controllers.BoardsController.Post
184.2486ms      Diamonds.Rest.Controllers.BoardsController.Post
188.8742ms      Diamonds.Rest.Controllers.BoardsController.Get
196.1252ms      Diamonds.Rest.Controllers.BoardsController.Get
200.6248ms      Diamonds.Rest.Controllers.BoardsController.Post
201.7787ms      Diamonds.Rest.Controllers.BoardsController.Post
204.8389ms      Diamonds.Rest.Controllers.BoardsController.Post
216.5302ms      Diamonds.Rest.Controllers.BoardsController.Post
218.6329ms      Diamonds.Rest.Controllers.BoardsController.Post
232.9728ms      Diamonds.Rest.Controllers.HighscoreController.Get
250.6434ms      Diamonds.Rest.Controllers.BoardsController.Post
262.8438ms      Diamonds.Rest.Controllers.BoardsController.Post
281.0738ms      Diamonds.Rest.Controllers.BoardsController.Post
760.8194ms      Diamonds.Rest.Controllers.BoardsController.Post
766.249ms       Diamonds.Rest.Controllers.BoardsController.Post
774.0006ms      Diamonds.Rest.Controllers.BoardsController.Post
926.962ms       Diamonds.Rest.Controllers.BoardsController.Post
3620.6332ms     Diamonds.Rest.Controllers.HighscoreController.Get

This suggests that BoardsController::Post is probably a useful starting point for decoding the flame graph. Sadly, most of this overhead seems to be in the BSON codec, which we can't really affect much. There are long term plans to move most of the board state in-memory (which should cut away most of the DB communication), but we don't really want to make such major changes while the competition is still going on.

nightkr added a commit that referenced this issue Apr 11, 2019
@nightkr nightkr added this to the Season 2 Release milestone Apr 26, 2019
@joakim-olesen
Copy link
Contributor

We made all method calls to the mongodb (except from getting the high score list) asynchronous but it didn't fix the problem. Our changes are pushed. The issue even exists on one of our local computers with nvme ssd disk and only two bots playing.

We can try storing the highscore list in-memory so we don't have to sort the list in each call to mongodb. So when the app starts, we fetch the highscore list from mongodb and sort it in memory. Every time we update the highscore list, we sort the list again (or use SortedList). That way the GET operation won't be so slow.

We also noticed that the RAM usage increased while a bot was playing and when the bot stopped playing the RAM usage didn't decrease. We're unsure if this is an issue or not.

@nightkr
Copy link
Contributor

nightkr commented Apr 26, 2019

The issue even exists on one of our local computers with nvme ssd disk and only two bots playing.

That's interesting. I was unable to reproduce the issue locally even in the double digit number of bots. How did the (relative) RAM utilization look?

We also noticed that the RAM usage increased while a bot was playing and when the bot stopped playing the RAM usage didn't decrease. We're unsure if this is an issue or not.

Probably not. Unless it continued to increase then I'd chalk this down to 1) lazy initialization tasks (connecting to the database, code caching, etc) that then stay resident, as well as 2) the runtime keeping an internal memory pool rather than releasing it back to the system.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants