Server is crashing due to memory swapping #1
On my PC, with the current interval (100 milliseconds): (screenshot)
With 1000 milliseconds: (screenshot)
With 2000 milliseconds: (screenshot)
With 3000 milliseconds: (screenshot)
With 4000 milliseconds: (screenshot)
With 5000 milliseconds: (screenshot)
We are having problems with the live demo server (torrust/torrust-demo#1) due to high CPU and memory usage.
b2c2ce7 feat: increase the tracker stats importer exec interval (Jose Celano)
Pull request description: We are having problems with the live demo server (see torrust/torrust-demo#1) due to high CPU and memory usage. This increases the tracker stats importer execution interval from 100 milliseconds to 2000 milliseconds.
ACKs for top commit: josecelano: ACK b2c2ce7
Tree-SHA512: ea5a23e4250378c2cb5df6f3d5d81e989dfb1e8f3490e4e42acc70bb77b972a599bbdc0051738a45648aea37d38fe7703a6ca65177bb68b56f21d2b056cdfe19
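To illustrate what that interval controls, here is a minimal sketch of a periodic stats-importer job with a configurable execution interval. This is not the actual torrust-index code; the function names are hypothetical and it assumes the tokio runtime. With a 100 ms interval the job wakes up ten times per second, which is the kind of constant background load a small droplet struggles with.

```rust
// Illustrative sketch only (hypothetical names, assuming tokio):
// a periodic stats-importer job with a configurable interval.
use std::time::Duration;

async fn run_stats_importer(exec_interval_ms: u64) {
    let mut ticker = tokio::time::interval(Duration::from_millis(exec_interval_ms));
    loop {
        ticker.tick().await; // first tick fires immediately, then every interval
        import_tracker_stats().await; // hypothetical import step
    }
}

async fn import_tracker_stats() {
    // Placeholder: in the real service this would query the tracker API
    // and update the index database.
}

#[tokio::main]
async fn main() {
    // 2000 ms instead of 100 ms, mirroring the change in commit b2c2ce7.
    run_stats_importer(2000).await;
}
```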
It looks like the problem is memory consumption. The memory consumption increases until the server starts swapping. I think the problem could be:
The good thing is that the service restarts with the docker healthcheck after 2 hours. Looking at the docker containers, only the tracker was restarted, so I guess the problem is with the tracker (option 2 or 3):
First, I'm going to try switching back to the previous repository implementation based on BTreeMap (there could be some deadlock with the SkipMap; see the sketch below). If we still have the problem, we need to limit requests and/or memory consumption. On the other hand, I think we should improve the container healthchecks: we should not wait two hours to restart the container.
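For context, the two repository flavours look roughly like this. This is a minimal sketch assuming the crossbeam-skiplist crate, not the actual torrust-tracker code: the BTreeMap version serialises writers behind a single RwLock, while the SkipMap version is lock-free but its entries are immutable once inserted.

```rust
// Sketch of the two repository flavours (illustrative only).
use std::collections::BTreeMap;
use std::sync::RwLock;

use crossbeam_skiplist::SkipMap;

type InfoHash = [u8; 20];

// Option A: one big lock around an ordered map.
struct BTreeRepository {
    torrents: RwLock<BTreeMap<InfoHash, u64>>, // value: peer count, for example
}

impl BTreeRepository {
    fn upsert(&self, info_hash: InfoHash, peers: u64) {
        let mut map = self.torrents.write().unwrap(); // single writer at a time
        map.insert(info_hash, peers);
    }
}

// Option B: lock-free ordered map.
struct SkipMapRepository {
    torrents: SkipMap<InfoHash, u64>,
}

impl SkipMapRepository {
    fn upsert(&self, info_hash: InfoHash, peers: u64) {
        // No external lock: concurrent inserts and reads do not block each
        // other, but an entry cannot be mutated in place after insertion.
        self.torrents.insert(info_hash, peers);
    }
}

fn main() {
    let repo = SkipMapRepository { torrents: SkipMap::new() };
    repo.upsert([0u8; 20], 42);
    println!("torrents: {}", repo.torrents.len());
}
```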
I've opened another issue to discover why there are so many zombie processes. I think there is an easy way to know whether the healthchecks cause the crash on the server or not: I can run a private server with the same resources without requests. Another thing I could check is the number of requests the tracker is handling. Maybe the number of requests per second has increased a lot, and the problem is that the server can't handle those requests. In that case, the only solution would be:
5 minutes after restarting the tracker container, I checked the stats:
The tracker is handling 234.96 req/sec. I think, in this case, the server could simply be busy, and we need to implement one of the solutions above (1 to 3). cc @da2ce7
UPDATE: Those are only the tracker client requests. We also have the tracker API, the Index, the stats importer, ...
I've been thinking again about why this is happening now. Regarding the load level, it would be nice to have historical statistics. I would like to know if we have more requests than one or two weeks ago. That would help in understanding whether the problem is simply that we have more requests and we need to resize the server. I'm going to resize the droplet from 1GB of memory ($6) to 2GB ($12) to see what happens:
The server was restarted at approximately 15:45 (it is now 17:43). After restarting the server, the tracker has handled:
It used to crash at 25 req/sec with the previous instance size. Stats after two hours running the tracker:
{
"torrents": 196455,
"seeders": 120032,
"completed": 1241,
"leechers": 200406,
"tcp4_connections_handled": 0,
"tcp4_announces_handled": 0,
"tcp4_scrapes_handled": 0,
"tcp6_connections_handled": 0,
"tcp6_announces_handled": 0,
"tcp6_scrapes_handled": 0,
"udp4_connections_handled": 1022134,
"udp4_announces_handled": 1864182,
"udp4_scrapes_handled": 63645,
"udp6_connections_handled": 0,
"udp6_announces_handled": 0,
"udp6_scrapes_handled": 0
}
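As a rough cross-check of the request rate (an estimate only, assuming these counters were reset when the tracker container restarted about two hours earlier), the UDP client traffic alone works out to roughly 410 req/sec:

```rust
// Rough request-rate estimate from the UDP counters above, assuming they
// started at zero when the container restarted about two hours earlier.
fn main() {
    let connections: u64 = 1_022_134;
    let announces: u64 = 1_864_182;
    let scrapes: u64 = 63_645;
    let elapsed_secs: u64 = 2 * 60 * 60; // ~2 hours of uptime

    let total = connections + announces + scrapes;
    // ~2,949,961 requests / 7200 s ≈ 410 req/sec of UDP client traffic alone.
    println!("{:.2} req/sec", total as f64 / elapsed_secs as f64);
}
```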
22:26:
{
"torrents": 344237,
"seeders": 256095,
"completed": 4079,
"leechers": 420604,
"tcp4_connections_handled": 0,
"tcp4_announces_handled": 0,
"tcp4_scrapes_handled": 0,
"tcp6_connections_handled": 0,
"tcp6_announces_handled": 0,
"tcp6_scrapes_handled": 0,
"udp4_connections_handled": 3313661,
"udp4_announces_handled": 7467544,
"udp4_scrapes_handled": 236375,
"udp6_connections_handled": 0,
"udp6_announces_handled": 0,
"udp6_scrapes_handled": 0
}
08:08. The tracker has been restarted again.
24-hour graph: (screenshot)
Last-hour graph: (screenshot)
I couldn't see how many requests it was handling before it restarted, and I want to know how many req/sec this instance can handle. It would be nice to collect statistics every 5 minutes. Maybe I can write a simple script to import them and update a CSV file until we implement something like this (see the sketch below); I can run it on the server.
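Something like the following could serve as that script. This is a sketch only: it assumes the reqwest (blocking) and serde_json crates, and the stats URL and token are placeholders rather than the real demo-server values.

```rust
// Sketch: poll the tracker stats endpoint every 5 minutes and append a CSV row.
// Assumes the reqwest ("blocking" feature) and serde_json crates.
use std::fs::OpenOptions;
use std::io::Write;
use std::thread::sleep;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder URL and token, not the real demo-server values.
    let url = "http://127.0.0.1:1212/api/v1/stats?token=MyAccessToken";
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .open("tracker-stats.csv")?;

    loop {
        let body = reqwest::blocking::get(url)?.text()?;
        let stats: serde_json::Value = serde_json::from_str(&body)?;

        let now = SystemTime::now().duration_since(UNIX_EPOCH)?.as_secs();
        let torrents = stats["torrents"].as_u64().unwrap_or(0);
        let connections = stats["udp4_connections_handled"].as_u64().unwrap_or(0);
        let announces = stats["udp4_announces_handled"].as_u64().unwrap_or(0);

        // One row per sample; req/sec can be derived later from the deltas
        // between consecutive rows.
        writeln!(file, "{now},{torrents},{connections},{announces}")?;

        sleep(Duration::from_secs(300)); // every 5 minutes
    }
}
```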
The tracker was restarted after running from 8:08 to 1:25. Stats at 22:30:
{
"torrents": 521222,
"seeders": 497614,
"completed": 8410,
"leechers": 892474,
"tcp4_connections_handled": 0,
"tcp4_announces_handled": 0,
"tcp4_scrapes_handled": 0,
"tcp6_connections_handled": 0,
"tcp6_announces_handled": 0,
"tcp6_scrapes_handled": 0,
"udp4_connections_handled": 7924576,
"udp4_announces_handled": 17939819,
"udp4_scrapes_handled": 497642,
"udp6_connections_handled": 0,
"udp6_announces_handled": 0,
"udp6_scrapes_handled": 0
}
I think I can confirm the problem is that the number of peers and torrents increases, consuming more memory until the server starts swapping too much and the container is restarted due to the healthcheck. I will open a discussion on the Tracker repo. We could limit the resource consumption by limiting the requests or the memory (for example, something like the token-bucket sketch below), but that would lead to worse responses. In a production environment, I guess we should only monitor the resources and scale up. Maybe we could try a mixed solution:
In order to monitor this better, it would be nice to have more statistics and to show some graphs.
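For the "limit the requests" option, a global token bucket in front of the request handler is one possible shape. This is an illustrative sketch only, not an existing torrust-tracker feature; the rate and capacity numbers are arbitrary.

```rust
// Sketch of request limiting with a token bucket (illustrative only).
use std::sync::Mutex;
use std::time::Instant;

struct TokenBucket {
    capacity: f64,
    tokens: f64,
    rate: f64, // tokens added per second
    last_refill: Instant,
}

impl TokenBucket {
    fn new(rate: f64, capacity: f64) -> Self {
        Self { capacity, tokens: capacity, rate, last_refill: Instant::now() }
    }

    /// Returns true if the request may proceed, false if it should be dropped.
    fn try_acquire(&mut self) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.rate).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    // Allow roughly 500 req/sec with bursts of up to 1000 (arbitrary numbers).
    let limiter = Mutex::new(TokenBucket::new(500.0, 1000.0));
    let allowed = limiter.lock().unwrap().try_acquire();
    println!("request allowed: {allowed}");
}
```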
On the 20th of April, the server was resized.
{
"torrents": 1236779,
"seeders": 1460674,
"completed": 26980,
"leechers": 2691284,
"tcp4_connections_handled": 0,
"tcp4_announces_handled": 0,
"tcp4_scrapes_handled": 0,
"tcp6_connections_handled": 0,
"tcp6_announces_handled": 0,
"tcp6_scrapes_handled": 0,
"udp4_connections_handled": 24167291,
"udp4_announces_handled": 58160995,
"udp4_scrapes_handled": 1485013,
"udp6_connections_handled": 0,
"udp6_announces_handled": 0,
"udp6_scrapes_handled": 0
}
501.75 req/sec
We are having problems with the live demo server (torrust/torrust-demo#1) due to high CPU and memory usage.
The droplet has started crashing, probably after the last update.
Just after rebooting: (screenshot)
After a while: (screenshot)
It looks like the Index is using a lot of CPU.
I don't know the reason yet. I think this recent PR could be causing it:
torrust/torrust-index#530
Maybe the execution interval (100 milliseconds) is too short for this machine.
cc @da2ce7