Why is dev.library.kiwix.org regularly extremely slow? #194
It's not (entirely) a hardware issue. It's kiwix-serve crashing frequently. We chose to expose it directly to be aware of those things, but it seems that the higher number of test ZIMs has increased the instability. I'd suggest we assess the need for each ZIM there and remove, or move elsewhere, those that don't need to be there (anymore). We already have a ticket on libkiwix about those crashes. If we want to rely on the dev library, kiwix-serve should not be exposed.
I don't think it is a hardware limitation either: library.kiwix.org is running on the same machine and is not experiencing much slowdown. The dev library is currently serving 897 ZIMs, which should not be a concern (at the very least, I would expect kiwix-serve to be able to handle this number of ZIMs when run anywhere in the wild). So while we all agree we could probably prune most ZIMs present in this dev library, I consider this is not the right approach yet. The current situation is rather a good opportunity to learn what is going wrong.

This is the memory consumption of kiwix-serve for the dev library (timezone is UTC). As you can see, it restarts many times per day. Some of these restarts (e.g. at 4am UTC this morning) are linked to a rolling update due to a new image being available (we use the nightly build, which is obviously rebuilt quite often), hence the short periods of doubled RAM usage (Kubernetes starts the new, updated container before stopping the old one). What is interesting to notice is that it seems to restart every time we get close to 1 GB of RAM, which is the memory limit we have assigned to this container in Kubernetes. It does not look like an OOM kill, however: I cannot find the usual logs stating this event. This is nevertheless a very significant difference from prod, which has no memory limit at all.

As an experiment, I've increased the memory limit to 1.5 GB, so that we can confirm whether there is a correlation between the memory consumption/limit and the service restarts.

Another aspect to keep in mind, as already stated in #147, is that we have a number of levers at our disposal to customize kiwix-serve's behavior and control its memory consumption. None of them have been customized for dev.library. I continue to consider that small experiments on these values would greatly help us understand and properly tune kiwix-serve's behavior.
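For reference, this kind of limit lives in the container's `resources` block of the Kubernetes deployment. The snippet below is only an illustrative sketch of the change described above; the deployment, container and image names, as well as the request value, are hypothetical and not taken from the actual dev.library manifest:

```yaml
# Illustrative sketch only: names and values are hypothetical,
# not the real dev.library manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dev-library-kiwix-serve        # hypothetical name
spec:
  template:
    spec:
      containers:
        - name: kiwix-serve
          image: kiwix/kiwix-tools:nightly   # hypothetical tag; dev uses the nightly build
          resources:
            requests:
              memory: "512Mi"          # hypothetical request
            limits:
              memory: "1536Mi"         # raised from 1Gi to 1.5Gi for the experiment
```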
I've pushed the memory graph to a dashboard dedicated to the dev library. I hope we will add more metrics to this dashboard over time. https://kiwixorg.grafana.net/d/fdlyk9cwqr8xsb/dev-library?orgId=1
My suggestion is linked to kiwix/libkiwix#760. We believe that some incorrect (how?) ZIMs trigger crashes. Since we did not remove ZIMs from there, the culprits of that time are probably still present. We want to investigate those crashes, but it's unrealistic: the library is huge, we have no idea which ZIMs cause issues, and the kiwix-serve logs are unusable because of their formatting, because there is multi-user traffic at all times, and because it doesn't log properly when this happens. The periodic restarts might be RAM-related; we'll see if the graph repeats, but around 1.5 GB 👍
Increasing the available memory might then help with the restarts but make the situation worse regarding crashes ^^ If we confirm we still have crashes, I would suggest simply trashing most of the dev library in a one-shot manual action:
I really agree with this. What approach could we take to better identify reproduction steps for the crash scenarios? To me, there is a good chance we have a problem around Kiwix Server management; see also kiwix/libkiwix#1025
The conclusion of the experiment seems quite clear: when we add more RAM, the DEV server restarts far less often. I have increased the allocated RAM even further, to 2.5 GB, which seems to be sufficient for 24h of activity (the DEV server always restarts at 4am UTC to apply the nightly build). I'm not saying this is the proper long-term solution, but it might allow us to confirm whether we still suffer from crashes, and when.
I don't know
Fortunately, there's a lot of RAM to spare on the storage server.
Yep, and I'm quite sure I will soon start experimenting with kiwix-serve environment variables to bring this RAM usage down to a much more sustainable level 🤓
Yes, as discussed separately; it's really important that those switches are properly documented so we can also leverage them on the hotspot.
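To make the idea concrete, such switches would typically reach kiwix-serve as environment variables in the container spec. The snippet below is purely a placeholder sketch: `SOME_KIWIX_TUNING_VARIABLE` is not a real kiwix-serve or libkiwix setting, and the value is invented; it only shows where a documented switch would be configured once its name and a sensible value are known.

```yaml
# Placeholder sketch: SOME_KIWIX_TUNING_VARIABLE is NOT a real setting;
# it only marks where a documented switch would go in the deployment spec.
containers:
  - name: kiwix-serve
    env:
      - name: SOME_KIWIX_TUNING_VARIABLE   # hypothetical; replace with the documented variable
        value: "128"                       # hypothetical value
```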
See also #170
As discussed yesterday, we all consider it is now time to move to a plan B, but this plan is still unclear. From my perspective, the experience with dev.library.kiwix.org is way better than before the RAM increase, but it is still not satisfactory, i.e. there are still some slowdowns. After some thought, I wonder whether these slowdowns are not simply linked to IO issues on the disk. On production, such issues could be hidden by the Varnish cache, which is expected to be especially efficient on the catalog and hence would not trigger problems in the 8am UTC tests. How easy would it be to implement a Varnish cache in front of dev.library.kiwix.org as well? It looks pretty straightforward to me, and while it is clearly a flight forward, it would help confirm that the problem is most probably IO-related.
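To give an idea of the scale of the change, putting Varnish in front essentially means running a Varnish instance whose backend is the kiwix-serve service. The ConfigMap below is only an illustrative sketch, not our actual setup; the ConfigMap name, Service name and port are hypothetical, and a real configuration (like the production one) would certainly be more elaborate:

```yaml
# Illustrative sketch only: a minimal Varnish VCL pointing at kiwix-serve,
# stored in a ConfigMap. All names and the port are hypothetical.
apiVersion: v1
kind: ConfigMap
metadata:
  name: dev-library-varnish-vcl           # hypothetical name
data:
  default.vcl: |
    vcl 4.1;
    backend default {
      .host = "dev-library-kiwix-serve";  # hypothetical Service name for kiwix-serve
      .port = "80";                       # hypothetical port
    }
```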
Absolutely not. It would hide everything but the first request to a resource. It would be a good measure to improve the service for users, but it will not help (on the contrary) with finding the actual cause(s) behind this. I am still awaiting an update clarifying the role of dev.library. We had a lot of discussions about this when we started it, but it seems to have shifted. Currently this is an internal testing tool:

I understand we are now sending links to users/clients on dev.library. That's the role of a staging library. Do we want prod/staging/dev? Just prod/staging?
Yes, and we can see that changing its scope might quickly become challenging. Therefore I have opened a dedicated issue to think about our requirements out of the box. See #199
Is there still a problem or shall we close this?
Workflows haven't failed since we changed the hardware.
I guess the hardware is at its limit, but which services are mostly responsible for that?