Why is dev.library.kiwix.org regularly extremely slow? #194

Closed
kelson42 opened this issue May 16, 2024 · 19 comments
Labels
question Further information is requested

Comments

@kelson42 (Contributor)

I guess the hardware is at its limit, but which services are mostly responsible for that?

kelson42 added the question (Further information is requested) label on May 16, 2024
@rgaudin (Member) commented May 16, 2024

It's not (entirely) a hardware issue: it's kiwix-serve crashing frequently. We chose to expose it directly to be aware of those things, but it seems that the higher number of test ZIMs increased the instability.

I'd suggest we assess the need for each ZIM there and remove, or move elsewhere, those that don't need to be there (anymore).
This would greatly simplify any investigation.

We already have a ticket on libkiwix about those crashes.

If we want to rely on the dev library, kiwix-serve should not be exposed directly.

@benoit74 (Collaborator)

I don't think it is a hardware limitation either; library.kiwix.org is running on the same machine and is not experiencing much slowdown.

The dev library is currently serving 897 ZIMs, which should not be a concern (at least I would expect kiwix-serve to be able to handle this number of ZIMs when run anywhere in the wild).

So while we all agree we could probably prune most of the ZIMs present in this dev library, I don't consider this the right approach yet.

The current situation is rather a good opportunity to learn what is going wrong.

This is the memory consumption of kiwix-serve for the dev library (timezone is UTC):

[Graph: kiwix-serve memory consumption for the dev library]

As you can see, it restarts many times per day. Some of these restarts (e.g. at 4am UTC this morning) are linked to a rolling update triggered by a new image being available (we use the nightly build, which is obviously rebuilt quite often), hence the brief doubling of RAM usage (Kubernetes starts the new, updated container before stopping the old one).

What is interesting to notice is that it seems to restart every time we get close to 1G of RAM ... which is the memory limit we've assigned to this container in Kubernetes. It does not look like an OOM kill, however; I cannot find the usual logs stating this event. This is nevertheless a very significant difference from prod, which does not have any memory limit.
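
One way to double-check the OOM-kill hypothesis is to read the last terminated state Kubernetes keeps for each container; a minimal sketch with the Python kubernetes client follows. The namespace and label selector are placeholders, not the actual cluster values.

```python
# Sketch: list recent kiwix-serve container terminations and their reasons.
# An OOM kill shows up as last_state.terminated.reason == "OOMKilled".
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

# Namespace and label selector are assumptions for illustration only.
pods = v1.list_namespaced_pod("zim", label_selector="app=dev-library-kiwix-serve")
for pod in pods.items:
    for cs in pod.status.container_statuses or []:
        term = cs.last_state.terminated
        if term is not None:
            print(f"{pod.metadata.name}/{cs.name}: restarts={cs.restart_count}, "
                  f"last termination reason={term.reason}, exit code={term.exit_code}")
```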

As an experiment, I've increased the memory limit to 1.5G, so that we can confirm whether there is a correlation between the memory consumption/limit and the service restarts.

Another aspect to keep in mind, as already stated in #147, is that we have a number of levers at our disposal to customize kiwix-serve behavior and control its memory consumption. None of them has been customized for dev.library. I continue to consider that running small experiments on these values would greatly help us understand and properly customize kiwix-serve behavior.

@benoit74 (Collaborator)

I've pushed the memory graph to a dashboard dedicated to the dev library. I hope we will update this dashboard with more metrics over time.

https://kiwixorg.grafana.net/d/fdlyk9cwqr8xsb/dev-library?orgId=1

@rgaudin (Member) commented May 17, 2024

> So while we all agree we could probably prune most of the ZIMs present in this dev library, I don't consider this the right approach yet.

My suggestion is linked to kiwix/libkiwix#760. We believe that some incorrect (how?) ZIMs trigger crashes. Since we did not remove ZIMs from there, the culprits from that time are probably still present.
I am curious to know whether removing them would reduce the number of crashes/restarts.

We want to investigate those crashes, but it's unrealistic: the library is huge, we have no idea which ZIMs cause issues, and the kiwix-serve logs are unusable because of their formatting, because there is multi-user traffic at all times, and because it probably doesn't log when this happens.

The periodic restarts might be RAM-related; we'll see if the graph repeats, but around 1.5GB 👍

@benoit74 (Collaborator)

Increasing the available memory might then help with restarts but make the situation worse regarding crashes ^^

If we confirm we still have crashes, I would suggest simply trashing most of the dev library in a one-shot manual action (a rough sketch follows the list):

  • I list the ZIMs present today
  • we pin the few ZIMs which are necessary to keep
  • we move everything else to a quarantine zone for 3 months, in case we realize we forgot to pin some valuable ZIMs
  • 3 months later, we delete the quarantine zone
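
A minimal sketch of the quarantine move above, assuming the ZIMs sit in a flat directory on the storage server; the paths and the pinned list are hypothetical placeholders:

```python
# Sketch: move every non-pinned ZIM to a quarantine directory instead of deleting it,
# so anything we forgot to pin can still be restored during the 3-month grace period.
from pathlib import Path
import shutil

ZIM_DIR = Path("/data/dev-library/zims")                  # assumed ZIM location
QUARANTINE_DIR = Path("/data/dev-library/quarantine-2024-05")
PINNED = {"example_pinned.zim"}                           # the few ZIMs we decide to keep

QUARANTINE_DIR.mkdir(parents=True, exist_ok=True)
for zim in sorted(ZIM_DIR.glob("*.zim")):
    if zim.name in PINNED:
        continue
    shutil.move(str(zim), QUARANTINE_DIR / zim.name)
    print(f"quarantined {zim.name}")
```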

@kelson42 (Contributor, Author) commented May 17, 2024

> The current situation is rather a good opportunity to learn what is going wrong.

I really agree with this. What approach could be taken to better identify reproduction steps for the crash scenarios?

To me, there is a good chance we have a problem around Kiwix Server management; see also kiwix/libkiwix#1025

@benoit74 (Collaborator)

The experiment's conclusion seems quite clear: when we add more RAM, the DEV server restarts far less often.

[Graph: kiwix-serve memory consumption after the RAM increase]

I increased the allocated RAM even further, to 2.5GB, which seems to be sufficient for 24 hours of activity (the DEV server always restarts at 4am UTC to apply the nightly build). I'm not saying this is the proper long-term solution, but it should allow us to confirm whether we still suffer from crashes, and when.

> What approach could be taken to better identify reproduction steps for the crash scenarios?

I don't know

@rgaudin (Member) commented May 21, 2024

Fortunately, there's a lot of RAM to spare on the storage server.

@benoit74 (Collaborator)

> Fortunately, there's a lot of RAM to spare on the storage server.

Yep, and I'm quite sure I will soon start experimenting with kiwix-serve environment variables to reduce this RAM usage to a much more sustainable level 🤓

@rgaudin (Member) commented May 21, 2024

Yes, as discussed separately; it's really important that those switches are properly documented so we can also leverage them on the hotspot.

@kelson42 (Contributor, Author)

See also #170

@kelson42 (Contributor, Author)

@rgaudin @benoit74 I believe we might run a performance-push task force around kiwix-serve to tackle these kinds of problems. It might actually be a hackathon topic.

@rgaudin (Member) commented May 27, 2024

[Screenshot: 2024-05-27 at 08:08]

Twice this week, the GH action that runs at 8am UTC failed: on May 25th and on May 27th. In both cases I get a read timeout (5s) on the test, but the service is running, has not restarted, and is not close to the RAM limit in the graph.
Testing some random ZIM/content soon after one failure worked OK. Maybe some requests from the tests (all are catalog-related) are difficult to answer within 5s under certain circumstances…
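
For reference, a minimal sketch of reproducing the failing check by hand with the same 5s budget; the exact catalog URL the workflow hits is an assumption (the OPDS root is used here):

```python
# Sketch: hit the catalog with a 5-second timeout, as the 8am test does,
# and report how long each attempt takes.
import time
import requests

URL = "https://dev.library.kiwix.org/catalog/v2/root.xml"  # assumed endpoint

for attempt in range(5):
    start = time.monotonic()
    try:
        r = requests.get(URL, timeout=5)
        print(f"attempt {attempt}: HTTP {r.status_code} in {time.monotonic() - start:.2f}s")
    except requests.exceptions.Timeout:
        print(f"attempt {attempt}: timed out after {time.monotonic() - start:.2f}s")
    time.sleep(1)
```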

@benoit74 (Collaborator)

As discussed yesterday, we all consider it is now time to move to a plan B, but this plan is still unclear.

From my perspective, the experience with dev.library.kiwix.org is way better than before the RAM increase, but it is still not satisfactory, i.e. there are still some slowdowns.

After some thought, I wonder if these slowdowns are not just linked to IO issues on the disk. On production, these issues could be hidden by the Varnish cache, which is expected to be especially efficient on the catalog and hence wouldn't trigger problems on the 8am UTC tests.
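
One way to probe the IO hypothesis directly, without adding any cache, would be to time the same catalog request twice in a row: if the warm repeat (served from the OS page cache) is dramatically faster than the cold first hit, disk IO is a plausible contributor. A rough sketch, with a placeholder endpoint:

```python
# Sketch: compare a cold request (data may come from disk) with an immediate
# warm repeat (data should be in the page cache).
import time
import requests

def timed_get(url: str) -> float:
    start = time.monotonic()
    requests.get(url, timeout=30)
    return time.monotonic() - start

url = "https://dev.library.kiwix.org/catalog/v2/entries"  # placeholder endpoint
cold = timed_get(url)
warm = timed_get(url)
print(f"cold={cold:.2f}s warm={warm:.2f}s")
```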

How easy would it be to put a Varnish cache in front of dev.library.kiwix.org as well? It looks pretty straightforward to me, and while it is clearly a flight forward, it would help to confirm that the problem is most probably IO-related.

@rgaudin (Member) commented May 28, 2024

> it would help to confirm that the problem is most probably IO-related.

Absolutely not. It will hide everything but the first request to a resource. It would be a good measure to improve the service for users, but it will not help (on the contrary) with finding the actual cause(s) behind this.

I am still awaiting an update clarifying the role of dev.library. We had a lot of discussions about this when we started it but it seems to have shifted.

Currently this is an internal testing tool:

  • for the ZIM content team to validate their WIP recipes
  • to test nightly kiwix-serve via many eyes on a live scenario.

I understand we are now sending users/clients links to dev.library. That's the role of a staging library.

Do we want prod/staging/dev? Or just prod/staging?

@kelson42 (Contributor, Author)

> Currently this is an internal testing tool:

Yes, and we can see that it might quickly become challenging to change its scope. Therefore I have opened a dedicated issue to think about our requirements out of the box. See #199

@rgaudin (Member) commented Jun 11, 2024

Still fails every day (timeout)

[Screenshot: 2024-06-11 at 08:06]

Not related to resources apparently (restarts at 04:00)

[Screenshot: 2024-06-11 at 08:08]

@benoit74 (Collaborator) commented Oct 4, 2024

Is there still a problem or shall we close this?

@rgaudin (Member) commented Oct 4, 2024

Workflows haven't failed since we changed the HW.
