
Investigate potential memory leak #522

Open
muety opened this issue Sep 19, 2023 · 4 comments
muety commented Sep 19, 2023


The plot shows memory usage (actual, not virtual) of the wakapi process on wakapi.dev over 7 days, with a Prometheus scrape interval of 5 minutes. We can see how memory accumulates during the day and then suddenly drops precisely at 02:20 am (or within the 5 minutes before that, given the scrape interval). The default for WAKAPI_AGGREGATION_TIME is 02:15 am. To check whether it's related, I changed the aggregation time to 01:45 am on Sep 19, and we can observe the drop happening around 01:50 am now. So apparently running the daily aggregation job causes memory to be freed. Let's investigate why that is and why memory accumulates during the day in the first place.

@muety muety self-assigned this Sep 19, 2023
@muety muety added the help wanted Extra attention is needed label Sep 19, 2023

muety commented Sep 20, 2023

Found the cause.

Every time a user requests a summary for a relative time interval (such as today, past_6_months, etc.), the summary's to parameter (sometimes also from) will be time.Now(), which results in a new cache key and thus a new cache entry here. Once the nightly aggregation job runs, new summaries are inserted, as a side effect of which the user's cache is flushed - even before its normal expiration after 24 hours.

That is, the more often you hit refresh on the dashboard page, the faster memory fills up. This way, you could technically bring Wakapi down 😉.

Options to fix this include:

  • Don't cache summaries for relative time intervals. E.g. skip the cache if to.Second() != 0, i.e. the summary likely doesn't correspond to a clear-cut date (such as midnight) and thus will most likely never be retrieved again anyway.
  • Drop the cache in the first place, as it's only useful for "absolute" time intervals (such as yesterday) anyway, and retrieving a summary isn't terribly expensive since we introduced the concept of durations.
  • Just ignore it for now, as it doesn't really hurt

Cool side-effect of this investigation: I learned a lot about profiling in Go and added a profiler web endpoint to Wakapi.

@muety muety closed this as completed in 37ac895 Sep 20, 2023

muety commented Sep 21, 2023

As expected, the pattern can still be observed after my latest commit; however, the delta is a lot smaller now (~170 MB as opposed to ~450 MB before). We could get around this completely by disabling the cache (as discussed above), but I don't think it hurts much the way it is now.


muety commented Jan 2, 2024

Found a new potential memory leak. Became aware of this because Wakapi has occasionally been OOM-killed lately.


@muety muety reopened this Jan 2, 2024
@muety muety closed this as completed Feb 22, 2024

muety commented Oct 20, 2024

Reopening this, as Wakapi is still getting OOM-killed every now and then. It currently runs with GOMEMLIMIT=3000MiB on a machine with 8 GB of memory. From a first peek, I can't see any tremendous increase in used heap space shortly before it gets killed, but I'll have to investigate further.

@muety muety reopened this Oct 20, 2024