Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix memory leaks when the circular buffer is full #1521

Closed
jrcheli opened this issue Jun 12, 2023 · 3 comments · Fixed by #1551
Closed

Fix memory leaks when the circular buffer is full #1521

jrcheli opened this issue Jun 12, 2023 · 3 comments · Fixed by #1551
Milestone

Comments

@jrcheli
Copy link
Contributor

jrcheli commented Jun 12, 2023

When scoping nginx, and we don't have a connection to send events out, the libraries' circular buffer will eventually fill up.
When this happens, we free events but not the event data they contain. Hitting nginx with "ab" is an easy way to put the library in this stressed position where we leak memory.

@jrcheli
Copy link
Contributor Author

jrcheli commented Jun 14, 2023

1)

If the event circular buffer fills up, then we're not fully freeing the events that are getting dropped on the datapath side. I have a fix for this on the bug/1521-mem-leaks branch.

2)

On the reporting side, we A) read from the event circular buffer, and B) generate both events and metrics, and C) send both events and metrics. Right or wrong, I expected that things would stay on the event circular buffer until an event destination is connected, but this isn't true. If events and metrics are routed to different locations and the metric destination is configured to be udp, we'll read from the event circular buffer for the sake of metrics. From my perspective our behavior implies that we've decided that metric data is more important than event data. Right or wrong, I feel like the opposite is true. I don't have a suggestion here, except to double check that we want the behavior the way we have it now.

3)

We have one thread reading off of the event circular buffer on the reporting side. If A) the datapath tries to put things on this queue faster than the reporting side can consume them, and B) the event circular buffer fills up then we find ourselves in an ugly state, particularly w.r.t. http. Why http? It's unique in that we try to keep some state from an http request that is only cleaned up when we marry it to it's paired http response. (g_maplist)

Say we have a sequence events: http request A, http response A, http request B, http response B. If the event circular buffer was full and the reporting thread wasn't actively trying to read, all four of these would end up on the floor. Ideally we wouldn't lose data, but this scenario isn't the ugly one. Since we lost both each request and it's paired response, we won't accumulate cruft on the reporting side.

Where it gets ugly is when the circular buffer is full, but the reporting thread consumes events so that there are gaps when the datapath intermittently can add one event to the event buffer but not the next. In this state the event circular buffer can end up with http requests without their corresponding responses, and responses without their corresponding requests. When this happens we'll accumulate cruft. Stuff allocated and added to the g_maplist structure and never freed, causing a second kind of leak. Possible improvements that can fix half of the problem for http1... on the datapath side if we can't add an http request to the circular buffer, don't even attempt to add the associated http response. Somewhat redundant with this, on the reporting side, don't create a map object for http responses, only requests

@jrcheli
Copy link
Contributor Author

jrcheli commented Jun 16, 2023

With the first commit, made sure we're doing the same free logic when the event circular buffer is full on the datapath side as when we're processing events on the reporting side.
Testing for this:

cd test/integration
make nginx-shell
unset SCOPE_EVENT_DEST SCOPE_METRIC_DEST
export SCOPE_EVENT_LOGFILE=false SCOPE_EVENT_CONSOLE=false SCOPE_EVENT_METRIC=false
apt update && apt install -y apache2-utils socat
scope -z nginx && top
In another terminal in the container:
ab -n 1000000 -c 3 http://localhost:80/

using top, observed peak RSS for nginx go from 8.5 G (before this commit) to .5G (after this commit) 

With the second commit, updated Makefille so unit tests can build successfully.
With the third commit, allowed http.resp events to be created even if a map object can not be found. Also moved a couple of potentially large allocations from the stack to heap.
Testing for this:

o) First, reran the testing for the first commit to verify that this commit does not have an effect on memory usage.  (no worse)

o) Second, reran the testing for the first commit again, but running this socat command in a third terminal into the container before starting nginx so the contents of the data could be verified.  `socat - TCP-LISTEN:9109,reuseaddr,fork | grep http.resp`.  What I was looking for is that some some http.resp events contain "target", while others do not.  ("target" is from the request).

o) Third, to verify that http metrics (always aggregated) still work...  I reran the testing for the first commit, but added some environment variables to nginx, and ran with a differently filtered socat command so I could look at metric output.  I was looking for any metric output from socat.
export SCOPE_CRIBL_ENABLE=true
export SCOPE_CRIBL=tcp://localhost:9109
Using `socat - TCP-LISTEN:9109,reuseaddr,fork | grep http | grep "\"type\":\"metric\""`

With the fourth commit, added code to break out of the doEvent() loop so other processing can not get starved indefinitely. Also refined logic to make sure that we are not processing http protocol data 1) if SCOPE_EVENT_ENABLE=false and SCOPE_EVENT_HTTP=true or SCOPE_METRIC_ENABLE=false and SCOPE_METRIC_HTTP=true. (Basically the "master enable" for events and metrics was being ignored before. Now it will be honored.
Testing for this:

cd test/integration
make nginx-shell
export SCOPE_EVENT_LOGFILE=false SCOPE_EVENT_CONSOLE=false SCOPE_EVENT_METRIC=false
export SCOPE_CRIBL_ENABLE=true
export SCOPE_CRIBL=tcp://localhost:9109
apt update && apt install -y socat
scope -z top -d 0.01

In a second terminal in the container, after top's RES memory seems to stop climbing, 
socat - TCP-LISTEN:9109,reuseaddr,fork 

Then in a third terminal into the container, as soon as socat starts receiving data, repeatedly run:
scope inspect `pidof top`

The thing we're trying to observe is that the scope inspect command continues to work without too many timeouts.  I think some timeouts are acceptable, but ideally we wouldn't see many of them.

@jrcheli
Copy link
Contributor Author

jrcheli commented Jun 26, 2023

Just to document the final state of #1521 (comment) above...

  1. Addressed by PR Fix mem leaks seen with http data #1551
  2. No action needed. It's really just an open ended question that doesn't require any change at this time.
  3. Will be addressed by Create ceiling for memory usage w.r.t. http requests and responses #1557

@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in AppScope 1.4.0 Release (Summer 23) Jun 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: Done
Development

Successfully merging a pull request may close this issue.

1 participant